
Posted to general@lucene.apache.org by Derek Poh <dp...@globalsources.com> on 2014/06/09 09:00:48 UTC

Does solr 4.8.1 support these features?

My company is actively looking at alternative search engine applications 
to replace our current Endeca application.

I have no experience with or knowledge of Solr and Lucene.
Please bear with me; I would like to find out if the following features
are available in Solr.

1. Aggregate results (rollups).
E.g. From a list of product search results (each has a supplier id
field), can the results be aggregated by supplier id with the original
result ordering retained?

2. Filter/Navigator, counts.
List a field's possible values and their counts, both from the indexed
data and from the returned results.
The field's values can be sorted by value description or by value count
in the returned results.

E.g. the field "Business Type" below with its possible values and the
count for each value (in brackets). Can the field be returned in the
result with its values sorted either by description or by count?
Business Type
     Manufacturer (15269)
     Exporter (12493)
     Trading Company (5541)
     Agent (1324)
     Wholesaler (1202)
     Importer (682)
     Buying Office (394)
     Distributor (278)
     Other (157)
     Retailer (116)
     Consultant (54)

3. Configure and define the relevance ranking and matching logic of the
returned results.

4. Define and configure the thesaurus (1-way or 2-way), stemming and
stop words.

5. Multi-language support for Simplified Chinese and Spanish.

6. Scalability.
At present we are indexing 4 million records, and the number is expected
to increase more than tenfold in the near future.

7. Search results debugging. E.g. why a record was matched or why a
record was ranked as it was.

Derek

----------------------
CONFIDENTIALITY NOTICE 

This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part. 

This e-mail and any reply to it may be monitored for security, legal, regulatory compliance and/or other appropriate reasons.

Re: Does solr 4.8.1 support these features?

Posted by Derek Poh <dp...@globalsources.com>.

Heh, no worries, I am glad to have feedback on my queries.

Will check out the reference links.

Still trying to understand how different Solr (structure and design) is
compared to our current search application.


On 6/11/2014 12:53 AM, Phanindra R wrote:
> I don't mean to hijack.
>
> Yes, there are two ways.
>
> 1) Index-time field boosting: Please note that it is like hard-coding
> those boosts into the index. If you want to change boosting for a field,
> you will have to re-index.
>
> 2) Query-time (field-level) boosting: This is more flexible and achieves
> exactly the same as above. I don't think it introduces any significant
> performance impact.
>
>      When it comes to Lucene/Solr, you always specify the field name along
> with the keyword as in fieldName:keyword(s), which is the atomic unit
> that's searchable in Lucene/Solr. In this case, you just have to provide
> the boost as well, as shown in the following links.
>
> References
> http://www.solrtutorial.com/solr-search-relevancy.html
> https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
>
>
>
>
> On Tue, Jun 10, 2014 at 12:26 AM, Derek Poh <dp...@globalsources.com> wrote:
>
>> Hi Mark
>>
>> Appreciate you taking the time to reply and with references.
>>
>> Regarding 3. Configure and define the relevance ranking and matching
>> logic of the returned results.
>>
>> Can each search handler be configured to
>> - search on a few fields
>> - assign a numeric rank to each field, such that a match on the field
>> with the highest rank will rank the document higher in the returned search
>> results.
>> - the ranking of each field will also act as a tie-breaker.
>> E.g.
>> Category = 3
>> SPPKeyWord = 2
>> KeySpecification = 1
>>
>> A document that matches on field Category will be ranked higher in the
>> results than a document that matches on SPPKeyWord.
>> A document that matches only on field KeySpecification will rank the
>> lowest in the results.
>>
>>
>>
>> On 6/10/2014 12:27 AM, Mark Bennett wrote:
>>
>>> Hello Derek,
>>>
>>> See answers inline.
>>>
>>> --
>>> Mark Bennett / LucidWorks: Search & Big Data /
>>> mark.bennett@lucidworks.com
>>> Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
>>>
>>> On Jun 9, 2014, at 12:00 AM, Derek Poh <dp...@globalsources.com> wrote:
>>>
>>>> My company is actively looking at alternative search engine applications
>>>> to replace our current Endeca application.
>>>>
>>>> I have no experience with or knowledge of Solr and Lucene.
>>>> Please bear with me; I would like to find out if the following features
>>>> are available in Solr.
>>>>
>>>> 1. Aggregate results (rollups).
>>>> E.g. From a list of product search results (each has a supplier id
>>>> field), can the results be aggregated by supplier id with the original
>>>> result ordering retained?
>>>>
>>> Yes it can:
>>> http://wiki.apache.org/solr/FieldCollapsing
>>>
>>>> 2. Filter/Navigator, counts.
>>>> List a field's possible values and their counts, both from the indexed
>>>> data and from the returned results.
>>>> The field's values can be sorted by value description or by value count
>>>> in the returned results.
>>>>
>>> Yes, Solr calls these "Facets" and offers several types:
>>> http://wiki.apache.org/solr/SimpleFacetParameters
>>> http://wiki.apache.org/solr/HierarchicalFaceting
>>>
>>>> E.g. the field "Business Type" below with its possible values and the
>>>> count for each value (in brackets). Can the field be returned in the
>>>> result with its values sorted either by description or by count?
>>>> Business Type
>>>>      Manufacturer (15269)
>>>>      Exporter (12493)
>>>>      Trading Company (5541)
>>>>      Agent (1324)
>>>>      Wholesaler (1202)
>>>>      Importer (682)
>>>>      Buying Office (394)
>>>>      Distributor (278)
>>>>      Other (157)
>>>>      Retailer (116)
>>>>      Consultant (54)
>>>>
>>> Absolutely, and Solr is very fast and accurate.
>>>
>>>> 3. Configure and define the relevance ranking and matching logic of the
>>>> returned results.
>>>>
>>> Yes, though not by that name.
>>> Step 1:
>>> Configure default edismax parameters in your solrconfig.xml
>>>
>>> Step 2:
>>> Create additional search handlers in solrconfig.xml, and each search
>>> handler can have its own edismax configuration.
>>>
>>> Normally the format of the search URL is:
>>>       http://localhost:8983/solr/collection_name/select?q=text:budget
>>>
>>> You would replace the "select" with the name of the search handler that
>>> has the edismax config you want.
>>>
>>> With multiple search handlers, you'd use something like:
>>>       http://localhost:8983/solr/collection_name/search_freshest?q=text:budget
>>>       http://localhost:8983/solr/collection_name/search_most_popular?q=text:budget
>>>
>>>> 4. Define and configure the thesaurus (1-way or 2-way), stemming and
>>>> stop words.
>>>>
>>> Yes, Solr is very good about this, you have both options.
>>>
>>> Also, Solr lets you choose:
>>> * Index time, or query time, or both
>>> * Use expansion or reduction
>>>
>>> You can even have more than one thesaurus file and have them each handled
>>> differently.
>>>
>>> For example:
>>> * Use an english_language thesaurus, which rarely changes, and expand
>>> that at index time
>>> * Use your company_synonyms, which may change frequently, and expand them
>>> at search time.
>>>
>>> I'll let you find these in the wiki, http://wiki.apache.org
>>>
>>>> 5. Multi-language support for Simplified Chinese and Spanish.
>>>>
>>> Yes!
>>>
>>> And for simplified Chinese, please make sure to use the SmartCN analyzer,
>>> and not the simplistic "CJK"; SmartCN actually looks for Chinese language
>>> word breaks using statistical methods, and therefore should give better
>>> results.
>>>
>>>> 6. Scalability.
>>>> At present we are indexing 4 million records, and the number is expected
>>>> to increase more than tenfold in the near future.
>>>>
>>> 40 million documents can normally be handled on a single machine,
>>> assuming it has enough RAM and doesn't have a lot of other stuff running.
>>> You might want a second machine for failover.
>>>
>>> When people use multiple machines, then the way to do that is via
>>> SolrCloud.
>>>
>>>> 7. Search results debugging. E.g. why a record was matched or why a
>>>> record was ranked as it was.
>>>>
>>> Yes.
>>>
>>> You typically add &debugQuery=true&debug.explain.structured=true to the
>>> URL.
>>>
>>> The output is a bit technical, it takes some practice to understand.
>>>
>>> There's also a graphical relevancy debugger with a free eval period:
>>> http://www.lucidworks.com/market_app/lucidworks-relevancy-workbench/
>>>
>>>> Derek
>>>>
>>>
>>>
>>
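
For illustration, items 1, 2 and 7 in the quoted reply (rollups, facets and debugging) all come down to extra request parameters on an ordinary search URL. The following is a minimal Python sketch and was not posted in the thread; it only builds the URL and does not contact a server. The field names supplier_id and business_type are assumptions made up for the example, and the base URL is the same placeholder used in the reply.

```python
from urllib.parse import urlencode

# Placeholder base URL, matching the example URLs in the quoted reply.
SOLR_BASE = "http://localhost:8983/solr/collection_name"


def build_select_url(user_query, group_field=None, facet_field=None,
                     facet_sort="count", debug=False):
    """Return a /select URL with optional grouping, faceting and debug output."""
    if facet_sort not in ("count", "index"):
        raise ValueError("facet_sort must be 'count' or 'index'")

    params = [("q", user_query), ("wt", "json")]

    if group_field:
        # Item 1 (rollups): collapse hits into one group per field value.
        params += [("group", "true"),
                   ("group.field", group_field),
                   ("group.limit", "3")]  # documents returned per group

    if facet_field:
        # Item 2 (navigators): value counts for the current result set,
        # sorted by count or by index (lexicographic) order.
        params += [("facet", "true"),
                   ("facet.field", facet_field),
                   ("facet.sort", facet_sort),
                   ("facet.mincount", "1")]

    if debug:
        # Item 7: ask Solr to explain why each document matched and scored.
        params += [("debugQuery", "true"),
                   ("debug.explain.structured", "true")]

    return SOLR_BASE + "/select?" + urlencode(params)


if __name__ == "__main__":
    # supplier_id and business_type are hypothetical field names.
    print(build_select_url("text:budget", group_field="supplier_id",
                           facet_field="business_type", facet_sort="index",
                           debug=True))
```

By default the groups come back ordered by their best-ranked document, which is close to "original ordering retained". Facet counts apply to the documents matching the query, so a match-all query such as q=*:* would give counts over the whole index.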



Re: Does solr 4.8.1 support these features?

Posted by Phanindra R <ph...@gmail.com>.

I don't mean to hijack.

Yes, there are two ways.

1) Index-time field boosting: Please note that it is like hard-coding
those boosts into the index. If you want to change boosting for a field,
you will have to re-index.

2) Query-time (field-level) boosting: This is more flexible and achieves
exactly the same as above. I don't think it introduces any significant
performance impact.

    When it comes to Lucene/Solr, you always specify the field name along
with the keyword as in fieldName:keyword(s), which is the atomic unit
that's searchable in Lucene/Solr. In this case, you just have to provide
the boost as well, as shown in the following links.

References
http://www.solrtutorial.com/solr-search-relevancy.html
https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
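
As a rough illustration of query-time boosting, here is a minimal Python sketch (not part of the original post) that applies the weights from Derek's example, Category = 3, SPPKeyWord = 2 and KeySpecification = 1. It shows the same weights twice: as an edismax qf parameter, and in the fieldName:keyword^boost form described above. The base URL is the placeholder used elsewhere in the thread, and the tie value is an assumption.

```python
from urllib.parse import urlencode

# Placeholder base URL, as used in the example URLs elsewhere in the thread.
SOLR_BASE = "http://localhost:8983/solr/collection_name"

# Field weights taken from Derek's example.
FIELD_WEIGHTS = {"Category": 3, "SPPKeyWord": 2, "KeySpecification": 1}


def edismax_url(user_query, weights=None, tie=0.0):
    """Build a /select URL that searches several fields with per-field boosts."""
    weights = FIELD_WEIGHTS if weights is None else weights
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in weights.items())
    params = [
        ("q", user_query),
        ("defType", "edismax"),
        ("qf", qf),
        # tie controls how much the lower-scoring fields add to the
        # best-scoring field; 0.0 means only the best field counts.
        ("tie", str(tie)),
        ("wt", "json"),
    ]
    return SOLR_BASE + "/select?" + urlencode(params)


def boosted_lucene_query(keyword, weights=None):
    """Express the same weights in the standard fieldName:keyword^boost syntax."""
    weights = FIELD_WEIGHTS if weights is None else weights
    return " OR ".join(f"{field}:{keyword}^{boost}"
                       for field, boost in weights.items())


if __name__ == "__main__":
    print(edismax_url("budget"))
    print(boosted_lucene_query("budget"))
```

Boosts are multipliers on the relevance score rather than strict ranks, so a Category match will usually, but not necessarily always, outrank a match on a lower-weighted field. Widening the gaps between the weights makes the ordering stricter.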





Re: Does solr 4.8.1 support these features?

Posted by Mark Bennett <ma...@lucidworks.com>.

Derek,

Yes, you have several options.

1: You can maintain the 3 separate indexes, each of which Solr would typically call a "collection".

2: You could also combine the data into one larger collection and use a field to filter on.

3: A third option is to keep them separate (as in 1), but if you occasionally want to search all 3 you can do that as well from a single search with the collection= parameter.  Or if using SolrCloud you can also create a collection alias.  So this way you can easily search just 1 collection, or all 3, by changing just 1 parameter.
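
To make option 3 concrete, here is a minimal Python sketch of how the request might differ between searching one collection and searching all three. It assumes SolrCloud, where the collection parameter accepts a comma-separated list; the collection names products, suppliers and categories are made up for the example, and the host is a placeholder.

```python
from urllib.parse import urlencode

SOLR_HOST = "http://localhost:8983/solr"  # placeholder host

# Hypothetical collection names, one per index described by Derek.
COLLECTIONS = ("products", "suppliers", "categories")


def search_url(user_query, primary, also_search=()):
    """Build a /select URL against one collection, optionally spanning others."""
    params = [("q", user_query), ("wt", "json")]
    targets = [primary, *also_search]
    if len(targets) > 1:
        # Assumes SolrCloud: list every collection the query should cover.
        params.append(("collection", ",".join(targets)))
    return f"{SOLR_HOST}/{primary}/select?" + urlencode(params)


if __name__ == "__main__":
    # Product search only.
    print(search_url("budget", "products"))
    # One search across all three collections.
    print(search_url("budget", COLLECTIONS[0], COLLECTIONS[1:]))
```

Searching across collections works best when they share the fields being queried and a common unique key.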

--
Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513

On Jun 10, 2014, at 9:03 PM, Derek Poh <dp...@globalsources.com> wrote:

> Mark
> 
> Looks like "edismax" supports it, will read more on it.
> 
> On our current search application, we have a couple of indexes, each on specific types of data.
> E.g. 1 index of product data, 1 index of supplier data, 1 index of category data.
> We query against each index for different searches (like product search or supplier search).
> It is commonly referred to as an application/pipeline in Endeca.
> 
> Does Solr support such a setup?
> 
> 
> On 6/11/2014 6:23 AM, Mark Bennett wrote:
>> Derek,
>> 
>> The "edismax" parser is pretty amazing.  If I understand your questions, I think the answer is yes.
>> 
>> When people tune relevancy sometimes they apply very strong rules, they "yell" at the engine.  But it sounds like you already have a good instinct, to "whisper" at Relevancy, at least at the start, and to think in terms of tie breakers.
>> 
>> When you specify the fields that edismax is to search, you can give each of them a different weight.  I think this will do most of what you want.
>> 
>> Whether matches are combined via addition or multiplication can be controlled with different options in edismax, although sometimes you have to do a bit of reading and experimenting.
>> 
>> Another trick that I sometimes use is to use copyField so that the same field is indexed several different ways.  Then the field indexed for exact matches is given a weight of 1.0, while the field indexed for "fuzzy" matches (for example with synonyms / thesaurus) is given a weight of only 0.5 or 0.3.
>> 
>> --
>> Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
>> Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
>> 

Re: Does solr 4.8.1 support these features?

Posted by Derek Poh <dp...@globalsources.com>.

Mark

Looks like "edismax" supports it, will read more on it.

On our current search application, we have a couple of indexes, each on
specific types of data.
E.g. 1 index of product data, 1 index of supplier data, 1 index of
category data.
We query against each index for different searches (like product search
or supplier search).
It is commonly referred to as an application/pipeline in Endeca.

Does Solr support such a setup?


On 6/11/2014 6:23 AM, Mark Bennett wrote:
> Derek,
>
> The "edismax" parser is pretty amazing.  If I understand your questions, I think the answer is yes.
>
> When people tune relevancy sometimes they apply very strong rules, they "yell" at the engine.  But it sounds like you already have a good instinct, to "whisper" at Relevancy, at least at the start, and to think in terms of tie breakers.
>
> When you specify the fields that edismax is to search, you can give each of them a different weight.  I think this will do most of what you want.
>
> Whether matches are combined via addition or multiplication can be controlled with different options in edismax, although sometimes you have to do a bit of reading and experimenting.
>
> Another trick that I sometimes use is to use copyField so that the same field is indexed several different ways.  Then the field indexed for exact matches is given a weight of 1.0, while the field indexed for "fuzzy" matches (for example with synonyms / thesaurus) is given a weight of only 0.5 or 0.3.
>
> --
> Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
> Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
>
> On Jun 10, 2014, at 12:26 AM, Derek Poh <dp...@globalsources.com> wrote:
>
>> Hi Mark
>>
>> Appreciate you taking the time to reply and with references.
>>
>> Regarding 3. Configure and define the relevance ranking and matching logic of the returned results.
>>
>> Can each search handler be configured to
>> - search on a few fields
>> - assign a numeric rank to each field, such that a match on the field with the highest rank will rank the document higher in the returned search results.
>> - the ranking of each field will also act as a tie-breaker.
>> E.g.
>> Category = 3
>> SPPKeyWord = 2
>> KeySpecification = 1
>>
>> A document that matches on field Category will be ranked higher in the results than a document that matches on SPPKeyWord.
>> A document that matches only on field KeySpecification will rank the lowest in the results.
>>
>>
>> On 6/10/2014 12:27 AM, Mark Bennett wrote:
>>> Hello Derek,
>>>
>>> See answers inline.
>>>
>>> --
>>> Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
>>> Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
>>>
>>> On Jun 9, 2014, at 12:00 AM, Derek Poh <dp...@globalsources.com> wrote:
>>>
>>>> My company is actively looking at alternative search engine applications to replace our current Endeca application.
>>>>
>>>> I have no experience with or knowledge of Solr and Lucene.
>>>> Please bear with me; I would like to find out if the following features are available in Solr.
>>>>
>>>> 1. Aggregate results (rollups).
>>>> E.g. From a list of product search results (each has a supplier id field), can the results be aggregated by supplier id with the original result ordering retained?
>>> Yes it can:
>>> http://wiki.apache.org/solr/FieldCollapsing
>>>
>>>> 2. Filter/Navigator, counts.
>>>> List a field's possible values and their counts, both from the indexed data and from the returned results.
>>>> The field's values can be sorted by value description or by value count in the returned results.
>>> Yes, Solr calls these "Facets" and offers several types:
>>> http://wiki.apache.org/solr/SimpleFacetParameters
>>> http://wiki.apache.org/solr/HierarchicalFaceting
>>>
>>>> Eg. Field "Business Type" belowwith it's possible values and the count for each value(in bracket). Can the field be return in the result with it's values sorted either by description or bycounts?
>>>> Business Type
>>>> Manufacturer (15269)
>>>>     Exporter (12493)
>>>>     Trading Company (5541)
>>>>     Agent (1324)
>>>>     Wholesaler (1202)
>>>>     Importer (682)
>>>>     Buying Office (394)
>>>> Distributor (278)
>>>>     Other (157)
>>>>     Retailer (116)
>>>>     Consultant (54)
>>> Absolutely, and Solr is very fast and accurate.
>>>
>>>> 3. Configureand defined the relevance rankingand matching logic of the return result.
>>> Yes, though not by that name.
>>> Step 1:
>>> Configure default edismax parameters in your solrconfig.xml
>>>
>>> Step 2:
>>> Create additional search handlers in solrconfig.xml, and each search handler can have its own edismax configuration.
>>>
>>> Normally the format of the search URL is:
>>>      http://localhost:8983/solr/collection_name/select?q=text:budget
>>>
>>> You would replace the "select" with the name of the search handler that has the edismax config you want.
>>>
>>> With multiple search handlers, you'd use something like:
>>>      http://localhost:8983/solr/collection_name/search_freshest?q=text:budget
>>>      http://localhost:8983/solr/collection_name/search_most_popular?q=text:budget
>>>
>>>> 4. Defined and configure the thesaurus (1-wayor 2-way), stemming and stop words.
>>> Yes, Solr is very good about this, you have both options.
>>>
>>> Also, Solr let's you choose:
>>> * Index time, or query time, or both
>>> * Use expansion or reduction
>>>
>>> You can even have more than one thesaurus file and have them each handled differently.
>>>
>>> For example:
>>> * Use an english_language thesaurus, which rarely changes, and expand that at index time
>>> * Use your company_synonyms, which may change frequently, and expand them at search time.
>>>
>>> I'll let you find these in the wiki, http://wiki.apache.org
>>>
>>>> 5. Multi-language supportfor Simplified Chinese and Spanish.
>>> Yes!
>>>
>>> And for simplified Chinese, please make sure to use the SmartCN analyzer, and not the simplistic "CJK"; SmartCN actually looks for Chinese language word breaks using statistical methods, and therefore should give better results.
>>>
>>>> 6. Scalability.
>>>> At present, we are indexing 4million recordsand the number is expected to increase by more than 10 folds in the near future.
>>> 40 million documents can normally be handled on a single machine, assuming it has enough RAM and doesn't have a lot of other stuff running.
>>> You might want a second machine for failover.
>>>
>>> When people use multiple machines, then the way to do that is via SolrCloud.
>>>
>>>> 7. Search results debugging. Eg. why record was matchedor why record was ranked as such.
>>> Yes.
>>>
>>> You typically add &debugQuery=true&debug.explain.structured=true to the URL.
>>>
>>> The output is a bit technical, it takes some practice to understand.
>>>
>>> There's also a graphical relevancy debugger with a free eval period:
>>> http://www.lucidworks.com/market_app/lucidworks-relevancy-workbench/
>>>
>>>> Derek
>>>>
>>>> ----------------------
>>>> CONFIDENTIALITY NOTICE
>>>> This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part.
>>>> This e-mail and any reply to it may be monitored for security, legal, regulatory compliance and/or other appropriate reasons.
>>>
>>
>> ----------------------
>> CONFIDENTIALITY NOTICE
>> This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part.
>> This e-mail and any reply to it may be monitored for security, legal, regulatory compliance and/or other appropriate reasons.
>
>



Re: Does solr 4.8.1 support these features?

Posted by Mark Bennett <ma...@lucidworks.com>.

Derek,

The "edismax" parser is pretty amazing.  If I understand your questions, I think the answer is yes.

When people tune relevancy, they sometimes apply very strong rules; they "yell" at the engine.  But it sounds like you already have a good instinct: to "whisper" at relevancy, at least at the start, and to think in terms of tie-breakers.

When you specify the fields that edismax is to search, you can give each of them a different weight.  I think this will do most of what you want.

Whether matches are combined via addition or multiplication can be controlled with different options in edismax, although sometimes you have to do a bit of reading and experimenting.

Another trick that I sometimes use is copyField, so that the same content is indexed several different ways.  Then the field indexed for exact matches is given a weight of 1.0, while the field indexed for "fuzzy" matches (for example, with synonyms / a thesaurus) is given a weight of only 0.5 or 0.3.
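
To make the weighting idea concrete, here is a minimal Python sketch that builds an edismax request with per-field boosts through the qf parameter. The host, collection name and handler path are placeholders, and the field names and weights are simply the ones from the question. Boosts scale each field's score rather than imposing strict tiers, so treat the numbers as a starting point to experiment with.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); adjust host, collection and handler.
BASE = "http://localhost:8983/solr/collection_name/select"

def edismax_url(user_query, field_weights, tie=0.1):
    """Build a query URL that searches several fields with per-field boosts."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_weights.items())
    params = {
        "defType": "edismax",  # select the extended dismax query parser
        "q": user_query,
        "qf": qf,
        # tie: 0.0 = only the best-matching field counts,
        #      1.0 = scores from all matching fields are summed.
        "tie": tie,
    }
    return f"{BASE}?{urlencode(params)}"

url = edismax_url(
    "budget",
    {"Category": 3, "SPPKeyWord": 2, "KeySpecification": 1},
)
print(url)
```

A small tie value keeps the highest-weighted matching field dominant while still letting matches in the other fields nudge the score, which is close to the tie-breaker behaviour described in the question.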

--
Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513


Re: Does solr 4.8.1 support these features?

Posted by Derek Poh <dp...@globalsources.com>.

Hi Mark

I appreciate you taking the time to reply, and with references.

Regarding 3. Configure and define the relevance ranking and matching
logic of the returned results.

Can each search handler be configured to
- search on a few fields
- assign a numeric rank to each of the fields, such that a match on the
field with the highest rank will rank the document higher in the returned
search results
- use the ranking of each field as a tie-breaker as well?
E.g.
Category = 3
SPPKeyWord = 2
KeySpecification = 1

A document that matches on field Category will be ranked higher in the
results than a document that matches on SPPKeyWord.
A document that matches only on field KeySpecification will rank the
lowest in the results.





Re: Does solr 4.8.1 support these features?

Posted by Mark Bennett <ma...@lucidworks.com>.

Hello Derek,

See answers inline.

--
Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513

On Jun 9, 2014, at 12:00 AM, Derek Poh <dp...@globalsources.com> wrote:

> My company is actively looking at alternative search engine applications to replace our current Endeca application.
> 
> I have no experience and knowledge on Solr and Lucene.
> Please bear with me, I would like to find out if the following features are available on Solr.
> 
> 1. Aggregate results (rollups).
> E.g. from a list of product search results (each with a supplier id field), can the results be aggregated by supplier id with the original result ordering retained?
Yes it can:
http://wiki.apache.org/solr/FieldCollapsing
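
As a rough illustration, a grouped (collapsed) query could be built as below. This is only a sketch: the field name supplier_id and the base URL are assumptions, not taken from your schema. The group, group.field and group.limit parameters are the ones described on the FieldCollapsing page.

```python
from urllib.parse import urlencode

def grouped_query_url(base_url, query, group_field, per_group=1):
    """Build a query URL that collapses results by the given field."""
    params = {
        "q": query,
        "group": "true",             # turn on result grouping
        "group.field": group_field,  # field whose values define the groups
        "group.limit": per_group,    # documents returned per group
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and field name, for illustration only.
group_url = grouped_query_url(
    "http://localhost:8983/solr/collection_name/select",
    "text:budget",
    "supplier_id",
)
print(group_url)
```

As far as I know, groups are ordered by their top-ranked document unless a different sort is requested, which should preserve the original relevance ordering across suppliers; it is worth confirming this on your own data.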

> 2. Filter/Navigator, counts.
> List out a field's possible values and their counts from the indexed data and from the returned results.
> The field's values can be sorted by the values' description or by the values' counts in the returned results.
Yes, Solr calls these "Facets" and offers several types:
http://wiki.apache.org/solr/SimpleFacetParameters
http://wiki.apache.org/solr/HierarchicalFaceting

> E.g. field "Business Type" below with its possible values and the count for each value (in brackets). Can the field be returned in the result with its values sorted either by description or by counts?
> Business Type
>    Manufacturer (15269)
>    Exporter (12493)
>    Trading Company (5541)
>    Agent (1324)
>    Wholesaler (1202)
>    Importer (682)
>    Buying Office (394)
>    Distributor (278)
>    Other (157)
>    Retailer (116)
>    Consultant (54)

Absolutely, and Solr is very fast and accurate.
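
For example, the two sort orders asked about map onto the facet.sort parameter: "count" orders values by how many matching documents they have, and "index" orders them by the indexed term. The sketch below is a hedged illustration; the field name business_type and the base URL are assumptions.

```python
from urllib.parse import urlencode

def facet_url(base_url, query, facet_field, sort_by="count"):
    """Build a query URL that returns value counts for one field."""
    if sort_by not in ("count", "index"):
        raise ValueError("sort_by must be 'count' or 'index'")
    params = {
        "q": query,
        "rows": 0,                   # return only the facet counts, no documents
        "facet": "true",             # turn on faceting
        "facet.field": facet_field,  # field whose values are counted
        "facet.sort": sort_by,       # "count" or "index" (by term)
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and field name, for illustration only.
base = "http://localhost:8983/solr/collection_name/select"
by_count = facet_url(base, "text:budget", "business_type", "count")
by_name = facet_url(base, "text:budget", "business_type", "index")
print(by_count)
print(by_name)
```

Counts are computed over the documents matching q, so a match-all query gives the counts for the whole index while a user query gives the counts for that result set.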

> 3. Configure and define the relevance ranking and matching logic of the returned results.
Yes, though not by that name.
Step 1:
Configure default edismax parameters in your solrconfig.xml.

Step 2:
Create additional search handlers in solrconfig.xml; each search handler can have its own edismax configuration.

Normally the format of the search URL is:
    http://localhost:8983/solr/collection_name/select?q=text:budget

You would replace the "select" with the name of the search handler that has the edismax config you want.

With multiple search handlers, you'd use something like:
    http://localhost:8983/solr/collection_name/search_freshest?q=text:budget
    http://localhost:8983/solr/collection_name/search_most_popular?q=text:budget
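
From client code, switching between handlers is then just a matter of changing one path segment. A minimal sketch, using the same placeholder host, collection and handler names as the URLs above:

```python
from urllib.parse import urlencode

def handler_url(handler, query,
                host="http://localhost:8983",
                collection="collection_name"):
    """Build a query URL for a named search handler."""
    # The handler name replaces "select" in the request path.
    return f"{host}/solr/{collection}/{handler}?{urlencode({'q': query})}"

freshest = handler_url("search_freshest", "text:budget")
popular = handler_url("search_most_popular", "text:budget")
print(freshest)
print(popular)
```

The colon in the query is percent-encoded by urlencode, which is the safer form when building URLs programmatically.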

> 4. Define and configure the thesaurus (1-way or 2-way), stemming and stop words.
Yes, Solr is very good about this; you have both options.

Also, Solr lets you choose:
* Index time, or query time, or both
* Use expansion or reduction

You can even have more than one thesaurus file and have them each handled differently.

For example:
* Use an english_language thesaurus, which rarely changes, and expand that at index time
* Use your company_synonyms, which may change frequently, and expand them at search time.

I'll let you find these in the wiki, http://wiki.apache.org
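
To illustrate the 1-way versus 2-way distinction, here is a toy Python model of the two rule styles used in a Solr synonyms file: a comma-separated group ("sofa, couch") is 2-way, and an arrow rule ("tv => television") is 1-way. This is only a conceptual sketch that handles single-word terms; it is not how Solr's synonym filter is implemented, and the sample rules are made up.

```python
def parse_rules(lines):
    """Return {term: [replacement terms]} from synonyms-file-style lines."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # 1-way: terms on the left are replaced by the terms on the right.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for term in left.split(","):
                if term.strip():
                    rules[term.strip()] = targets
        else:
            # 2-way: every term in the group expands to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                rules[term] = group
    return rules

def expand(tokens, rules):
    """Apply the rules to a list of tokens, leaving unknown tokens as-is."""
    out = []
    for tok in tokens:
        out.extend(rules.get(tok, [tok]))
    return out

rules = parse_rules(["sofa, couch", "tv => television"])
print(expand(["sofa", "tv", "stand"], rules))
print(expand(["television"], rules))
```

With these rules, "sofa" and "couch" each expand to both terms, while "tv" is rewritten to "television" but "television" is left alone.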

> 
> 5. Multi-language support for Simplified Chinese and Spanish.
Yes!

And for simplified Chinese, please make sure to use the SmartCN analyzer, and not the simplistic "CJK"; SmartCN actually looks for Chinese language word breaks using statistical methods, and therefore should give better results.

> 
> 6. Scalability.
> At present, we are indexing 4 million records and the number is expected to increase more than tenfold in the near future.
40 million documents can normally be handled on a single machine, assuming it has enough RAM and doesn't have a lot of other stuff running.
You might want a second machine for failover.

When people use multiple machines, then the way to do that is via SolrCloud.

> 7. Search results debugging. E.g. why a record was matched or why a record was ranked as such.
Yes.

You typically add &debugQuery=true&debug.explain.structured=true to the URL.

The output is a bit technical; it takes some practice to understand.
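
A small sketch of adding those two debug parameters to any query URL from Python; the base URL is a placeholder:

```python
from urllib.parse import urlencode

def debug_url(base_url, query):
    """Build a query URL that also asks for scoring explanations."""
    params = {
        "q": query,
        "debugQuery": "true",                # include debug/explain output
        "debug.explain.structured": "true",  # explanations as nested data
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint, for illustration only.
dbg = debug_url("http://localhost:8983/solr/collection_name/select", "text:budget")
print(dbg)
```

The structured form is usually easier to walk programmatically than the default plain-text explanation.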

There's also a graphical relevancy debugger with a free eval period:
http://www.lucidworks.com/market_app/lucidworks-relevancy-workbench/

> 
> Derek