Posted to solr-user@lucene.apache.org by Yetkin Ozkucur <Ye...@asg.com> on 2014/05/01 16:04:06 UTC

Searching for tokens does not return any results

Hello everyone,

I am new to Solr and this is my first post to this list.
I have been working on this problem for a couple of days. I tried everything I could find on Google, but it looks like I am missing something.

Here is my problem:
I have a field called: DBASE_LOCAT_NM_TEXT
It contains values like: CRD_PROD
The goal is to be able to search this field either by the exact string "CRD_PROD" or by a part of it (tokenized on "_"), like "CRD" or "PROD".

Currently: 
This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
But this does not: q=DBASE_LOCAT_NM_TEXT:CRD
I want to understand why the second query does not return any results

Here is how I configured the field:
<field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>

And here is how I configured the field type:
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

I am also using the analysis panel in the SOLR admin console. It shows this:
WT	CRD_PROD

WDF	CRD_PROD
	CRD
	PROD
	CRDPROD

SF	CRD_PROD
	CRD
	PROD
	CRDPROD

LCF	crd_prod
	crd
	prod
	crdprod

SKMF	crd_prod
	crd
	prod
	crdprod

RDTF	crd_prod
	crd
	prod
	crdprod


I am not sure whether it is related, but this index was created by a Java program using the Lucene API directly. It used StandardAnalyzer for writing, and the field was configured as tokenized, indexed and stored. Does this affect the Solr configuration?
	
Can you please help me understand what I am missing and how I can debug it?

Thanks,
Yetkin

Re: Roll up query with original facets

Posted by Darin Amos <da...@gmail.com>.
My apologies!!
On May 1, 2014 6:56 PM, "Chris Hostetter" <ho...@fucit.org> wrote:

>
> https://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
>
>
> : Subject: Roll up query with original facets
> : From: Darin Amos <da...@gmail.com>
> : In-Reply-To: <1398953952.39792.YahooMailNeo@web124702.mail.ne1.yahoo.com>
> : Message-Id: <59...@gmail.com>
> : References: <D6259D1CCF526540B1CB447E5F3BC39B8E344F52C0@gechem8mail.asg.com>
> :  <13...@web124702.mail.ne1.yahoo.com>
>
>
> -Hoss
> http://www.lucidworks.com/
>

Re: Roll up query with original facets

Posted by Chris Hostetter <ho...@fucit.org>.
https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


: Subject: Roll up query with original facets
: From: Darin Amos <da...@gmail.com>
: In-Reply-To: <13...@web124702.mail.ne1.yahoo.com>
: Message-Id: <59...@gmail.com>
: References: <D6...@gechem8mail.asg.com>
:  <13...@web124702.mail.ne1.yahoo.com>


-Hoss
http://www.lucidworks.com/

Roll up query with original facets

Posted by Darin Amos <da...@gmail.com>.
Hello All,

I am having a query issue I cannot seem to find the correct answer for. I am searching against a list of items and returning facets for that list of items. I would like to group the result set on a field such as a “parentItemId”. parentItemId maps to other documents within the same core. I would like my query to return the documents that match parentItemId, but still return the facets of the original query.

Is this possible with Solr 4.3, which is what I am running? I can provide more details if needed. Thanks!

Darin

Re: Searching for tokens does not return any results

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Yetkin,

You are on the right track by examining the analysis page. How is your query analyzed by the query analyzer?

According to what you pasted, q=CRD should return your example document.

Did you change something in schema.xml and forget to restart Solr and re-index?

By the way, the lowercase tokenizer, which is based on the simple letter tokenizer, seems a better fit for your use case. With it you don't have to deal with WDF's parameters.

https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-LowerCaseTokenizer
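
As a rough illustration of that suggestion, here is a small Python approximation (not Lucene code) of what a letter-based lowercase tokenizer does: it breaks the text at every non-letter character and lowercases each piece. The function name is made up for this sketch, and Python's str.isalpha is assumed to be close enough to Lucene's notion of a letter for simple input like this.

```python
from itertools import groupby


def lowercase_letter_tokens(text):
    """Split at every non-letter character and lowercase each run of letters."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


if __name__ == "__main__":
    # The underscore is not a letter, so it acts as a separator.
    print(lowercase_letter_tokens("CRD_PROD"))
    # Digits are not letters either, so they are dropped entirely.
    print(lowercase_letter_tokens("DB2_PROD"))
```

One thing to keep in mind with this kind of tokenizer is that digits are discarded along with punctuation, so it only fits fields whose values are made of letters and separators.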

Ahmet






Re: Searching for tokens does not return any results

Posted by Erick Erickson <er...@gmail.com>.
Glad to hear it!

You shouldn't really have to customize the analyzer to get it to
behave as it would if you just used Solr to ingest documents; just
chain things together. That's what Solr does, after all. Of course you
may have special needs that are better served by more customization.

TermsComponent is a useful tool. Note that you also get raw terms if
you use the admin/schema-browser page, identify your field, and then
click the "show term info" button. That technique is somewhat limited
though. The schema-browser page is especially useful for very small
indexes and/or test cases, I'll admit. I do vaguely remember something
not being right with the schema-browser at one point though, so it
might not work as I expect for 4.4.

Best,
Erick

On Fri, May 2, 2014 at 1:56 PM, Yetkin Ozkucur <Ye...@asg.com> wrote:
> Erick, Koji, Ahmet:
>
> Thank you all for your answers! I think I found the problem and I am on the right track to fix it.
>
> 1- As you suggested, the problem was in the Java code populating the index. The analyzer in the Java code had to be consistent with the one defined in Solr. I was able to achieve my goal by creating a slightly customized analyzer.
> 2- Being able to see the tokens in the index was key to debugging the problem. I downloaded Luke (well, a tweaked version of it for Lucene 4.4) to see the tokens. I did not know Solr had the Terms Component. That is a good tip too.
>
> Have a good weekend.
>
> Thanks,
> Yetkin

RE: Searching for tokens does not return any results

Posted by Yetkin Ozkucur <Ye...@asg.com>.
Erick, Koji, Ahmet:

Thank you all for your answers! I think I found the problem and I am on the right track to fix it.

1- As you suggested, the problem was in the Java code populating the index. The analyzer in the Java code had to be consistent with the one defined in Solr. I was able to achieve my goal by creating a slightly customized analyzer.
2- Being able to see the tokens in the index was key to debugging the problem. I downloaded Luke (well, a tweaked version of it for Lucene 4.4) to see the tokens. I did not know Solr had the Terms Component. That is a good tip too.

Have a good weekend.

Thanks,
Yetkin

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, May 02, 2014 11:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching for tokens does not return any results

bq:  but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...

The fact that you used Lucene to index the doc means that the analysis page is almost, but not quite entirely, useless on the indexing side.
It's looking at your field definition in schema.xml and running your input stream through the indexing portion of your analysis chain constructed from the schema. What's actually in your index though was put there by raw Lucene. So your Lucene program _must_ create an analysis chain that is absolutely identical to what's in your schema for the admin/analysis page to be accurate.

Quick test: go to your "admin/schema browser" page or use the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
or Luke to examine the actual tokens in your field. My bet is that you'll see that the actual terms are not what you expect and almost certainly not what the admin/analysis page shows on the index side.

Keeping an independent Lucene program that puts data into your index with raw Lucene aligned with your schema is, as you can see, something of a problem. If at all possible, consider letting Solr do the indexing and sending it documents with SolrJ. Here's a reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

By the way, I want to compliment you on your post. You did all the right things:
> defined your problem clearly
> added the critical bit (index created with Lucene). This is especially relevant I think
> illustrated the input and output
> told us what the problem was
> gave us the field definitions
> showed the results of some of your investigation

Best
Erick

Re: Searching for tokens does not return any results

Posted by Erick Erickson <er...@gmail.com>.
bq:  but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...

The fact that you used Lucene to index the doc means that the analysis
page is almost, but not quite entirely, useless on the indexing side.
It's looking at your field definition in schema.xml and running your
input stream through the indexing portion of your analysis chain
constructed from the schema. What's actually in your index though was
put there by raw Lucene. So your Lucene program _must_ create an
analysis chain that is absolutely identical to what's in your schema
for the admin/analysis page to be accurate.

Quick test: go to your "admin/schema browser" page or use the
TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
or Luke to examine the actual tokens in your field. My bet is that
you'll see that the actual terms are not what you expect and almost
certainly not what the admin/analysis page shows on the index side.
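
For anyone who wants to try the TermsComponent check, here is a minimal Python sketch that only builds the request URL. The host, port and core name are placeholders, and it assumes a /terms request handler is configured in solrconfig.xml (as in the stock example configuration). The parameters terms.fl, terms.limit and terms.prefix are standard TermsComponent parameters; the function name is made up for this sketch.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=20, prefix=None):
    """Build a TermsComponent URL that lists the indexed terms of one field."""
    params = {
        "terms": "true",
        "terms.fl": field,        # field whose raw indexed terms we want to see
        "terms.limit": limit,     # maximum number of terms to return
        "wt": "json",
    }
    if prefix is not None:
        params["terms.prefix"] = prefix  # only terms starting with this prefix
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder host and core name; adjust to the actual installation.
    print(terms_url("http://localhost:8983/solr/collection1",
                    "DBASE_LOCAT_NM_TEXT", prefix="crd"))
```

If the response lists crd_prod but no separate crd or prod entries, the index does not contain the tokens that the analysis page suggests.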

Keeping an independent Lucene program that puts data into your index
with raw Lucene aligned with your schema is, as you can see, something
of a problem. If at all possible, consider letting Solr do the
indexing and sending it documents with SolrJ. Here's a reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
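
SolrJ is the Java route. Purely to illustrate the idea of letting Solr run the analysis chain itself, here is a Python sketch that swaps in a different technique: it prepares a JSON request for Solr's /update handler instead of using SolrJ. The host, core name and the "id" unique key field are assumptions for this sketch, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def build_add_request(base_url, docs):
    """Build (but do not send) a JSON add request for Solr's /update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host, core name and document; "id" is an assumed unique key.
    req = build_add_request(
        "http://localhost:8983/solr/collection1",
        [{"id": "1", "DBASE_LOCAT_NM_TEXT": "CRD_PROD"}],
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it to a running Solr instance.
```

Documents added this way pass through the index analyzer defined in schema.xml, so the admin/analysis page and the actual index contents stay in agreement.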

By the way, I want to compliment you on your post. You did all the right things:
> defined your problem clearly
> added the critical bit (index created with Lucene). This is especially relevant I think
> illustrated the input and output
> told us what the problem was
> gave us the field definitions
> showed the results of some of your investigation

Best
Erick

On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
> Hi Yetkin, welcome!
>
> I think Lucene's StandardAnalyzer is the problem you are facing.
>
> Why don't you define another field that uses StandardAnalyzer and see
> how it tokenizes CRD_PROD in the Solr admin GUI?
>
> I forget the details, but we can use a Lucene Analyzer in schema.xml,
> something like this:
>
> <fieldType ...>
>    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> </fieldType>
>
> Koji
> --
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

Re: Searching for tokens does not return any results

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Yetkin, welcome!

I think Lucene's StandardAnalyzer is the problem you are facing.

Why don't you define another field that uses StandardAnalyzer and see how it tokenizes CRD_PROD
in the Solr admin GUI?

I forget the details, but we can use a Lucene Analyzer in schema.xml, something like this:

<fieldType ...>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>
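
To illustrate why StandardAnalyzer is a plausible culprit, here is a small Python approximation (not Lucene code). StandardAnalyzer's tokenizer follows the Unicode word-break rules (UAX #29), under which an underscore between letters does not end a word, so CRD_PROD stays a single token and is then lowercased. The regular expression \w+ is only a stand-in for that behavior on simple ASCII input, the function names are made up for this sketch, and query analysis is reduced to plain lowercasing.

```python
import re


def standard_like_tokens(text):
    """Stand-in for StandardAnalyzer on simple ASCII input: "_" joins letters
    and digits into one word, and every token is lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def term_matches(indexed_value, query_term):
    """A term query only matches if that exact term exists in the index."""
    return query_term.lower() in standard_like_tokens(indexed_value)


if __name__ == "__main__":
    print(standard_like_tokens("CRD_PROD"))      # the whole value is one term
    print(term_matches("CRD_PROD", "CRD_PROD"))  # exact string is found
    print(term_matches("CRD_PROD", "CRD"))       # the part is not in the index
```

Under these assumptions the index holds only the term crd_prod, which would explain why the full string is found while CRD alone is not.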

Koji
-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
