Posted to solr-user@lucene.apache.org by Per Newgro <pe...@gmx.ch> on 2011/07/20 17:57:59 UTC

How can i find a document by a special id?

Hi,

I'm new to Solr. I built an application using the standard Solr 3.3
examples as a starting point.
My id field is a string and is copied to a solr.TextField ("searchtext")
for search queries.
Everything works fine except when I try to get documents by a special id.

Let me explain the details. Assume id = "1234567". I would like to
query this document by using q=searchtext:AB1234567. The prefix ("AB")
acts as a pseudo-id in our system. Users know it and search for it. But
the document is not findable because the Solr index only knows the
"short id".

Adding a new document with the prefixed id as its id is not an option,
because then I would have to add many documents.

As far as I understand, stemming and n-gram tokenizing won't help
because they act on indexed tokens that are longer than the search token.

How can I do this?

Thanks
Per

Re: How can i find a document by a special id?

Posted by Per Newgro <pe...@gmx.ch>.

Sorry for being so stupid. I had modified the wrong schema.

So the "solr.WordDelimiterFilterFactory" works as expected and solved my problem. I've added the line
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumericChange="1"/>

to my schema and the test is green.

Thanks to all for helping me
Per
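
To make the effect of that filter line concrete, here is a rough Python sketch (not Solr code) of how a query token would be reduced. It only models the letter/digit split requested by splitOnNumericChange="1" and the generateWordParts / generateNumberParts switches. The function name is made up for illustration, and the real filter handles many more cases (delimiters, case changes, catenation).

```python
import re

def word_delimiter_parts(token, generate_word_parts=False, generate_number_parts=True):
    """Simplified stand-in for the letter/digit splitting of WordDelimiterFilter."""
    # Split the token at every letter<->digit transition.
    runs = re.findall(r"[A-Za-z]+|[0-9]+", token)
    parts = []
    for run in runs:
        if run.isdigit():
            # Numeric sub-tokens are kept only when generateNumberParts is on.
            if generate_number_parts:
                parts.append(run)
        elif generate_word_parts:
            # Alphabetic sub-tokens are kept only when generateWordParts is on.
            parts.append(run)
    return parts

print(word_delimiter_parts("AB1234567"))  # ['1234567']
```

Under these assumptions the query AB1234567 is reduced to 1234567, which is the only term the index knows for this document.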



Re: How can i find a document by a special id?

Posted by Per Newgro <pe...@gmx.ch>.

The problem is that I didn't store the mediacode in a field, because the code is used frequently for getting the customer source.

So far I've found the "solr.WordDelimiterFilterFactory", which according to the wiki is the way to go. The problem seems to be that I'm searching for a longer string than I've indexed. I only index the numeric id (12345),
but the query string is BR12345, and I don't get any results. Can I fine-tune
the WDFF somehow?

Here is the output of admin/analysis.jsp:

Index Analyzer
org.apache.solr.analysis.StandardTokenizerFactory {luceneMatchVersion=LUCENE_33}
position 	1
term text 	BR12345
startOffset 	0
endOffset 	7
type 	<ALPHANUM>
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true, luceneMatchVersion=LUCENE_33}
position 	1
term text 	BR12345
startOffset 	0
endOffset 	7
type 	<ALPHANUM>
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33, generateWordParts=1, catenateAll=0, catenateNumbers=1}
position 	1	2
term text 	BR	12345
startOffset 	0	2
endOffset 	2	7
type 	<ALPHANUM>	<ALPHANUM>
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_33}
position 	1	2
term text 	br	12345
startOffset 	0	2
endOffset 	2	7
type 	<ALPHANUM>	<ALPHANUM>
Query Analyzer
org.apache.solr.analysis.StandardTokenizerFactory {luceneMatchVersion=LUCENE_33}
position 	1
term text 	BR12345
startOffset 	0
endOffset 	7
type 	<ALPHANUM>
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true, luceneMatchVersion=LUCENE_33}
position 	1
term text 	BR12345
startOffset 	0
endOffset 	7
type 	<ALPHANUM>
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true, luceneMatchVersion=LUCENE_33}
position 	1
term text 	BR12345
startOffset 	0
endOffset 	7
type 	<ALPHANUM>
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, luceneMatchVersion=LUCENE_33, generateWordParts=1, catenateAll=0, catenateNumbers=0}
position 	1	2
term text 	BR	12345
startOffset 	0	2
endOffset 	2	7
type 	<ALPHANUM>	<ALPHANUM>
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_33}
position 	1	2
term text 	br	12345
startOffset 	0	2
endOffset 	2	7
type 	<ALPHANUM>	<ALPHANUM>
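
Read as plain term lists, the analysis output above can be compared with a small Python sketch. It assumes that every term produced by the query analyzer must be present in the document (AND or phrase-like behaviour). Whether that holds depends on the default operator and phrase settings, which are not shown in this thread, so treat it as an illustration only.

```python
def matches(indexed_terms, query_terms):
    # Assumption: every query term is required to be in the document.
    return set(query_terms) <= set(indexed_terms)

indexed = ["12345"]        # only the numeric id is actually indexed
query = ["br", "12345"]    # final query analyzer output for BR12345

print(matches(indexed, query))      # False: "br" is not in the index
print(matches(indexed, ["12345"]))  # True: the number part alone matches
```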

My field type is here

schema.xml
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


Thanks
Per


Re: How can i find a document by a special id?

Posted by Bill Bell <bi...@gmail.com>.

Why not just search the 2 fields?

q=*:*&fq=mediacode:AB OR id:123456

You could take the user input and substitute it for $input:

q=*:*&fq=mediacode:$input OR id:$input

Of course you can also use dismax and wrap with an OR.

Bill Bell
Sent from mobile
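
A minimal Python sketch of that substitution, assuming the field names mediacode and id from the message above. The helper name is hypothetical, and quoting the input as a phrase is my own precaution so that characters such as ':' or spaces in the input are not read as query syntax.

```python
from urllib.parse import urlencode

def build_params(user_input):
    # Escape backslashes and double quotes, then wrap the input in quotes.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    fq = 'mediacode:"{0}" OR id:"{0}"'.format(escaped)
    # urlencode takes care of the URL escaping of both parameters.
    return urlencode({"q": "*:*", "fq": fq})

# The result can be appended to the select URL of your Solr instance.
print(build_params("AB1234567"))
```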



Re: How can i find a document by a special id?

Posted by Chris Hostetter <ho...@fucit.org>.

: Am 20.07.2011 19:23, schrieb Kyle Lee:
: > Is the mediacode always alphabetic, and is the ID always numeric?
: > 
: No sadly not. We expose our products on "too" many medias :-).

If I'm understanding you correctly, you're saying even the prefix "AB" is 
not special, that there could be any number of prefixes identifying 
different "mediacodes"? And the product ids aren't all numeric?

Your question seems .... absurd.  

I can only assume that I am horribly misunderstanding your situation 
(which is very easy to do when you only have a single contrived piece of 
example data to go on).

As a general rule, it's not a good idea to think about Solr in the same 
way as a relational database, but perhaps if you imagine for a moment that 
your Solr index *was* a (read only) relational database, with each 
Solr field corresponding to a column in your DB, and then you described in 
pseudo-code/SQL how you would go about doing the types of id lookups you 
want to do, it might give us a better idea of your situation so we can 
suggest an approach for dealing with it.


-Hoss
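
One hypothetical way such a relational description could look, sketched with Python's sqlite3 module. The table and column names and the list of mediacodes are assumptions made for illustration; only the id "1234567" and the prefixes "AB" and "BR" come from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE media (code TEXT PRIMARY KEY);
    INSERT INTO product VALUES ('1234567', 'example product');
    INSERT INTO media VALUES ('AB'), ('BR');
""")

def find_product(search_word):
    # Match either the bare id or any known mediacode followed by the id.
    sql = """
        SELECT DISTINCT p.id, p.name
        FROM product p LEFT JOIN media m ON 1 = 1
        WHERE p.id = ? OR m.code || p.id = ?
    """
    return conn.execute(sql, (search_word, search_word)).fetchall()

print(find_product("AB1234567"))  # [('1234567', 'example product')]
```

The point of the sketch is that the prefixed id is never stored: it only exists as the combination of a mediacode and a product id at lookup time.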

Re: How can i find a document by a special id?

Posted by Per Newgro <pe...@gmx.ch>.

Am 20.07.2011 19:23, schrieb Kyle Lee:
> Is the mediacode always alphabetic, and is the ID always numeric?
>
No, sadly not. We expose our products on "too" many media :-).

Per

Re: How can i find a document by a special id?

Posted by Kyle Lee <ra...@gmail.com>.

Is the mediacode always alphabetic, and is the ID always numeric?

Re: How can i find a document by a special id?

Posted by Per Newgro <pe...@gmx.ch>.

Am 20.07.2011 18:03, schrieb Kyle Lee:
> Perhaps I'm missing something, but if your fields are indexed as "1234567"
> but users are searching for "AB1234567," is it not possible simply to strip
> the prefix from the user's input before sending the request?
>
Sorry for not being clear here. I only use a single search field. It can 
contain multiple search words, and one of them is the id. So I don't really 
know whether a given search word is an id.
The use case is: we have a product database with some items. A product 
has an id, name, features, etc. They all go into the searchtext field I 
described. We promote our products in different media, so every product 
can have a mediaid (AB is the mediacode, 1234567 is the id). Users should 
be able to find the product by id and by mediaid.

I hope I have explained myself better.

Thanks for helping me
Per

Re: How can i find a document by a special id?

Posted by Kyle Lee <ra...@gmail.com>.

Perhaps I'm missing something, but if your fields are indexed as "1234567"
but users are searching for "AB1234567," is it not possible simply to strip
the prefix from the user's input before sending the request?
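
A minimal Python sketch of that client-side stripping, applied to each search word separately. The list of mediacodes and the function names are hypothetical, since the real codes are not given in this thread.

```python
# Hypothetical list of codes; replace it with the real mediacodes.
KNOWN_MEDIACODES = ("AB", "BR")

def strip_mediacode(word, mediacodes=KNOWN_MEDIACODES):
    # Remove the first matching prefix, but never reduce a word to nothing.
    for code in mediacodes:
        if word.startswith(code) and len(word) > len(code):
            return word[len(code):]
    return word

def rewrite_query(user_input):
    # Treat every whitespace-separated search word independently.
    return " ".join(strip_mediacode(w) for w in user_input.split())

print(rewrite_query("AB1234567 red shoes"))  # 1234567 red shoes
```

One caveat: because the prefix is matched purely by its characters, an ordinary search word that happens to start with a mediacode (for example "ABC") would be shortened too.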


Re: Solr not returning results for some key words

Posted by Matthew Twomey <mt...@beakstar.com>.

Ok, apparently I'm not the first to have fallen prey to the maxFieldLength 
gotcha:

http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html

All fixed now.

-Matt
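
For readers who have not met this setting: as far as I know, maxFieldLength caps the number of tokens that are indexed per field, while the stored copy of the field keeps the full text. The toy Python model below (not Solr code, with made-up names and a tiny limit) shows how a word can be visible in stored content and highlights and still not be searchable.

```python
def index_document(text, max_field_length):
    tokens = text.lower().split()
    stored = text                             # the stored value is kept in full
    indexed = set(tokens[:max_field_length])  # only the first N tokens are searchable
    return stored, indexed

words = ["filler"] * 10 + ["UnsupportedOperationException"]
stored, indexed = index_document(" ".join(words), max_field_length=5)

print("UnsupportedOperationException" in stored)   # True
print("unsupportedoperationexception" in indexed)  # False
```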


Solr not returning results for some key words

Posted by Matthew Twomey <mt...@beakstar.com>.

Greetings,

I'm having trouble getting Solr to return results for key words that I 
know for sure are in the index. As a test, I've indexed a PDF of a book 
on Java. I'm trying to search the index for 
"UnsupportedOperationException" but I get no results. I can "see" it in 
the index though:

#####
[root@myhost apache-solr-1.4.1]# strings example/solr/data/index/_0.fdt|grep UnsupportedOperationException
UnsupportedOperationException if the iterator returned by this collec-
throw new UnsupportedOperationException();
UnsupportedOperationException Object does not support method    CHAPTER 9 EXCEPTIONS
UnsupportedOperationException, 87,
[root@myhost apache-solr-1.4.1]#
#####

On the other hand, if I search the index for the word "support" (which 
is also contained in the grep above), I get a hit on this document. 
Furthermore, if I search on "support" and include highlighted snippets, 
I can see the word "UnsupportedOperationException" right in there in the 
highlight results!

#####
of an object has
been detected where it is prohibited
UnsupportedOperationException Object does not <em>support</em>
#####

So why do I get no hits when I search for it?

This happens with many different key words. Any thoughts on how I can 
troubleshoot this, or ideas on why it's not working properly?

Thanks,

-Matt