
Posted to solr-user@lucene.apache.org by dhastings <dh...@wshein.com> on 2011/08/02 17:14:26 UTC

lucene/solr, raw indexing/searching

Hello,
I am trying to get Lucene and Solr to agree on a completely raw indexing
method.  I use Lucene in my indexers, which write to an index on disk, and
Solr to search the indexes I create, because creating the indexes without
Solr is much faster than going through the Solr server.

Are there settings for BOTH Solr and Lucene to use EXACTLY what's in the
content, as opposed to interpreting what it thinks I'm trying to do?  My
content is extremely specific and needs no interpretation or adjustment,
either when indexing or when searching.  The field is a text field.

For example:

203.1 seems to be indexed as 2031.  I can get searching for 203.1 to work
correctly, but then it won't find what was indexed using Lucene 3.1's
StandardAnalyzer.

If I have content that is:
"this is rev. 23.302"

I need it indexed EXACTLY as it appears:
"this is rev. 23.302"

I do not want any of Solr's or Lucene's attempts to "fix" my content or my
queries.  "rev." needs to stay "rev." and not turn into "rev"; "23.302"
needs to stay as such and NOT turn into "23302".  This is for BOTH indexing
and searching.

Any hints?

Right now for indexing I have:

    // Stop set holding one dummy word, so no real stop words are removed.
    Set<String> nostopwords = new HashSet<String>();
    nostopwords.add("buahahahahahaha");

    Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
    writer = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
    writer.setUseCompoundFile(false);


And for searching I have in my schema:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Thanks.  Very much appreciated.


--
View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: lucene/solr, raw indexing/searching

Posted by Craig Stires <cr...@gmail.com>.

dhastings,

my recommendation for the approaches from both sides ...

Lucene:
try on a whitespace analyzer for size

   Analyzer an = new WhitespaceAnalyzer(Version.LUCENE_31);


Solr:
in your /index/solr/conf/schema.xml

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        ...
     </analyzer>
   </fieldType>
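
A rough illustration of what whitespace-only tokenization means for the
sample content in this thread. This is plain Java that only mimics the
effect; it does not call the Lucene WhitespaceAnalyzer or the Solr
factory, and the class name WhitespaceSplitSketch is made up for this note.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics whitespace-only tokenization with the JDK.
// This is NOT the Lucene WhitespaceAnalyzer API.
class WhitespaceSplitSketch {

    // Split on runs of whitespace; punctuation inside tokens is left untouched.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [this, is, rev., 23.302]
        System.out.println(tokenize("this is rev. 23.302"));
    }
}
```

With this kind of splitting, "rev." keeps its period and "23.302" stays
one token, which is the behaviour asked for in the original post. Case is
also left alone, so any lowercasing has to be the same on both sides.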


-craig



Re: lucene/solr, raw indexing/searching

Posted by Jonathan Rochkind <ro...@jhu.edu>.

It depends. Okay, the source contains "4 harv. l. rev. 45".

Do you want a user-entered "harv." to ALSO match "harv" (without the
period) in the source, and vice versa? Or do you require that it NOT match?
Or do you not care?

The default filter analysis chain will index "4 harv. l. rev. 45"
essentially as 4;harv;l;rev;45.  A phrase search for
"4 harv. l. rev. 45" will match it, but so will a phrase search for
"4 harv l rev 45", and in fact so will a phrase search for "4 harv. l. rev45".

That could be good, or it could be bad.
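
To make the point above concrete, here is a simplified stand-in for a
punctuation-dropping chain, written in plain Java. It is an assumption-laden
sketch, not Solr's actual default chain: it only lowercases, splits on
whitespace and strips non-alphanumeric characters. It does not split
letter/digit boundaries, so it does not reproduce the "rev45" case. The
names PunctuationStrippingSketch, analyze and phraseMatches are invented
for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for an analysis chain that drops punctuation.
class PunctuationStrippingSketch {

    // Lowercase, split on whitespace, keep only letters and digits in each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String cleaned = raw.replaceAll("[^a-z0-9]", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    // A phrase matches when its analyzed tokens occur consecutively in the source tokens.
    static boolean phraseMatches(String source, String phrase) {
        List<String> sourceTokens = analyze(source);
        List<String> phraseTokens = analyze(phrase);
        return !phraseTokens.isEmpty()
                && Collections.indexOfSubList(sourceTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        String source = "4 harv. l. rev. 45";
        System.out.println(analyze(source));                          // [4, harv, l, rev, 45]
        System.out.println(phraseMatches(source, "4 harv l rev 45")); // true
    }
}
```

Because the same analysis runs on the source and on the query, the query
with periods and the query without them become the same token sequence.
That is also why "23.302" and "23302" become indistinguishable in a chain
like this one.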

The point of the Solr analysis chain is to apply tokenization and
transformation at both index time and query time, so queries will match
source in the way you want. You can customize this analysis chain
however you want, in extreme cases even writing your own analyzers in
Java. If the out-of-the-box default isn't doing what you want, you'll
have to spend some time thinking about how an inverted index like Lucene
works, and what you want. You would need to provide a lot more
specifications/details for someone else to figure out what analysis
chain will do what you want, but I bet you can figure it out yourself
after reading up a bit and thinking it over.

See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

  On 8/4/2011 4:30 PM, dhastings wrote:
> I have decided to use Solr for indexing as well.
>
> The types of searches I'm doing are professional/academic.
> For example, I need to match all of the following exactly from my
> source data:
>      "3.56",
>      "4 harv. l. rev. 45",
>      "187-532",
>      "3 llm 56",
>      "5 unts 8",
>      "6 u.n.t.s. 78",
>      "father's obligation"
>
>
> I keep running into issues getting this to work.  The searching is
> being done on a text field that is not stored.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: lucene/solr, raw indexing/searching

Posted by dhastings <dh...@wshein.com>.

I have decided to use Solr for indexing as well.

The types of searches I'm doing are professional/academic.
For example, I need to match all of the following exactly from my
source data:
    "3.56",
    "4 harv. l. rev. 45",
    "187-532",
    "3 llm 56",
    "5 unts 8",
    "6 u.n.t.s. 78",
    "father's obligation"


I keep running into issues getting this to work.  The searching is
being done on a text field that is not stored.


Re: lucene/solr, raw indexing/searching

Posted by Erick Erickson <er...@gmail.com>.

I predict you'll spend a lot of time on the admin/analysis page
understanding what the various combinations of tokenizers and filters do.
Because, you see, you already have differences, to wit: your Solr schema
has LowerCaseFilterFactory and RemoveDuplicatesTokenFilterFactory.
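
A small sketch of why such differences matter, under the assumption that
the index side splits on whitespace only while the query side also
lowercases. It is plain Java, not Lucene or Solr code, and the names
AnalysisMismatchSketch, indexTokens, queryToken and termMatches are
hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration of an index-time/query-time analysis mismatch.
class AnalysisMismatchSketch {

    // Assumed index side: whitespace split only, case preserved.
    static Set<String> indexTokens(String text) {
        return new HashSet<String>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Assumed query side: the term is lowercased before lookup.
    static String queryToken(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // A term matches only if the query-side token equals an index-side token.
    static boolean termMatches(String source, String queryTerm) {
        return indexTokens(source).contains(queryToken(queryTerm));
    }

    public static void main(String[] args) {
        System.out.println(termMatches("4 Harv. L. Rev. 45", "Harv.")); // false
        System.out.println(termMatches("4 harv. l. rev. 45", "Harv.")); // true
    }
}
```

In the first call the query is an exact copy of a word in the source, yet
it fails, because only one side was lowercased. Whatever chain is chosen,
it has to produce the same tokens on both sides.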

Have you determined *why* Solr indexing is slower? You might consider using
SolrJ and firing multiple threads/processes at the issue to bring indexing
performance up to acceptable levels and avoid this problem entirely....
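
A minimal sketch of the "multiple threads" idea using only the JDK. The
BatchSender interface is a hypothetical placeholder for whatever client
call actually sends a batch of documents to the server; no SolrJ classes
are used here, and documents are represented as plain strings purely for
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelIndexingSketch {

    // Hypothetical stand-in for the client call that sends one batch.
    interface BatchSender {
        void send(List<String> batch);
    }

    // Splits docs into batches and sends them from a fixed pool of worker threads.
    // Returns the number of documents handed to the sender.
    static int indexAll(List<String> docs, int batchSize, int threads, BatchSender sender) {
        if (batchSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("batchSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> pending = new ArrayList<Future<Integer>>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = new ArrayList<String>(docs.subList(start, end));
                Callable<Integer> task = () -> {
                    sender.send(batch);
                    return batch.size();
                };
                pending.add(pool.submit(task));
            }
            int sent = 0;
            for (Future<Integer> f : pending) {
                sent += f.get(); // waits for the batch and surfaces any failure
            }
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The sender is called from several threads at once, so whatever sits behind
it has to be safe for concurrent use. Batch size and thread count are
tuning knobs to measure, not values suggested anywhere in this thread.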

Best
Erick
On Aug 2, 2011 12:37 PM, "Jonathan Rochkind" <ro...@jhu.edu> wrote:
> In your solr schema.xml, are the fields you are using defined as text
> fields with analyzers? It sounds like you want no analysis at all, which
> probably means you don't want text fields either, you just want string
> fields. That will make it impossible to search for individual tokens
> though, searches will match only on complete matches of the value.
>
> I'm not quite sure how to do what you want, it depends on exactly what
> you want. What kind of searching do you expect to support? If you still
> do want tokenization, you'll still want some analysis... but I'm not
> quite sure how that corresponds to what you'd want to do on the lucene
> end. What you're trying to do is going to be inevitably confusing, I
> think. Which doesn't mean it's not possible. You might find it less
> confusing if you were willing to use Solr to index though, rather than
> straight lucene -- you could use Solr via the SolrJ java classes, rather
> than the HTTP interface.

Re: lucene/solr, raw indexing/searching

Posted by Jonathan Rochkind <ro...@jhu.edu>.

In your solr schema.xml, are the fields you are using defined as text 
fields with analyzers? It sounds like you want no analysis at all, which 
probably means you don't want text fields either, you just want string 
fields. That will make it impossible to search for individual tokens 
though, searches will match only on complete matches of the value.
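
The trade-off described above can be sketched in plain Java. This is an
illustration of the matching behaviour only, not Solr's implementation of
string or text fields; the class and method names are invented for the
example, and the "text" side assumes simple whitespace tokenization.

```java
import java.util.Arrays;

// Contrast between an unanalyzed (string-like) field and a tokenized one.
class StringVersusTextSketch {

    // String-like field: the whole value is one term, so only a complete match hits.
    static boolean stringFieldMatches(String value, String query) {
        return value.equals(query);
    }

    // Tokenized field (whitespace split assumed): any single token can match.
    static boolean textFieldMatches(String value, String query) {
        return Arrays.asList(value.trim().split("\\s+")).contains(query);
    }

    public static void main(String[] args) {
        String value = "4 harv. l. rev. 45";
        System.out.println(stringFieldMatches(value, "harv."));              // false
        System.out.println(stringFieldMatches(value, "4 harv. l. rev. 45")); // true
        System.out.println(textFieldMatches(value, "harv."));                // true
    }
}
```

So a string-style field gives exactness but loses the ability to find a
citation fragment inside a longer value, while a tokenized field keeps
that ability as long as the tokenizer leaves the punctuation alone.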

I'm not quite sure how to do what you want, it depends on exactly what 
you want. What kind of searching do you expect to support?  If you still 
do want tokenization, you'll still want some analysis... but I'm not 
quite sure how that corresponds to what you'd want to do on the lucene 
end.  What you're trying to do is going to be inevitably confusing, I 
think. Which doesn't mean it's not possible.  You might find it less 
confusing if you were willing to use Solr to index though, rather than 
straight lucene -- you could use Solr via the SolrJ java classes, rather 
than the HTTP interface.
