Posted to solr-user@lucene.apache.org by dhastings <dh...@wshein.com> on 2011/08/02 17:14:26 UTC
lucene/solr, raw indexing/searching
Hello,
I am trying to get Lucene and Solr to agree on a completely raw indexing
method. I use Lucene in my indexers, which write to an index on disk, and
Solr to search the indexes I create, because creating the indexes without
Solr is much, much faster than using the Solr server.
Are there settings for BOTH Solr and Lucene to use EXACTLY what's in the
content, as opposed to interpreting what they think I'm trying to do? My
content is extremely specific and needs no interpretation or adjustment,
whether indexing or searching. It is a text field.
For example:
203.1 seems to be indexed as 2031. I can get searching for 203.1 to work
correctly, but then it won't find what's indexed using 3.1's standard
analyzer.
If I have content that is:
"this is rev. 23.302"
I need it indexed EXACTLY as it appears:
"this is rev. 23.302"
I do not want any of Solr's or Lucene's attempts to "fix" my content or my
queries. "rev." needs to stay "rev." and not turn into "rev"; "23.302"
needs to stay as such and NOT turn into "23302". This is for BOTH indexing
and searching.
Any hints?
Right now, for indexing, I have:
// Stopword set holding only a dummy word, so that no real words are dropped.
Set<String> nostopwords = new HashSet<String>();
nostopwords.add("buahahahahahaha");
Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false);
And for searching, I have in my schema:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Thanks. Very much appreciated.
--
View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: lucene/solr, raw indexing/searching
Posted by Craig Stires <cr...@gmail.com>.
dhastings,
My recommendation for both sides:
Lucene:
Try a whitespace analyzer on for size:
Analyzer an = new WhitespaceAnalyzer(Version.LUCENE_31);
Solr:
In your /index/solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    ...
  </analyzer>
</fieldType>
-craig
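To see why a whitespace-only tokenizer addresses the complaint, here is a minimal sketch in plain Java with no Lucene dependency. It only imitates the idea of splitting on whitespace and nothing else; the class and method names are made up for illustration and are not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {
    // Splits on runs of whitespace only, so punctuation stays attached to its token.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [this, is, rev., 23.302]
        System.out.println(whitespaceTokens("this is rev. 23.302"));
    }
}
```

Note that nothing is lowercased here either, so under this scheme "Rev." and "rev." would be distinct terms.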
Re: lucene/solr, raw indexing/searching
Posted by Jonathan Rochkind <ro...@jhu.edu>.
It depends. Okay, the source contains "4 harv. l. rev. 45".
Do you want a user-entered "harv." to ALSO match "harv" (without the
period) in the source, and vice versa? Or do you require that it NOT match?
Or do you not care?
The default filter analysis chain will index "4 harv. l. rev. 45"
essentially as 4;harv;l;rev;45. A phrase search for
"4 harv. l. rev. 45" will match it, but so will a phrase search for
"4 harv l rev 45", and in fact so will a phrase search for
"4 harv. l. rev45".
That could be good, or it could be bad.
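A rough way to see why all three phrases match is to normalize each one and compare token sequences. The sketch below is plain Java, not Solr code. It assumes the chain lowercases and splits on punctuation and on letter/digit boundaries, which is a guess at what the default chain effectively does, since the thread does not list the exact filters. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultChainSketch {
    // Assumed stand-in for the chain: each run of letters or run of digits is one token.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{Nd}+");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // A phrase matches when its tokens appear contiguously, in order, in the source tokens.
    public static boolean phraseMatches(String source, String phrase) {
        return Collections.indexOfSubList(analyze(source), analyze(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String source = "4 harv. l. rev. 45";
        System.out.println(analyze(source));                            // [4, harv, l, rev, 45]
        System.out.println(phraseMatches(source, "4 harv l rev 45"));   // true
        System.out.println(phraseMatches(source, "4 harv. l. rev45"));  // true
        System.out.println(phraseMatches(source, "4 harv. l. 45"));     // false
    }
}
```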
The point of the Solr analysis chain is to apply tokenization and
transformation at both index time and query time, so queries will match
source in the way you want. You can customize this analysis chain
however you want, in extreme cases even writing your own analyzers in
Java. If the out-of-the-box default isn't doing what you want, you'll
have to spend some time thinking about how an inverted index like Lucene
works, and what you want. You would need to provide a lot more
specifications/details for someone else to figure out what analysis
chain will do what you want, but I bet you can figure it out yourself
after reading up a bit and thinking a bit.
See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
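The symmetry between index-time and query-time analysis can be illustrated with a toy inverted index. This is a sketch in plain Java, not Lucene's data structures. The two analyzer functions are hypothetical stand-ins: one splits on whitespace only, and the other also strips punctuation, which reproduces the "23.302" becoming "23302" symptom from the original post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Function<String, List<String>> indexAnalyzer;

    public InvertedIndexSketch(Function<String, List<String>> indexAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
    }

    // Index time: store the document id under every term the analyzer emits.
    public void add(int docId, String text) {
        for (String term : indexAnalyzer.apply(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: return the documents that contain every analyzed query term.
    public Set<Integer> search(String query, Function<String, List<String>> queryAnalyzer) {
        Set<Integer> result = null;
        for (String term : queryAnalyzer.apply(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    // Hypothetical analyzer: whitespace splitting only.
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical lossy analyzer: drops everything except letters, digits and whitespace.
    public static List<String> stripPunctuation(String text) {
        return whitespace(text.replaceAll("[^\\p{L}\\p{Nd}\\s]", ""));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch(InvertedIndexSketch::stripPunctuation);
        index.add(1, "this is rev. 23.302");
        // Same analyzer on both sides: the query term becomes "23302" and matches.
        System.out.println(index.search("23.302", InvertedIndexSketch::stripPunctuation)); // [1]
        // Different analyzer at query time: "23.302" is not a term in the index.
        System.out.println(index.search("23.302", InvertedIndexSketch::whitespace));       // []
    }
}
```

The second lookup fails even though the query text is identical to the source text, because the index no longer holds the term in that form.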
Re: lucene/solr, raw indexing/searching
Posted by dhastings <dh...@wshein.com>.
I have decided to use Solr for indexing as well.
The types of searches I'm doing are professional/academic.
So, for example, I need to match all of the following exactly from my
source data:
"3.56",
"4 harv. l. rev. 45",
"187-532",
"3 llm 56",
"5 unts 8",
"6 u.n.t.s. 78",
"father's obligation"
I seem to keep running into issues getting this to work. The searching is
being done on a text field that is not stored.
Re: lucene/solr, raw indexing/searching
Posted by Erick Erickson <er...@gmail.com>.
I predict you'll spend a lot of time on the admin/analysis page
understanding what the various combinations of tokenizers and filters do.
Because, you see, you already have differences, to wit: your Solr schema
has LowerCaseFilter and RemoveDuplicatesTokenFilter.
Have you determined *why* Solr indexing is slower? You might consider using
SolrJ and firing multiple threads/processes at the issue to bring indexing
performance up to acceptable levels and avoid this problem entirely.
Best
Erick
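The "multiple threads" idea can be sketched without any Solr classes. The plain-Java example below splits documents into batches and hands each batch to a worker thread. The `sender` callback is a hypothetical placeholder for whatever client call actually posts a batch to Solr, and the batch size and thread count are arbitrary illustration values, not tuning advice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class ParallelIndexSketch {
    // Sends docs in batches on a fixed pool of threads; returns how many docs were sent.
    public static int indexInParallel(List<String> docs, int batchSize, int threads,
                                      Consumer<List<String>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = new ArrayList<>(docs.subList(start, end));
                pending.add(pool.submit(() -> {
                    sender.accept(batch); // placeholder for the real indexing call
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> f : pending) {
                sent += f.get(); // waits for the batch and surfaces any failure
            }
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc-" + i);
        }
        int sent = indexInParallel(docs, 3, 4, batch -> System.out.println("sending " + batch));
        System.out.println(sent + " documents sent");
    }
}
```

Whether this helps in practice depends on where the time is actually going, which is the point of the question about *why* indexing is slow.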
Re: lucene/solr, raw indexing/searching
Posted by Jonathan Rochkind <ro...@jhu.edu>.
In your Solr schema.xml, are the fields you are using defined as text
fields with analyzers? It sounds like you want no analysis at all, which
probably means you don't want text fields either; you just want string
fields. That will make it impossible to search for individual tokens,
though: searches will match only on complete matches of the value.
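The trade-off between the two field kinds can be shown with a small plain-Java sketch. It does not use Solr; the two methods are hypothetical stand-ins for "unanalyzed string field" (whole-value equality) and "whitespace-tokenized text field" (match on any single token).

```java
import java.util.Arrays;

public class StringVsTextSketch {
    // String-field behaviour: the query must equal the entire stored value.
    public static boolean stringFieldMatches(String stored, String query) {
        return stored.equals(query);
    }

    // Tokenized behaviour (whitespace only): the query must equal one of the tokens.
    public static boolean textFieldMatches(String stored, String query) {
        return Arrays.asList(stored.trim().split("\\s+")).contains(query);
    }

    public static void main(String[] args) {
        String stored = "this is rev. 23.302";
        System.out.println(stringFieldMatches(stored, "this is rev. 23.302")); // true
        System.out.println(stringFieldMatches(stored, "23.302"));              // false
        System.out.println(textFieldMatches(stored, "23.302"));                // true
        System.out.println(textFieldMatches(stored, "23302"));                 // false
    }
}
```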
I'm not quite sure how to do what you want; it depends on exactly what
you want. What kind of searching do you expect to support? If you still
do want tokenization, you'll still want some analysis... but I'm not
quite sure how that corresponds to what you'd want to do on the Lucene
end. What you're trying to do is going to be inevitably confusing, I
think. Which doesn't mean it's not possible. You might find it less
confusing if you were willing to use Solr to index though, rather than
straight Lucene -- you could use Solr via the SolrJ Java classes, rather
than the HTTP interface.