Posted to java-user@lucene.apache.org by John Wang <jo...@gmail.com> on 2008/03/13 01:40:35 UTC

indexing api wrt Analyzer

Hi all:

    Maybe this has been asked before:

    I am building an index that consists of documents in multiple languages
(the language is stored as a field), and I use a different analyzer depending
on the language of the document to be indexed. But the IndexWriter takes only
a single Analyzer.

    I was hoping to have IndexWriter take an AnalyzerFactory, where the
AnalyzerFactory produces an Analyzer depending on some criteria of the
document, e.g. language.

    Maybe I am going about this the wrong way.

    Any suggestions on how to go about it?

Thanks

-John

Re: indexing api wrt Analyzer

Posted by Daniel Noll <da...@nuix.com>.

On Thursday 13 March 2008 15:21:19 Asgeir Frimannsson wrote:
> >    I was hoping to have IndexWriter take an AnalyzerFactory, where the
> > AnalyzerFactory produces Analyzer depending on some criteria of the
> > document, e.g. language.

> With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
> each field, as well as a default analyzer.

Certainly this would work as long as you store each language in a different
Lucene field.  This is probably a good idea anyway, as it will make things
easier at query time: the text given to the QueryParser won't necessarily be
long enough to determine the language reliably.

Daniel
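
A minimal sketch of the one-field-per-language idea discussed above, using
stand-in types rather than Lucene classes. The PerFieldLookup class, the toy
"analyzers" (plain functions from text to tokens), and field names such as
content_fr are assumptions made for illustration only; in Lucene itself, the
per-field dispatch with a default is the role of PerFieldAnalyzerWrapper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Stand-in for per-field analyzer dispatch; nothing here is a Lucene class.
final class PerFieldLookup {
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldLookup(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Associate a toy analyzer with one field name.
    void register(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    // Use the field's analyzer if one is registered, otherwise the default.
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Hypothetical naming convention: one field per language, e.g. "content_fr".
    static String fieldFor(String baseField, String language) {
        return baseField + "_" + language.toLowerCase(Locale.ROOT);
    }

    // Toy default analysis: split on whitespace, keep case.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Toy language-specific analysis: lowercase, then split on whitespace.
    static List<String> lowercase(String text) {
        return whitespace(text.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        PerFieldLookup lookup = new PerFieldLookup(PerFieldLookup::whitespace);
        lookup.register(fieldFor("content", "fr"), PerFieldLookup::lowercase);
        System.out.println(lookup.analyze("content_fr", "Le Chat Noir"));
        System.out.println(lookup.analyze("title", "Le Chat Noir"));
    }
}
```

Because the language is encoded in the field name, the same lookup can be
reused at query time once the user's language is known.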

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: indexing api wrt Analyzer

Posted by Asgeir Frimannsson <as...@gmail.com>.

On Thu, Mar 13, 2008 at 10:40 AM, John Wang <jo...@gmail.com> wrote:

> Hi all:
>
>    Maybe this has been asked before:
>
>    I am building an index that consists of documents in multiple languages
> (the language is stored as a field), and I use a different analyzer
> depending on the language of the document to be indexed. But the
> IndexWriter takes only a single Analyzer.
>
>    I was hoping to have IndexWriter take an AnalyzerFactory, where the
> AnalyzerFactory produces an Analyzer depending on some criteria of the
> document, e.g. language.
>
>    Maybe I am going about this the wrong way.
>
>    Any suggestions on how to go about it?
>

Perhaps this is what you are searching for:

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
each field, as well as a default analyzer.

cheers,
asgeir

Re: indexing api wrt Analyzer

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 13, 2008, at 11:03 AM, John Wang wrote:

> Yes, but usually it's a good idea to add documents in batches rather than
> re-instantiating the writer for every document and then closing it.
>

Why does what I suggested require instantiating a new writer for every
document?  It uses the analyzer you pass in with the method:

IndexWriter writer = new IndexWriter(dir, defaultAnalyzer, ...);

// while adding docs, reuse the same writer and pick the analyzer per document
for each document to add:
    Document doc = ...;
    Analyzer analyzer = getAnalyzer(language);
    writer.addDocument(doc, analyzer);

writer.close();

-Grant
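
The loop above relies on a getAnalyzer(language) helper that the thread does
not define. The following is a rough sketch of what such a helper and the
single-writer loop might look like, written against stand-in types so that it
runs without Lucene. ToyAnalyzer, ToyWriter, and the toy French analysis are
hypothetical; only the shape of addDocument(doc, analyzer) mirrors the method
named in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo: one writer for the whole batch, analyzer chosen per document.
final class PerDocumentAnalyzerDemo {
    // Toy default analysis: split on whitespace, keep case.
    static final ToyAnalyzer DEFAULT = text -> Arrays.asList(text.split("\\s+"));

    // Toy "French" analysis: lowercase and drop a leading l' elision.
    static final ToyAnalyzer FRENCH = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            out.add(t.startsWith("l'") ? t.substring(2) : t);
        }
        return out;
    };

    private static final Map<String, ToyAnalyzer> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("fr", FRENCH);
    }

    // Look up the analyzer for a language code, falling back to the default.
    static ToyAnalyzer getAnalyzer(String language) {
        return BY_LANGUAGE.getOrDefault(language, DEFAULT);
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter(DEFAULT);
        String[][] docs = { {"fr", "L'Analyse du texte"}, {"en", "Text Analysis"} };
        for (String[] d : docs) {
            writer.addDocument(d[1], getAnalyzer(d[0]));
        }
        System.out.println(writer.indexed);
    }
}

// Stand-in for an analyzer: turns text into tokens. Not a Lucene type.
interface ToyAnalyzer {
    List<String> tokens(String text);
}

// Stand-in for a writer that accepts an optional per-document analyzer.
final class ToyWriter {
    private final ToyAnalyzer defaultAnalyzer;
    final List<List<String>> indexed = new ArrayList<>();

    ToyWriter(ToyAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Uses the analyzer given at construction.
    void addDocument(String text) {
        addDocument(text, defaultAnalyzer);
    }

    // Overrides the default analyzer for this one document.
    void addDocument(String text, ToyAnalyzer analyzer) {
        indexed.add(analyzer.tokens(text));
    }
}
```

The point of the sketch is that the writer is created once and the analyzer
choice is an ordinary per-document lookup in application code.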



Re: indexing api wrt Analyzer

Posted by John Wang <jo...@gmail.com>.

Excellent!
Exactly what I was looking for!

Thanks Grant!

-John

On Thu, Mar 13, 2008 at 5:39 PM, Grant Ingersoll <gs...@apache.org>
wrote:

> There is an addDocument method that takes an Analyzer and overrides
> the one used at construction of the IndexWriter.  See
>
> http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/index/IndexWriter.html#addDocument(org.apache.lucene.document.Document,%20org.apache.lucene.analysis.Analyzer)
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

Re: indexing api wrt Analyzer

Posted by Grant Ingersoll <gs...@apache.org>.

There is an addDocument method that takes an Analyzer and overrides
the one used at construction of the IndexWriter.  See
http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/index/IndexWriter.html#addDocument(org.apache.lucene.document.Document,%20org.apache.lucene.analysis.Analyzer)



On Mar 13, 2008, at 4:12 PM, John Wang wrote:

> Hi Grant:
>
>    For our corpus, we don't rely much on idf in the scoring calculation,
> so I don't see that being much of a problem.
>
>    About performance: compare instantiating one IndexWriter for a batch of,
> say, 1000 docs (iterating over the docs and calling addDocument) with
> instantiating and closing 1000 IndexWriters, each doing one addDocument.
> Are you saying the expected performance is the same? I thought that when
> you call addDocument, the document is added to memory and flushed when a
> segment needs to be merged or the writer closes.
>
>    Maybe I am missing something.
>
> Thanks
>
> -john


Re: indexing api wrt Analyzer

Posted by John Wang <jo...@gmail.com>.

Hi Grant:

    For our corpus, we don't rely much on idf in the scoring calculation,
so I don't see that being much of a problem.

    About performance: compare instantiating one IndexWriter for a batch of,
say, 1000 docs (iterating over the docs and calling addDocument) with
instantiating and closing 1000 IndexWriters, each doing one addDocument. Are
you saying the expected performance is the same? I thought that when you call
addDocument, the document is added to memory and flushed when a segment needs
to be merged or the writer closes.

    Maybe I am missing something.

Thanks

-john
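
The batching concern above can be illustrated with a toy model. This is only
a sketch of the behaviour as John describes it (documents are buffered in
memory and written out when the buffer fills or the writer closes); it does
not reproduce Lucene's actual flushing or merging logic, and the
BufferedToyWriter class and the buffer size of 100 are assumptions made for
the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: buffers documents and "flushes" when the buffer is full or on close.
final class BufferedToyWriter {
    private final int maxBuffered;
    private final List<String> buffer = new ArrayList<>();
    int flushes = 0;

    BufferedToyWriter(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    // Writes out whatever is buffered; a no-op when the buffer is empty.
    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        buffer.clear();
        flushes++;
    }

    void close() {
        flush();
    }

    // One writer for the whole batch.
    static int flushesWithOneWriter(int docs, int maxBuffered) {
        BufferedToyWriter w = new BufferedToyWriter(maxBuffered);
        for (int i = 0; i < docs; i++) {
            w.addDocument("doc" + i);
        }
        w.close();
        return w.flushes;
    }

    // A fresh writer, opened and closed, for every document.
    static int flushesWithWriterPerDoc(int docs, int maxBuffered) {
        int total = 0;
        for (int i = 0; i < docs; i++) {
            BufferedToyWriter w = new BufferedToyWriter(maxBuffered);
            w.addDocument("doc" + i);
            w.close();
            total += w.flushes;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(flushesWithOneWriter(1000, 100));
        System.out.println(flushesWithWriterPerDoc(1000, 100));
    }
}
```

In this model, closing the writer after every document forces one flush per
document, which is the cost John wants to avoid by keeping a single writer
open for the batch.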

On Thu, Mar 13, 2008 at 11:37 AM, Grant Ingersoll <gs...@apache.org>
wrote:

>
> On Mar 13, 2008, at 11:03 AM, John Wang wrote:
>
> > Yes, but usually it's a good idea to add documents in batches rather
> > than re-instantiating the writer for every document and then closing it.
> >
> > It would be nice if one could specify to the writer which analyzer to
> > use.
> >
> > PerFieldAnalyzerWrapper wouldn't work because different analyzers may
> > apply to the same field depending on the doc, e.g.
> >
>
> Also, I don't know that it is wise to put different langs in the same
> field.  I can't prove it definitively, but it seems to me your corpus
> statistics could be skewed by terms that are spelled the same but have
> different meanings across languages.

Re: indexing api wrt Analyzer

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 13, 2008, at 11:03 AM, John Wang wrote:

> Yes, but usually it's a good idea to add documents in batches rather than
> re-instantiating the writer for every document and then closing it.
>
> It would be nice if one could specify to the writer which analyzer to
> use.
>
> PerFieldAnalyzerWrapper wouldn't work because different analyzers may
> apply to the same field depending on the doc, e.g.
>

Also, I don't know that it is wise to put different langs in the same  
field.  I can't prove it definitively, but it seems to me your corpus  
statistics could be skewed by terms that are spelled the same but have  
different meanings across languages.
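
A small worked example of the skew Grant describes, as a sketch under stated
assumptions: the sample sentences are invented, tokenization is a plain
whitespace split, and idf is the simplified textbook form log(N / df) rather
than the formula of any particular Lucene Similarity. The word "pain" is rare
in the English documents but common in the French ones (where it means
bread), so a shared field blends two quite different statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Toy document-frequency comparison; not Lucene code.
final class SharedFieldDocFreq {
    static final List<String> ENGLISH = Arrays.asList(
            "chronic pain relief",
            "a quiet morning walk",
            "search engine basics",
            "the weather is mild");

    static final List<String> FRENCH = Arrays.asList(
            "le pain est chaud",
            "du pain et du vin",
            "pain au chocolat",
            "la mer est calme");

    // Both languages indexed into one shared field.
    static List<String> mixed() {
        List<String> all = new ArrayList<>(ENGLISH);
        all.addAll(FRENCH);
        return all;
    }

    // Number of documents containing the term (toy whitespace tokenization).
    static int docFreq(List<String> docs, String term) {
        int n = 0;
        for (String doc : docs) {
            String[] tokens = doc.toLowerCase(Locale.ROOT).split("\\s+");
            if (new HashSet<>(Arrays.asList(tokens)).contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Simplified textbook idf: log(N / df). An assumption, not Lucene's exact formula.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        String term = "pain";
        System.out.println("english idf: " + idf(ENGLISH.size(), docFreq(ENGLISH, term)));
        System.out.println("french idf:  " + idf(FRENCH.size(), docFreq(FRENCH, term)));
        System.out.println("shared idf:  " + idf(mixed().size(), docFreq(mixed(), term)));
    }
}
```

With separate fields the term keeps a high idf for English and a low idf for
French; in the shared field it gets a single value in between, which fits
neither language.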


Re: indexing api wrt Analyzer

Posted by John Wang <jo...@gmail.com>.

Yes, but usually it's a good idea to add documents in batches rather than
re-instantiating the writer for every document and then closing it.

It would be nice if one could specify to the writer which analyzer to use.

PerFieldAnalyzerWrapper wouldn't work because different analyzers may apply
to the same field depending on the doc, e.g.

if (field1.name.equals("fr"))
    use FrenchAnalyzer on content field
etc.

-John

On Thu, Mar 13, 2008 at 4:53 AM, Grant Ingersoll <gs...@apache.org>
wrote:

> On IndexWriter, you can pass in the Analyzer when you add a Document,
> thus your application can identify the language, choose the analyzer
> for the given doc, and then add the document
>
> See
> public void addDocument(Document doc, Analyzer analyzer)

Re: indexing api wrt Analyzer

Posted by Grant Ingersoll <gs...@apache.org>.

On IndexWriter, you can pass in the Analyzer when you add a Document, so
your application can identify the language, choose the analyzer for the
given doc, and then add the document.

See
public void addDocument(Document doc, Analyzer analyzer)


On Mar 12, 2008, at 8:40 PM, John Wang wrote:

> Hi all:
>
>    Maybe this has been asked before:
>
>    I am building an index that consists of documents in multiple languages
> (the language is stored as a field), and I use a different analyzer
> depending on the language of the document to be indexed. But the
> IndexWriter takes only a single Analyzer.
>
>    I was hoping to have IndexWriter take an AnalyzerFactory, where the
> AnalyzerFactory produces an Analyzer depending on some criteria of the
> document, e.g. language.
>
>    Maybe I am going about this the wrong way.
>
>    Any suggestions on how to go about it?
>
> Thanks
>
> -John
