Posted to java-user@lucene.apache.org by Carsten Schnober <sc...@ids-mannheim.de> on 2012/11/19 17:44:38 UTC

TokenStreamComponents in Lucene 4.0

Hi,
I have recently updated to Lucene 4.0 but am having problems with my
custom Analyzer/Tokenizer.

In the days of Lucene 3.6, it would work like this (sketched in code below the list):

0. define constants lucene_version and indexdir
1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer)
2. create an IndexWriterConfig: config = new IndexWriterConfig(lucene_version, analyzer)
3. create an IndexWriter: writer = new IndexWriter(indexdir, config)
4. for each document:
4.1. create a Document: Document doc = new Document()
4.2. create a Field: Field field = new Field("text", layerFile, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS);
4.3. add field to document: doc.add(field)
4.4. add document to writer: writer.addDocument(doc)
5. close the writer (write to disk)
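
In code, that amounted to roughly the following (a sketch against the
Lucene 3.6 API; indexdir and lucene_version are the constants from step
0, and layerFiles stands in here for an assumed List<String> with one
entry per document):

---------------------------------------------------------------

Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(indexdir, config);

for (String layerFile : layerFiles) {
  Document doc = new Document();
  Field field = new Field("text", layerFile, Field.Store.YES,
      Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS);
  doc.add(field);
  writer.addDocument(doc); // analysis happens here
}
writer.close(); // step 5: write to disk

---------------------------------------------------------------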

However, after switching to Lucene 4 and TokenStreamComponents, I'm
getting a strange behaviour: only the first document in the collection
is tokenized properly. The others do appear in the index, but
un-tokenized, although I have tried not to change anything in the logic.
The Analyzer now has this createComponents() method calling the custom
TokenStreamComponents class with my custom Tokenizer:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
  try {
    source.close();
  } catch (IOException e) {
    jlog.error(e.getLocalizedMessage());
    e.printStackTrace();
  }
  return tokenstream;
}


The custom TokenStreamComponents class uses this constructor:

public KoraTokenStreamComponents(Tokenizer tokenizer) {
  super(tokenizer);
  try {
    tokenizer.reset();
  } catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
  }
}


Since I have not changed anything in the Tokenizer, I suspect the
error lies in the new class KoraTokenStreamComponents. This may be
because I do not fully understand why the TokenStreamComponents class
has been introduced.
Any hints on that? Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform


Re: TokenStreamComponents in Lucene 4.0

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi again,
just a little update:

> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:
> 
> @Override
> protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>   final Tokenizer source = new KoraTokenizer(reader);
>   final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
>   try {
>     source.close();
>   } catch (IOException e) {
>     jlog.error(e.getLocalizedMessage());
>     e.printStackTrace();
>   }
>   return tokenstream;
> }

When using the packaged Analyzer.TokenStreamComponents class instead of
my custom KoraTokenStreamComponents class, the behaviour does not seem
to change:

-  final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
+  final TokenStreamComponents tokenstream = new TokenStreamComponents(source);

Best,
Carsten




Re: TokenStreamComponents in Lucene 4.0

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Nov 20, 2012 at 6:26 AM, Carsten Schnober
<sc...@ids-mannheim.de> wrote:

>
> Thanks, Uwe!
> I think what changed in comparison to Lucene 3.6 is that reset() is
> called upon initialization, too, instead of after processing the first
> document only, right?


There is no such change: this step was always mandatory!

Re: TokenStreamComponents in Lucene 4.0

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 20.11.2012 10:22, Uwe Schindler wrote:

Hi,

> The createComponents() method of Analyzers is only called *once* for each thread and the TokenStream is *reused* for later documents. The Analyzer will call the final method Tokenizer#setReader() to notify the Tokenizer of a new Reader (this method will update the protected "input" field in the Tokenizer base class) and then it will reset() the whole tokenization chain. The custom TokenStream components must "initialize" themselves with the new settings in the reset() method.

Thanks, Uwe!
I think what changed in comparison to Lucene 3.6 is that reset() is
called upon initialization, too, instead of only after processing the
first document, right? Apart from the fact that it used not to be
obligatory to make all components reusable, I suppose.
Best,
Carsten



RE: TokenStreamComponents in Lucene 4.0

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

all the components of your TokenStream in Lucene 4.0 are *required* to be reusable; see the documentation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Analyzer.html

All your components must implement reset() according to the TokenStream contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
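
The consumer workflow that contract describes is, in a minimal sketch
(assuming an Analyzer instance "analyzer", an input String "text", and
the usual imports for TokenStream, CharTermAttribute, and StringReader):

---------------------------------------------------------------

TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();                      // mandatory before the first incrementToken()
while (ts.incrementToken()) {
  System.out.println(term.toString());
}
ts.end();                        // finalize offset/position state
ts.close();

---------------------------------------------------------------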

The createComponents() method of Analyzers is only called *once* for each thread and the TokenStream is *reused* for later documents. The Analyzer will call the final method Tokenizer#setReader() to notify the Tokenizer of a new Reader (this method will update the protected "input" field in the Tokenizer base class) and then it will reset() the whole tokenization chain. The custom TokenStream components must "initialize" themselves with the new settings in the reset() method.
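
In code, a reuse-safe setup looks roughly like this (a sketch only:
KoraTokenizer's internals are not shown in this thread, so the
per-document state below is illustrative):

---------------------------------------------------------------

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public final class KoraAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Build the chain once per thread; do not close() or reset() it here.
    // The Analyzer calls setReader() and reset() before every document.
    return new TokenStreamComponents(new KoraTokenizer(reader));
  }
}

final class KoraTokenizer extends Tokenizer {
  private int pos; // illustrative per-document state

  KoraTokenizer(Reader input) {
    super(input);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pos = 0; // re-initialize for the new "input" Reader
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    // ... read from "input", fill the attributes, return true per token ...
    return false;
  }
}

---------------------------------------------------------------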

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schnober@ids-mannheim.de]
> Sent: Tuesday, November 20, 2012 10:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: TokenStreamComponents in Lucene 4.0
>
> After some debugging, it turns out that the Analyzer method
> createComponents() is called only once, for the first document. This
> seems to be the problem: the other documents are just not analyzed.
> [...]




Re: TokenStreamComponents in Lucene 4.0

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,

> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This
seems to be the problem: the other documents are just not analyzed.
Here's the loop that creates the fields and supposedly calls the
analyzer. Does anyone have a hint why this happens only for the
first document? The loop itself runs once for every document:

---------------------------------------------------------------

List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
  luceneDocument = new Document();

  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(), Field.Store.YES);

  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile, Field.Store.YES);

  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);

  try {
    writer.addDocument(luceneDocument);
  } catch (IOException e) {
    jlog.error("Error adding document "+doc.getDocid()+":\n"+e.getLocalizedMessage());
  }
}
[...]
writer.close();
-------------------------------------------------------------------

The class de.ids_mannheim.korap.main.Document defines our own document
objects from which the relevant information can be read, as shown in
the loop. The list 'documents' is filled in by an intermediately called
method.
Best,
Carsten
