Posted to pylucene-dev@lucene.apache.org by Dirk Rothe <d....@semantics.de> on 2016/09/08 08:53:19 UTC

testing PyLucene 6.2

Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Mon, 5 Sep 2016, Dirk Rothe wrote:
>>>  A volunteer is requested to build and test PyLucene's trunk on  
>>> Windows. If no one comes forward, I intend to try to release PyLucene  
>>> 6.2 in a few weeks, still.
>>
>> Nice Job!
>>
>> I've successfully built PyLucene 6.2 on Windows. Most tests pass:
>> * skipped the three test_ICU* due to missing "import icu"
>
> Yes, for this you need to install PyICU: https://github.com/ovalhub/pyicu

I'm going to assume this would work for now.

>> * fixed test_PyLucene.py by ignoring open file handles (os.error) in  
>> shutil.rmtree() in Test_PyLuceneWithFSStore.tearDown()
>
> Do you have a patch for me to apply ?

Yes, attached.
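For reference, the kind of tearDown() change described above can be sketched like this (a hedged sketch only; the actual patch was attached to the list and may differ — the helper name is illustrative):

```python
import os
import shutil
import tempfile

def rmtree_ignoring_open_handles(path):
    """Remove a directory tree, swallowing OSError (os.error is an
    alias of OSError) for files still held open, as can happen on
    Windows while a Directory has not released its handles yet."""
    def on_error(func, p, exc_info):
        exc = exc_info[1]
        if not isinstance(exc, OSError):
            raise exc
        # otherwise ignore: the temp dir is cleaned up later anyway
    shutil.rmtree(path, onerror=on_error)

# usage sketch, roughly what a tearDown() would do:
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "segments"), "w") as f:
    f.write("x")
rmtree_ignoring_open_handles(tmp)
```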

>> * then stuff like these in test_PythonDirectory.py
[..]
> Can't make sense of this one, sorry.
>
>> * and this one in test_PythonException.py
[..]
> This one could be because you may not have built JCC in shared mode ?
> I vaguely remember there being a problem with proper cross-boundary  
> exception propagation requiring JCC to be built in shared mode.

jcc.SHARED reports True, so seems OK.

I don't think these Windows glitches are really problematic, and our  
production code runs only in Linux environments anyway.  
I'm more interested in whether porting around 3 kLOC of Lucene interfaces  
from v3.6 goes smoothly.

I've hit the first problematic case with a custom  
PythonAnalyzer/PythonTokenizer, where I don't see how to pass the input  
to the Tokenizer implementation.  
I thought it might work like this, but PythonTokenizer no longer accepts  
an INPUT argument (it did in v4.10 and v3.6).

class _Tokenizer(PythonTokenizer):
    def __init__(self, INPUT):
        super(_Tokenizer, self).__init__(INPUT)
        # prepare INPUT

    def incrementToken(self):
        # stuff into termAtt/offsetAtt/posIncrAtt

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(_Tokenizer())

The PositionIncrementTestCase is pretty similar but initialized with  
static input. Would be a nice place for an example with dynamic input, I  
think.

This was our 3.6 approach:

class Analyzer3(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        data = data_from_reader(reader)

        class _tokenStream(PythonTokenStream):
            def __init__(self):
                super(_tokenStream, self).__init__()
                # prepare termAtt/offsetAtt/posIncrAtt

            def incrementToken(self):
                # stuff from data into termAtt/offsetAtt/posIncrAtt

        return _tokenStream()

Any hints how to get Analyzer6 working?

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).

Ooh, I forgot about samples. I need to check them and produce a new release 
candidate. Thanks for pointing this out.

Andi..

>
> Shouldn't the need for implementation be optional? I don't understand.
>
> --dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).

I see no patch attached, probably because the Apache mail server strips 
attachments sent to mailing lists. Please either include the patch inline 
or send it to me directly as an attachment.

Thanks !

Andi..

>
> Shouldn't the need for implementation be optional? I don't understand.
>
> --dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Sep 2016, Andi Vajda wrote:

>
> On Fri, 9 Sep 2016, Dirk Rothe wrote:
>
>> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>> 
>>> 
>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>> 
>>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>>> 
>>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>>> think?
>>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>>> or going the wrong way about it.
>>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>>> probably not the first one to transition from 3.x to something more 
>>>>> recent.
>>>>> Please let pylucene-dev@ know what you find out...
>>>> 
>>>> OK.
>>>> 
>>>> Making Analyzer.initReader() python-overridable is also important for 
>>>> use-cases like this: http://stackoverflow.com/a/10290635
>>>> So the patch should be fine independently of my usage/hack.
>>> 
>>> Actually, your patch is not good enough. You need to add an implementation 
>>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>>> (search for createComponents() implementations) otherwise, when 
>>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>>> good, as an aside, if I could make a better error out of that...).
>> 
>> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some 
>> imports there, use patch).
>> 
>> Shouldn't the need for implementation be optional? I don't understand.
>
> Once you define a native method on a class, a native method implementation 
> must be provided. JCC does that but that native implementation just invokes 
> the python implementation on the python subclass instance. If that python 
> subclass has no method implementation, the inherited method is invoked again, 
> which in turn calls the native method again, and so on until the stack 
> overflows.
>
> This could maybe be improved at the JCC level but until it is, a Python
> implementation must be provided. The default initReader() method just returns 
> 'reader' and so should the python default implementation.

I just did this so that I could produce a new RC and restart the voting 
process. I added initReader() and all needed implementations in tests and 
samples.

Thanks !

Andi..

>
> Andi..
>
>> 
>> --dirk
>

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).
>
> Shouldn't the need for implementation be optional? I don't understand.

Once you define a native method on a class, a native method implementation 
must be provided. JCC does that but that native implementation just invokes 
the python implementation on the python subclass instance. If that python 
subclass has no method implementation, the inherited method is invoked 
again, which in turn calls the native method again, and so on until the 
stack overflows.

This could maybe be improved at the JCC level but until it is, a Python
implementation must be provided. The default initReader() method just 
returns 'reader' and so should the python default implementation.
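The recursion described above can be illustrated without JCC at all. This plain-Python sketch (all names illustrative, not the real generated code) mimics a "native" stub that always dispatches back to the Python-level method on the instance:

```python
class NativeBase:
    """Stands in for the JCC-generated wrapper class: its 'native'
    stub just dispatches to the Python implementation on the instance."""
    def initReader(self, fieldName, reader):
        return self.initReader(fieldName, reader)

class BadAnalyzer(NativeBase):
    # no initReader() override: the inherited stub keeps calling itself
    pass

class GoodAnalyzer(NativeBase):
    def initReader(self, fieldName, reader):
        # the default behaviour: return the reader unchanged
        return reader

good = GoodAnalyzer().initReader("body", "a reader")   # resolves to the override

try:
    BadAnalyzer().initReader("body", "a reader")
    overflowed = False
except RecursionError:
    overflowed = True   # the stub recursed until the stack limit
```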

Andi..

>
> --dirk

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.
Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>
>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>
>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>  I've made initReader() python-overridable (see patch). What do you  
>>>> think?
>>>  Not sure what to think. While your change looks fine, if Lucene  
>>> decided to make this 'hard', it may be a sign that you're doing  
>>> something wrong or going the wrong way about it.
>>>  I suggest you ask on the java-user@lucene.apache.org list as you're  
>>> probably not the first one to transition from 3.x to something more  
>>> recent.
>>>  Please let pylucene-dev@ know what you find out...
>>
>> OK.
>>
>> Making Analyzer.initReader() python-overridable is also important for  
>> use-cases like this: http://stackoverflow.com/a/10290635
>> So the patch should be fine independently of my usage/hack.
>
> Actually, your patch is not good enough. You need to add an  
> implementation for initReader() in all the tests that make a subclass of  
> PythonAnalyzer (search for createComponents() implementations)  
> otherwise, when initReader() gets called from Java, you'll get a stack  
> overflow (it'd be good, as an aside, if I could make a better error out  
> of that...).

OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some  
imports there; see patch).

Shouldn't the need for an implementation be optional? I don't understand.

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>> 
>>> I've made initReader() python-overridable (see patch). What do you think?
>> 
>> Not sure what to think. While your change looks fine, if Lucene decided to 
>> make this 'hard', it may be a sign that you're doing something wrong or 
>> going the wrong way about it.
>> 
>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>> probably not the first one to transition from 3.x to something more recent.
>> 
>> Please let pylucene-dev@ know what you find out...
>
> OK.
>
> Making Analyzer.initReader() python-overridable is also important for 
> use-cases like this: http://stackoverflow.com/a/10290635
> So the patch should be fine independently of my usage/hack.

Actually, your patch is not good enough. You need to add an implementation 
for initReader() in all the tests that subclass PythonAnalyzer 
(search for createComponents() implementations); otherwise, when initReader() 
gets called from Java, you'll get a stack overflow (it'd be good, as an 
aside, if I could make a better error out of that...).

Thanks !

Andi..

>
> --dirk
>

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.
Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:

>>>  On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>
>> I've made initReader() python-overridable (see patch). What do you  
>> think?
>
> Not sure what to think. While your change looks fine, if Lucene decided  
> to make this 'hard', it may be a sign that you're doing something wrong  
> or going the wrong way about it.
>
> I suggest you ask on the java-user@lucene.apache.org list as you're  
> probably not the first one to transition from 3.x to something more  
> recent.
>
> Please let pylucene-dev@ know what you find out...

OK.

Making Analyzer.initReader() python-overridable is also important for  
use-cases like this: http://stackoverflow.com/a/10290635
So the patch should be fine independently of my usage/hack.

--dirk


Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 08.09.2016, 11:10 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>>> class _Tokenizer(PythonTokenizer):
>>>  def __init__(self, INPUT):
>>>      super(_Tokenizer, self).__init__(INPUT)
>>>      # prepare INPUT
>>>  def incrementToken(self):
>>>      # stuff into termAtt/offsetAtt/posIncrAtt
>>> 
>>> class Analyzer6(PythonAnalyzer):
>>>  def createComponents(self, fieldName):
>>>      return Analyzer.TokenStreamComponents(_Tokenizer())
>>> 
>>> The PositionIncrementTestCase is pretty similar but initialized with 
>>> static input. Would be a nice place for an example with dynamic input, I 
>>> think.
>>> 
>>> This was our 3.6 approach:
>>> class Analyzer3(PythonAnalyzer):
>>>  def tokenStream(self, fieldName, reader):
>>>     data = data_from_reader(reader)
>>>     class _tokenStream(PythonTokenStream):
>>>         def __init__(self):
>>>              super(_tokenStream, self).__init__()
>>>              # prepare termAtt/offsetAtt/posIncrAtt
>>>         def incrementToken(self):
>>>              # stuff from data into termAtt/offsetAtt/posIncrAtt
>>>    return _tokenStream()
>>> 
>>> Any hints how to get Analyzer6 working?
>> 
>> I've lost track of the countless API changes since 3.x.
>> 
>> The Lucene project does a good job at tracking them in the CHANGES.txt 
>> file, usually pointing at the issue that tracked it, often with examples 
>> about how to accomplish the same in the new way and the rationale behind 
>> the change.
>
> I guess we are here:
> https://issues.apache.org/jira/browse/LUCENE-5388
> https://svn.apache.org/viewvc?view=revision&revision=1556801
>
>> You can also look at the PyLucene tests I just ported to 6.x. For example, 
>> in test_Analyzers.py, you can see that Tokenizer no longer takes a reader 
>> but can be set one with setReader() after construction.
>
> Yes, I've done that pretty carefully. I think, this quote points in the right 
> direction: "The tokenStream method takes a String or Reader and will pass 
> this to Tokenizer#setReader()."
> from: 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E
>
> I've checked the lucene source and this happens automatically and cannot be 
> overridden.
>
> So I've hacked something ugly together which seems to work.
>
> class _Tokenizer(PythonTokenizer):
>   def __init__(self, getReader):
>       super(_Tokenizer, self).__init__()
>       self.getReader = getReader
>       self.i = 0
>       self.data = []
>
>   def incrementToken(self):
>       if self.i == 0:
>           self.data = data_from_reader(self.getReader())
>       if self.i == len(self.data):
>           # we are reused - reset
>           self.i = 0
>           return False
>       # stuff from self.data into termAtt/offsetAtt/posIncrAtt
>       self.i += 1
>       return True
>
> class Analyzer6(PythonAnalyzer):
>   def createComponents(self, fieldName):
>        return Analyzer.TokenStreamComponents(_Tokenizer(lambda: 
> self._reader))
>   def initReader(self, fieldName, reader):
>       # capture reader
>       self._reader = reader
>       return reader
>
> I've made initReader() python-overridable (see patch). What do you think?

Not sure what to think. While your change looks fine, if Lucene decided to 
make this 'hard', it may be a sign that you're doing something wrong or 
going the wrong way about it.

I suggest you ask on the java-user@lucene.apache.org list as you're probably 
not the first one to transition from 3.x to something more recent.

Please let pylucene-dev@ know what you find out...

Andi..

>
> --dirk

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.
Am 08.09.2016, 11:10 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>
>> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:

>> class _Tokenizer(PythonTokenizer):
>>   def __init__(self, INPUT):
>>       super(_Tokenizer, self).__init__(INPUT)
>>       # prepare INPUT
>>   def incrementToken(self):
>>       # stuff into termAtt/offsetAtt/posIncrAtt
>>
>> class Analyzer6(PythonAnalyzer):
>>   def createComponents(self, fieldName):
>>       return Analyzer.TokenStreamComponents(_Tokenizer())
>>
>> The PositionIncrementTestCase is pretty similar but initialized with  
>> static input. Would be a nice place for an example with dynamic input,  
>> I think.
>>
>> This was our 3.6 approach:
>> class Analyzer3(PythonAnalyzer):
>>   def tokenStream(self, fieldName, reader):
>>      data = data_from_reader(reader)
>>      class _tokenStream(PythonTokenStream):
>>          def __init__(self):
>>               super(_tokenStream, self).__init__()
>>               # prepare termAtt/offsetAtt/posIncrAtt
>>          def incrementToken(self):
>>               # stuff from data into termAtt/offsetAtt/posIncrAtt
>>     return _tokenStream()
>>
>> Any hints how to get Analyzer6 working?
>
> I've lost track of the countless API changes since 3.x.
>
> The Lucene project does a good job at tracking them in the CHANGES.txt  
> file, usually pointing at the issue that tracked it, often with examples  
> about how to accomplish the same in the new way and the rationale behind  
> the change.

I guess we are here:
https://issues.apache.org/jira/browse/LUCENE-5388
https://svn.apache.org/viewvc?view=revision&revision=1556801

> You can also look at the PyLucene tests I just ported to 6.x. For  
> example, in test_Analyzers.py, you can see that Tokenizer no longer  
> takes a reader but can be set one with setReader() after construction.

Yes, I've done that pretty carefully. I think this quote points in the  
right direction: "The tokenStream method takes a String or Reader and will  
pass this to Tokenizer#setReader()."
from:  
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E

I've checked the lucene source and this happens automatically and cannot  
be overridden.

So I've hacked something ugly together which seems to work.

class _Tokenizer(PythonTokenizer):
    def __init__(self, getReader):
        super(_Tokenizer, self).__init__()
        self.getReader = getReader
        self.i = 0
        self.data = []

    def incrementToken(self):
        if self.i == 0:
            self.data = data_from_reader(self.getReader())
        if self.i == len(self.data):
            # we are reused - reset
            self.i = 0
            return False
        # stuff from self.data into termAtt/offsetAtt/posIncrAtt
        self.i += 1
        return True

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(_Tokenizer(lambda: self._reader))

    def initReader(self, fieldName, reader):
        # capture reader
        self._reader = reader
        return reader
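As an aside on why the lambda indirection above matters: the analyzer's components are, as far as I can tell, created once per field and reused across documents, while initReader() runs for each new document, so the tokenizer has to look the reader up lazily rather than bind it at construction. A minimal stand-alone illustration of that deferred lookup (illustrative names only, no Lucene involved):

```python
class ReaderBox:
    """Minimal stand-in for Analyzer6: initReader() stores the current
    document's reader on the instance; the tokenizer looks it up lazily."""
    def __init__(self):
        self._reader = None

    def initReader(self, fieldName, reader):
        self._reader = reader
        return reader

box = ReaderBox()
get_reader = lambda: box._reader   # the deferred lookup from the hack

box.initReader("body", "reader-for-doc-1")
first = get_reader()               # sees the current reader
box.initReader("body", "reader-for-doc-2")
second = get_reader()              # sees the new one, same lambda
```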

I've made initReader() python-overridable (see patch). What do you think?

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.
On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Mon, 5 Sep 2016, Dirk Rothe wrote:
>>>> A volunteer is requested to build and test PyLucene's trunk on Windows. 
>>>> If no one comes forward, I intend to try to release PyLucene 6.2 in a few 
>>>> weeks, still.
>>> 
>>> Nice Job!
>>> 
>>> I've successfully built PyLucene 6.2 on Windows. Most tests pass:
>>> * skipped the three test_ICU* due to missing "import icu"
>> 
>> Yes, for this you need to install PyICU: https://github.com/ovalhub/pyicu
>
> I'm going to assume this would work for now.
>
>>> * fixed test_PyLucene.py by ignoring open file handles (os.error) in 
>>> shutil.rmtree() in Test_PyLuceneWithFSStore.tearDown()
>> 
>> Do you have a patch for me to apply ?
>
> Yes, attached.

Thanks, applied.

>>> * then stuff like these in test_PythonDirectory.py
> [..]
>> Can't make sense of this one, sorry.
>> 
>>> * and this one in test_PythonException.py
> [..]
>> This one could be because you may not have built JCC in shared mode ?
>> I vaguely remember there being a problem with proper cross-boundary 
>> exception propagation requiring JCC to be built in shared mode.
>
> jcc.SHARED reports True, so seems OK.
>
> I don't think these Windows glitches are really problematic, and our 
> production code runs only in Linux environments anyway.
> I'm more interested in whether porting around 3 kLOC of Lucene interfaces 
> from v3.6 goes smoothly.
>
> I've hit the first problematic case with a custom 
> PythonAnalyzer/PythonTokenizer, where I don't see how to pass the input to 
> the Tokenizer implementation.
> I thought it might work like this, but PythonTokenizer no longer accepts an 
> INPUT argument (it did in v4.10 and v3.6).
>
> class _Tokenizer(PythonTokenizer):
>   def __init__(self, INPUT):
>       super(_Tokenizer, self).__init__(INPUT)
>       # prepare INPUT
>   def incrementToken(self):
>       # stuff into termAtt/offsetAtt/posIncrAtt
>
> class Analyzer6(PythonAnalyzer):
>   def createComponents(self, fieldName):
>       return Analyzer.TokenStreamComponents(_Tokenizer())
>
> The PositionIncrementTestCase is pretty similar but initialized with static 
> input. Would be a nice place for an example with dynamic input, I think.
>
> This was our 3.6 approach:
> class Analyzer3(PythonAnalyzer):
>   def tokenStream(self, fieldName, reader):
>      data = data_from_reader(reader)
>      class _tokenStream(PythonTokenStream):
>          def __init__(self):
>               super(_tokenStream, self).__init__()
>               # prepare termAtt/offsetAtt/posIncrAtt
>          def incrementToken(self):
>               # stuff from data into termAtt/offsetAtt/posIncrAtt
>     return _tokenStream()
>
> Any hints how to get Analyzer6 working?

I've lost track of the countless API changes since 3.x.

The Lucene project does a good job at tracking them in the CHANGES.txt file, 
usually pointing at the issue that tracked it, often with examples about how 
to accomplish the same in the new way and the rationale behind the change.

You can also look at the PyLucene tests I just ported to 6.x. For example, 
in test_Analyzers.py, you can see that Tokenizer no longer takes a reader 
but can be set one with setReader() after construction.
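In plain Python terms, the shift is from passing the reader at construction to injecting it afterwards. This sketch only mirrors the shape of the API change; the class names are illustrative, not the actual Lucene classes:

```python
class Tokenizer3x:
    """3.x style: the input reader is a constructor argument."""
    def __init__(self, reader):
        self.reader = reader

class Tokenizer6x:
    """6.x style: construct first, attach the input afterwards."""
    def __init__(self):
        self.reader = None

    def setReader(self, reader):
        self.reader = reader

old = Tokenizer3x("some input")

new = Tokenizer6x()
new.setReader("some input")   # construction and input now decoupled
```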

Andi..

>
> --dirk