
Posted to pylucene-dev@lucene.apache.org by Andi Vajda <va...@apache.org> on 2016/09/03 16:34:58 UTC

Re: [POLL] What should happen to PyLucene now?

On Mon, 22 Aug 2016, Andi Vajda wrote:

>
> On Sun, 10 Jul 2016, Andi Vajda wrote:
>
>> Thank you Jan for starting this thread !
>> 
>> Of the nine people that responded, three were interested in a new 6.x 
>> release, with two offering to help make a new release happen.
>> 
>> A couple of others showed interest in JCC only.
>> 
>> Here is what I can propose:
>>  1. I can make sure PyLucene can be built from Lucene 6.x and runs.
>
> PyLucene can now be built from Lucene's branch 6.x, on Mac OS X.
> It builds, loads, can run a couple of simple tests like test_Binary.py and
> test_BinaryDocument.py.
>
> Here is how one can reproduce what I just did:
>  - cd ~/apache
>  - git clone --branch branch_6x https://github.com/apache/lucene-solr.git lucene.6x
>  - cd <pylucene dir>
>  - svn update
>    make sure you have a modern setuptools (if you are on linux, the
>    setuptools patching done by JCC to be able to build a plain shared
>    library most likely needs to be refreshed or maybe even eliminated).
>  - _install/bin/pip uninstall setuptools
>  - _install/bin/pip install setuptools
>  - cd jcc
>  - ../_install/bin/python setup.py build install
>  - cd ..
>  - make sources (this copies the lucene tree from the github tree cloned)
>  - make compile install
>
> If all worked, you can then:
>  - _install/bin/python
>  >>> import lucene
>  >>> lucene.initVM()
>  - _install/bin/python test/test_Binary.py
>
> I have a Python virtual env installed in pylucene/_install, this helps with 
> keeping different versions of software separate.
>
>>  2. Volunteers should then help in porting old 4.x tests, if they still
>>     apply, and import new tests from the current Lucene suite as they see
>>     fit.
>
> All other tests need to be carefully ported to match all the numerous API 
> changes and disappeared classes. For similar reasons, the extensions jar does 
> not build and is not currently included in the build. Its source java classes 
> need to be refreshed as tests get refreshed to 6.x.

PyLucene now builds and passes all its tests on Mac OS X and Linux.
It is thus in a state where a release candidate could be built and submitted 
for review.

A volunteer is requested to build and test PyLucene's trunk on Windows. If 
no one comes forward, I intend to try to release PyLucene 6.2 in a few weeks, 
still.

Thanks !

Andi..

>
> Andi..
>
>>  3. Once everyone involved is happy with test coverage (which was never
>>     exhaustive and need not be), a new release can be rolled and the
>>     Lucene PMC put to contribution again for votes.
>> 
>> If any of these steps ends up stalling, no new release happens and the 
>> PyLucene subproject gets shut down, eventually.
>> 
>> As for JCC, regardless of what happens to PyLucene itself, I'd very much 
>> like to port it to Python 3. I've already done this once, the port is 
>> available in a branch [1]. It 'just' needs to be refreshed. I intend to 
>> eventually get to this, unless someone with a stronger itch beats me to it.
>> 
>> Andi..
>> 
>> [1] http://svn.apache.org/repos/asf/lucene/pylucene/branches/python_3/jcc/
>> 
>> 
>> On Sat, 2 Jul 2016, Aric Coady wrote:
>> 
>>> [X]  I'll help make a new release happen, if I get some help!
>>> 
>>>> On Jul 1, 2016, at 9:35 AM, Alexander Yaworsky 
>>>> <al...@gmail.com> wrote:
>>>> 
>>>> Well, this bothered me (not a dev but fixed some of your bugs locally
>>>> long long ago, why didn't send patches is another story). Here's my
>>>> opinion, as a user. 1. Be in sync with lucene is a must. 2. Be in sync
>>>> with python is a must. Therefore,
>>> 
>>> And +1 on staying current with lucene and python.
>>> 
>>>>> Question: What should happen to PyLucene now?
>>>>> 
>>>>> [ ]  I'm happy with the last 4.x release, no need for new releases
>>>>> [ ]  Please, a new 6.x release (but I can't contribute)
>>>>> [ ]  I'll help make a new release happen, if I get some help!
>>>>> [X]  Only care about the JCC part
>>>>> [X]  Close down the sub project -- IF YOU ARE UNABLE TO MAINTAIN
>>>>> [ ]  Don't care. I'm no longer a user
>>>>> [X]  Other: Move JCC to P3
>>>> 
>>>> Actually, the brilliant part of this project is JCC. In a company I
>>>> work for we still use it to utilize Java libraries from python. This
>>>> is the fastest solution and this sub-project must exist separately
>>>> imo. We do not use Lucene since 00's btw.
>>>> 
>>>> Thanks.
>>>> 
>>>> Alexander.
>>> 
>>> 
>> 
>

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).

Ooh, I forgot about samples. I need to check them and produce a new release 
candidate. Thanks for pointing this out.

Andi..

>
> Shouldn't the need for implementation be optional? I don't understand.
>
> --dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).

I see no patch attached probably because the apache mail server strips 
attachments sent to mailing lists. Either you include the patch inline or 
you send it to me as an attachment directly.

Thanks !

Andi..

>
> Shouldn't the need for implementation be optional? I don't understand.
>
> --dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.


On Fri, 9 Sep 2016, Andi Vajda wrote:

>
> On Fri, 9 Sep 2016, Dirk Rothe wrote:
>
>> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>> 
>>> 
>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>> 
>>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>>> 
>>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>>> think?
>>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>>> or going the wrong way about it.
>>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>>> probably not the first one to transition from 3.x to something more 
>>>>> recent.
>>>>> Please let pylucene-dev@ know what you find out...
>>>> 
>>>> OK.
>>>> 
>>>> Making Analyzer.initReader() python-overridable is also important for 
>>>> use-cases like this: http://stackoverflow.com/a/10290635
>>>> So the patch should be fine independently of my usage/hack.
>>> 
>>> Actually, your patch is not good enough. You need to add an implementation 
>>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>>> (search for createComponents() implementations) otherwise, when 
>>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>>> good, as an aside, if I could make a better error out of that...).
>> 
>> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some 
>> imports there, use patch).
>> 
>> Shouldn't the need for implementation be optional? I don't understand.
>
> Once you define a native method on a class, a native method implementation 
> must be provided. JCC does that but that native implementation just invokes 
> the python implementation on the python subclass instance. If that python 
> subclass has no method implementation, the inherited method is invoked again, 
> which in turn calls the native method again, and so on until the stack 
> overflows.
>
> This could maybe be improved at the JCC level but until it is, a Python
> implementation must be provided. The default initReader() method just returns 
> 'reader' and so should the python default implementation.

I just did this so that I could produce a new RC and restart the voting 
process. I added initReader() and all needed implementations in tests and 
samples.

Thanks !

Andi..

>
> Andi..
>
>> 
>> --dirk
>

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Sep 2016, Dirk Rothe wrote:

> Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>> 
>>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>> I've made initReader() python-overridable (see patch). What do you 
>>>>> think?
>>>> Not sure what to think. While your change looks fine, if Lucene decided 
>>>> to make this 'hard', it may be a sign that you're doing something wrong 
>>>> or going the wrong way about it.
>>>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>>>> probably not the first one to transition from 3.x to something more 
>>>> recent.
>>>> Please let pylucene-dev@ know what you find out...
>>> 
>>> OK.
>>> 
>>> Making Analyzer.initReader() python-overridable is also important for 
>>> use-cases like this: http://stackoverflow.com/a/10290635
>>> So the patch should be fine independently of my usage/hack.
>> 
>> Actually, your patch is not good enough. You need to add an implementation 
>> for initReader() in all the tests that make a subclass of PythonAnalyzer 
>> (search for createComponents() implementations) otherwise, when 
>> initReader() gets called from Java, you'll get a stack overflow (it'd be 
>> good, as an aside, if I could make a better error out of that...).
>
> OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some imports 
> there, use patch).
>
> Shouldn't the need for implementation be optional? I don't understand.

Once you define a native method on a class, a native method implementation 
must be provided. JCC does that but that native implementation just invokes 
the python implementation on the python subclass instance. If that python 
subclass has no method implementation, the inherited method is invoked 
again, which in turn calls the native method again, and so on until the 
stack overflows.

This could maybe be improved at the JCC level but until it is, a Python
implementation must be provided. The default initReader() method just 
returns 'reader' and so should the python default implementation.
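
The loop described above can be reproduced without Lucene or JCC. The following is only a sketch in plain Python: JavaAnalyzerStandIn, call_from_java and the two subclasses are hypothetical names invented for illustration, and the "native" forwarding is simulated with getattr(), which is an assumption about the shape of the mechanism rather than how JCC is actually written.

```python
class JavaAnalyzerStandIn(object):
    """Hypothetical stand-in for the Java-side class with a native method."""

    def initReader(self, fieldName, reader):
        # Plays the role of the native implementation: it only forwards
        # the call to whatever the Python instance provides.
        return self._call_python("initReader", fieldName, reader)

    def _call_python(self, name, *args):
        # Method lookup on the Python instance. Without an override this
        # finds the inherited initReader() above, which forwards here
        # again, and so on until the interpreter gives up.
        return getattr(self, name)(*args)


def call_from_java(analyzer, fieldName, reader):
    # Java always enters through the native method.
    return JavaAnalyzerStandIn.initReader(analyzer, fieldName, reader)


class WithoutOverride(JavaAnalyzerStandIn):
    pass


class WithOverride(JavaAnalyzerStandIn):
    def initReader(self, fieldName, reader):
        # Same behaviour as the default described above: return the reader.
        return reader


def overflows(analyzer):
    try:
        call_from_java(analyzer, "field", "reader")
    except RuntimeError:
        # Python reports runaway recursion as a RuntimeError
        # (RecursionError in Python 3 is a subclass of it).
        return True
    return False
```

In this sketch overflows(WithoutOverride()) reports the runaway recursion, while the subclass that defines initReader() returns the reader unchanged. Python stops the recursion with an exception; in the real Java/C++ case there is no such guard, hence the stack overflow.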

Andi..

>
> --dirk

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.

Am 09.09.2016, 00:29 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>
>> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>>
>>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>>>  I've made initReader() python-overridable (see patch). What do you  
>>>> think?
>>>  Not sure what to think. While your change looks fine, if Lucene  
>>> decided to make this 'hard', it may be a sign that you're doing  
>>> something wrong or going the wrong way about it.
>>>  I suggest you ask on the java-user@lucene.apache.org list as you're  
>>> probably not the first one to transition from 3.x to something more  
>>> recent.
>>>  Please let pylucene-dev@ know what you find out...
>>
>> OK.
>>
>> Making Analyzer.initReader() python-overridable is also important for  
>> use-cases like this: http://stackoverflow.com/a/10290635
>> So the patch should be fine independently of my usage/hack.
>
> Actually, your patch is not good enough. You need to add an  
> implementation for initReader() in all the tests that make a subclass of  
> PythonAnalyzer (search for createComponents() implementations)  
> otherwise, when initReader() gets called from Java, you'll get a stack  
> overflow (it'd be good, as an aside, if I could make a better error out  
> of that...).

OK, I see the effect in samples/PorterStemmerAnalyzer.py (fixed some  
imports there, use patch).

Shouldn't the need for implementation be optional? I don't understand.

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>>>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>> 
>>> I've made initReader() python-overridable (see patch). What do you think?
>> 
>> Not sure what to think. While your change looks fine, if Lucene decided to 
>> make this 'hard', it may be a sign that you're doing something wrong or 
>> going the wrong way about it.
>> 
>> I suggest you ask on the java-user@lucene.apache.org list as you're 
>> probably not the first one to transition from 3.x to something more recent.
>> 
>> Please let pylucene-dev@ know what you find out...
>
> OK.
>
> Making Analyzer.initReader() python-overridable is also important for 
> use-cases like this: http://stackoverflow.com/a/10290635
> So the patch should be fine independently of my usage/hack.

Actually, your patch is not good enough. You need to add an implementation 
for initReader() in all the tests that make a subclass of PythonAnalyzer 
(search for createComponents() implementations) otherwise, when initReader() 
gets called from Java, you'll get a stack overflow (it'd be good, as an 
aside, if I could make a better error out of that...).

Thanks !

Andi..

>
> --dirk
>

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.

Am 08.09.2016, 15:56 Uhr, schrieb Andi Vajda <va...@apache.org>:

>>>  On Thu, 8 Sep 2016, Dirk Rothe wrote:
>>
>> I've made initReader() python-overridable (see patch). What do you  
>> think?
>
> Not sure what to think. While your change looks fine, if Lucene decided  
> to make this 'hard', it may be a sign that you're doing something wrong  
> or going the wrong way about it.
>
> I suggest you ask on the java-user@lucene.apache.org list as you're  
> probably not the first one to transition from 3.x to something more  
> recent.
>
> Please let pylucene-dev@ know what you find out...

OK.

Making Analyzer.initReader() python-overridable is also important for  
use-cases like this: http://stackoverflow.com/a/10290635
So the patch should be fine independently of my usage/hack.

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 08.09.2016, 11:10 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>> 
>>> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>>> class _Tokenizer(PythonTokenizer):
>>>  def __init__(self, INPUT):
>>>      super(_Tokenizer, self).__init__(INPUT)
>>>      # prepare INPUT
>>>  def incrementToken(self):
>>>      # stuff into termAtt/offsetAtt/posIncrAtt
>>> 
>>> class Analyzer6(PythonAnalyzer):
>>>  def createComponents(self, fieldName):
>>>      return Analyzer.TokenStreamComponents(_Tokenizer())
>>> 
>>> The PositionIncrementTestCase is pretty similar but initialized with 
>>> static input. Would be a nice place for an example with dynamic input, I 
>>> think.
>>> 
>>> This was our 3.6 approach:
>>> class Analyzer3(PythonAnalyzer):
>>>  def tokenStream(self, fieldName, reader):
>>>     data = data_from_reader(reader)
>>>     class _tokenStream(PythonTokenStream):
>>>         def __init__(self):
>>>              super(_tokenStream, self).__init__()
>>>              # prepare termAtt/offsetAtt/posIncrAtt
>>>         def incrementToken(self):
>>>              # stuff from data into termAtt/offsetAtt/posIncrAtt
>>>     return _tokenStream()
>>> 
>>> Any hints how to get Analyzer6 working?
>> 
>> I've lost track of the countless API changes since 3.x.
>> 
>> The Lucene project does a good job at tracking them in the CHANGES.txt 
>> file, usually pointing at the issue that tracked it, often with examples 
>> about how to accomplish the same in the new way and the rationale behind 
>> the change.
>
> I guess we are here:
> https://issues.apache.org/jira/browse/LUCENE-5388
> https://svn.apache.org/viewvc?view=revision&revision=1556801
>
>> You can also look at the PyLucene tests I just ported to 6.x. For example, 
>> in test_Analyzers.py, you can see that Tokenizer no longer takes a reader 
>> but can be set one with setReader() after construction.
>
> Yes, I've done that pretty carefully. I think, this quote points in the right 
> direction: "The tokenStream method takes a String or Reader and will pass 
> this to Tokenizer#setReader()."
> from: 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E
>
> I've checked the Lucene source and this happens automatically and cannot be 
> overridden.
>
> So I've hacked something ugly together which seems to work.
>
> class _Tokenizer(PythonTokenizer):
>   def __init__(self, getReader):
>       super(_Tokenizer, self).__init__()
>       self.getReader = getReader
>       self.i = 0
>       self.data = []
>
>   def incrementToken(self):
>       if self.i == 0:
>           self.data = data_from_reader(self.getReader())
>       if self.i == len(self.data):
>           # we are reused - reset
>           self.i = 0
>           return False
>       # stuff from self.data into termAtt/offsetAtt/posIncrAtt
>       self.i += 1
>       return True
>
> class Analyzer6(PythonAnalyzer):
>   def createComponents(self, fieldName):
>       return Analyzer.TokenStreamComponents(_Tokenizer(lambda: self._reader))
>   def initReader(self, fieldName, reader):
>       # capture reader
>       self._reader = reader
>       return reader
>
> I've made initReader() python-overridable (see patch). What do you think?

Not sure what to think. While your change looks fine, if Lucene decided to 
make this 'hard', it may be a sign that you're doing something wrong or 
going the wrong way about it.

I suggest you ask on the java-user@lucene.apache.org list as you're probably 
not the first one to transition from 3.x to something more recent.

Please let pylucene-dev@ know what you find out...

Andi..

>
> --dirk

Re: testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.

Am 08.09.2016, 11:10 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Thu, 8 Sep 2016, Dirk Rothe wrote:
>
>> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:

>> class _Tokenizer(PythonTokenizer):
>>   def __init__(self, INPUT):
>>       super(_Tokenizer, self).__init__(INPUT)
>>       # prepare INPUT
>>   def incrementToken(self):
>>       # stuff into termAtt/offsetAtt/posIncrAtt
>>
>> class Analyzer6(PythonAnalyzer):
>>   def createComponents(self, fieldName):
>>       return Analyzer.TokenStreamComponents(_Tokenizer())
>>
>> The PositionIncrementTestCase is pretty similar but initialized with  
>> static input. Would be a nice place for an example with dynamic input,  
>> I think.
>>
>> This was our 3.6 approach:
>> class Analyzer3(PythonAnalyzer):
>>   def tokenStream(self, fieldName, reader):
>>      data = data_from_reader(reader)
>>      class _tokenStream(PythonTokenStream):
>>          def __init__(self):
>>               super(_tokenStream, self).__init__()
>>               # prepare termAtt/offsetAtt/posIncrAtt
>>          def incrementToken(self):
>>               # stuff from data into termAtt/offsetAtt/posIncrAtt
>>      return _tokenStream()
>>
>> Any hints how to get Analyzer6 working?
>
> I've lost track of the countless API changes since 3.x.
>
> The Lucene project does a good job at tracking them in the CHANGES.txt  
> file, usually pointing at the issue that tracked it, often with examples  
> about how to accomplish the same in the new way and the rationale behind  
> the change.

I guess we are here:
https://issues.apache.org/jira/browse/LUCENE-5388
https://svn.apache.org/viewvc?view=revision&revision=1556801

> You can also look at the PyLucene tests I just ported to 6.x. For  
> example, in test_Analyzers.py, you can see that Tokenizer no longer  
> takes a reader but can be set one with setReader() after construction.

Yes, I've done that pretty carefully. I think, this quote points in the  
right direction: "The tokenStream method takes a String or Reader and will  
pass this to Tokenizer#setReader()."
from:  
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E

I've checked the Lucene source and this happens automatically and cannot be  
overridden.

So I've hacked something ugly together which seems to work.

class _Tokenizer(PythonTokenizer):
     def __init__(self, getReader):
         super(_Tokenizer, self).__init__()
         self.getReader = getReader
         self.i = 0
         self.data = []

     def incrementToken(self):
         if self.i == 0:
             self.data = data_from_reader(self.getReader())
         if self.i == len(self.data):
             # we are reused - reset
             self.i = 0
             return False
         # stuff from self.data into termAtt/offsetAtt/posIncrAtt
         self.i += 1
         return True

class Analyzer6(PythonAnalyzer):
     def createComponents(self, fieldName):
         return Analyzer.TokenStreamComponents(_Tokenizer(lambda: self._reader))
     def initReader(self, fieldName, reader):
         # capture reader
         self._reader = reader
         return reader

I've made initReader() python-overridable (see patch). What do you think?
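
The capture-and-reuse logic above can be exercised without Lucene. What follows is a sketch under stated assumptions: ReusableTokenizerSketch and AnalyzerSketch are plain Python stand-ins (not PyLucene classes), data_from_reader() is given a made-up whitespace-splitting body because the thread never shows the real one, and the "term" attribute replaces termAtt/offsetAtt/posIncrAtt.

```python
import io


def data_from_reader(reader):
    # Hypothetical body; the real helper is not shown in the thread.
    return reader.read().split()


class ReusableTokenizerSketch(object):
    def __init__(self, getReader):
        self.getReader = getReader   # callable returning the current reader
        self.i = 0
        self.data = []
        self.term = None

    def incrementToken(self):
        if self.i == 0:
            # First call for this input: read from the captured reader.
            self.data = data_from_reader(self.getReader())
        if self.i == len(self.data):
            # Input exhausted; reset so the instance can be reused.
            self.i = 0
            return False
        self.term = self.data[self.i]
        self.i += 1
        return True


class AnalyzerSketch(object):
    def __init__(self):
        self._reader = None
        # The lambda defers the lookup until the tokenizer needs the reader.
        self.tokenizer = ReusableTokenizerSketch(lambda: self._reader)

    def initReader(self, fieldName, reader):
        # Capture the reader, then hand it back unchanged.
        self._reader = reader
        return reader

    def analyze(self, fieldName, text):
        self.initReader(fieldName, io.StringIO(text))
        out = []
        while self.tokenizer.incrementToken():
            out.append(self.tokenizer.term)
        return out
```

Calling analyze() several times on one AnalyzerSketch shows that the counter reset lets the same tokenizer instance serve successive inputs, including an empty one.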

--dirk

Re: testing PyLucene 6.2

Posted by Andi Vajda <va...@apache.org>.

On Thu, 8 Sep 2016, Dirk Rothe wrote:

> Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:
>
>> 
>> On Mon, 5 Sep 2016, Dirk Rothe wrote:
>>>> A volunteer is requested to build and test PyLucene's trunk on Windows. 
>>>> If no one comes forward, I intend to try to release PyLucene 6.2 in a few 
>>>> weeks, still.
>>> 
>>> Nice Job!
>>> 
>>> I've successfully built PyLucene 6.2 on Windows. Most tests pass:
>>> * skipped the three test_ICU* due to missing "import icu"
>> 
>> Yes, for this you need to install PyICU: https://github.com/ovalhub/pyicu
>
> I'm going to assume this would work for now.
>
>>> * fixed test_PyLucene.py by ignoring open file handles (os.error) in 
>>> shutil.rmtree() in Test_PyLuceneWithFSStore.tearDown()
>> 
>> Do you have a patch for me to apply ?
>
> Yes, attached.

Thanks, applied.
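
The patch itself is not reproduced in this archive. As a rough idea of what "ignoring open file handles (os.error) in shutil.rmtree()" can look like, here is a sketch; remove_store_dir is a hypothetical helper name, and the applied change to Test_PyLuceneWithFSStore.tearDown() may well differ.

```python
import os
import shutil
import tempfile


def remove_store_dir(path):
    """Remove a test index directory, tolerating files that are still open."""
    try:
        shutil.rmtree(path)
    except OSError:
        # os.error is an alias of OSError. On Windows a file that is still
        # open cannot be deleted, so whatever remains is left behind.
        return False
    return True


# Usage: create a throwaway directory with one file, then remove it.
store = tempfile.mkdtemp()
open(os.path.join(store, "segments"), "w").close()
removed = remove_store_dir(store)
```

Swallowing the error keeps the test from failing during cleanup, at the cost of possibly leaving a temporary directory on disk.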

>>> * then stuff like these in test_PythonDirectory.py
> [..]
>> Can't make sense of this one, sorry.
>> 
>>> * and this one in test_PythonException.py
> [..]
>> This one could be because you may not have built JCC in shared mode ?
>> I vaguely remember there being a problem with proper cross-boundary 
>> exception propagation requiring JCC to be built in shared mode.
>
> jcc.SHARED reports True, so seems OK.
>
> I don't think these Windows glitches are really problematic, and our 
> production code runs only in linux environments anyway.
> And I'm more interested in whether porting around 3kloc lucene-interfaces 
> from v3.6 goes smoothly.
>
> I've hit the first problematic case with a custom 
> PythonAnalyzer/PythonTokenizer where I don't see how to pass the input to the 
> Tokenizer implementation.
> I thought maybe like this, but PythonTokenizer does not accept an INPUT 
> anymore (available in v4.10 and v3.6).
>
> class _Tokenizer(PythonTokenizer):
>   def __init__(self, INPUT):
>       super(_Tokenizer, self).__init__(INPUT)
>       # prepare INPUT
>   def incrementToken(self):
>       # stuff into termAtt/offsetAtt/posIncrAtt
>
> class Analyzer6(PythonAnalyzer):
>   def createComponents(self, fieldName):
>       return Analyzer.TokenStreamComponents(_Tokenizer())
>
> The PositionIncrementTestCase is pretty similar but initialized with static 
> input. Would be a nice place for an example with dynamic input, I think.
>
> This was our 3.6 approach:
> class Analyzer3(PythonAnalyzer):
>   def tokenStream(self, fieldName, reader):
>      data = data_from_reader(reader)
>      class _tokenStream(PythonTokenStream):
>          def __init__(self):
>               super(_tokenStream, self).__init__()
>               # prepare termAtt/offsetAtt/posIncrAtt
>          def incrementToken(self):
>               # stuff from data into termAtt/offsetAtt/posIncrAtt
>      return _tokenStream()
>
> Any hints how to get Analyzer6 working?

I've lost track of the countless API changes since 3.x.

The Lucene project does a good job at tracking them in the CHANGES.txt file, 
usually pointing at the issue that tracked it, often with examples about how 
to accomplish the same in the new way and the rationale behind the change.

You can also look at the PyLucene tests I just ported to 6.x. For example, 
in test_Analyzers.py, you can see that Tokenizer no longer takes a reader 
but can be set one with setReader() after construction.
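
To make the order of calls concrete, here is a sketch of that lifecycle in plain Python. SketchTokenizer is a stand-in written for this illustration, not a PyLucene class; it only borrows the method names setReader(), reset() and incrementToken(), and the whitespace splitting and the "term" attribute are assumptions.

```python
import io


class SketchTokenizer(object):
    """Stand-in tokenizer whose input is supplied after construction."""

    def __init__(self):
        # No reader argument: the tokenizer starts without any input.
        self.reader = None
        self.term = None
        self._words = iter(())

    def setReader(self, reader):
        # Input is attached after construction, once per text to analyze.
        self.reader = reader

    def reset(self):
        # Prepare to consume the reader that was just set.
        self._words = iter(self.reader.read().split())

    def incrementToken(self):
        try:
            self.term = next(self._words)
        except StopIteration:
            return False
        return True


def tokens(tokenizer, text):
    # One instance, many inputs: set the reader, reset, then iterate.
    tokenizer.setReader(io.StringIO(text))
    tokenizer.reset()
    out = []
    while tokenizer.incrementToken():
        out.append(tokenizer.term)
    return out
```

The same SketchTokenizer instance can be passed to tokens() repeatedly with different texts, which is the point of moving the reader out of the constructor.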

Andi..

>
> --dirk

testing PyLucene 6.2

Posted by Dirk Rothe <d....@semantics.de>.

Am 05.09.2016, 21:27 Uhr, schrieb Andi Vajda <va...@apache.org>:

>
> On Mon, 5 Sep 2016, Dirk Rothe wrote:
>>>  A volunteer is requested to build and test PyLucene's trunk on  
>>> Windows. If no one comes forward, I intend to try to release PyLucene  
>>> 6.2 in a few weeks, still.
>>
>> Nice Job!
>>
>> I've successfully built PyLucene 6.2 on Windows. Most tests pass:
>> * skipped the three test_ICU* due to missing "import icu"
>
> Yes, for this you need to install PyICU: https://github.com/ovalhub/pyicu

I'm going to assume this would work for now.

>> * fixed test_PyLucene.py by ignoring open file handles (os.error) in  
>> shutil.rmtree() in Test_PyLuceneWithFSStore.tearDown()
>
> Do you have a patch for me to apply ?

Yes, attached.

>> * then stuff like these in test_PythonDirectory.py
[..]
> Can't make sense of this one, sorry.
>
>> * and this one in test_PythonException.py
[..]
> This one could be because you may not have built JCC in shared mode ?
> I vaguely remember there being a problem with proper cross-boundary  
> exception propagation requiring JCC to be built in shared mode.

jcc.SHARED reports True, so seems OK.

I don't think these Windows glitches are really problematic, and our  
production code runs only in linux environments anyway.
And I'm more interested in whether porting around 3kloc lucene-interfaces  
from v3.6 goes smoothly.

I've hit the first problematic case with a custom  
PythonAnalyzer/PythonTokenizer where I don't see how to pass the input to  
the Tokenizer implementation.
I thought maybe like this, but PythonTokenizer does not accept an INPUT  
anymore (available in v4.10 and v3.6).

class _Tokenizer(PythonTokenizer):
     def __init__(self, INPUT):
         super(_Tokenizer, self).__init__(INPUT)
         # prepare INPUT
     def incrementToken(self):
         # stuff into termAtt/offsetAtt/posIncrAtt

class Analyzer6(PythonAnalyzer):
     def createComponents(self, fieldName):
         return Analyzer.TokenStreamComponents(_Tokenizer())

The PositionIncrementTestCase is pretty similar but initialized with  
static input. Would be a nice place for an example with dynamic input, I  
think.

This was our 3.6 approach:
class Analyzer3(PythonAnalyzer):
     def tokenStream(self, fieldName, reader):
        data = data_from_reader(reader)
        class _tokenStream(PythonTokenStream):
            def __init__(self):
                 super(_tokenStream, self).__init__()
                 # prepare termAtt/offsetAtt/posIncrAtt
            def incrementToken(self):
                 # stuff from data into termAtt/offsetAtt/posIncrAtt
        return _tokenStream()

Any hints how to get Analyzer6 working?

--dirk