Posted to pylucene-dev@lucene.apache.org by Thomas Koch <ko...@orbiteam.de> on 2012/02/01 10:51:23 UTC

Setting Stopword Set in PyLucene (or using Set in general)

Hi,

is there any way to use the Java Set class in PyLucene? e.g. the
StopAnalyzer has a constructor with a Set for stopwords:
      StopAnalyzer(Version matchVersion, Set<?> stopWords)
see
http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/analysis/StopAnalyzer.html

This used to be a list earlier (and worked with Python list of string as
argument), but I fail to pass anything to this constructor (see code below).
I tried to use Python set and lucene.Set as well (cannot be instantiated
though). Did anyone manage to do this yet? Couldn't find examples in the
samples folder either.

(background: I'm currently trying to port some code from PyLucene2.9 to
PyLucene 3.x ...)

Example:

>>> import lucene
>>> lucene.initVM()
>>> v = lucene.Version.LUCENE_CURRENT
>>> a = lucene.StopAnalyzer(v)
>>> a.getStopwordSet()
<Set: [but, be, with, such, then, for, no...]>
>>> s = set(['der', 'die', 'das'])  # python set
>>> b = lucene.StopAnalyzer(v, s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
lucene.InvalidArgsError: (<type 'StopAnalyzer'>, '__init__', (<Version: LUCENE_CURRENT>, set(['die', 'der', 'das'])))

>>> s = lucene.Set(['der', 'die', 'das'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: ('instantiating java class', <type 'Set'>)

regards
Thomas
--
Thomas Koch
OrbiTeam Software GmbH & Co. KG
Bonn, Germany
http://www.orbiteam.de







AW: AW: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
Hi,
I have to add a comment to my previous mail:

> I would have preferred using this option (#2) in toArray (for both JavaList
> and JavaSet) as it does not require the wrapping into Java Integer (etc.)
> objects.
> However this method does not work with lucene.ArrayList:
> 
>  >>> x=lucene.JArray('int')([1,2])
>  JArray<int>[1, 2]
>  >>> y=lucene.ArrayList(x)
>  Traceback: lucene.InvalidArgsError:
>   (<type 'ArrayList'>, '__init__', (JArray<int>[1, 2],))
> 
Sorry - that's rubbish of course: ArrayList requires a collection in its
constructor and JArray isn't a collection. So this can't work! The
'challenge' was to be able to use JavaSet and/or JavaList (both are
collections) as an argument for ArrayList. (During init of ArrayList the
toArray() method is called however.)

So I gave it a quick try again, and tried the 2nd alternative:

> 1) return JArray(object)([<lucene.Integer()-object>*])
> or
> 2) return JArray(int)([<Python-int-literal>*])

but that option then fails (in the demo code) when using bool (or float)
types. Attached is a revised version of collections.py with the alternative
code (disabled) - if anyone's interested...

The mentioned issue with the created JArray containing the same objects
still remains. I'll have to look deeper into that, but as said I'm out of
office next week ...

BTW, sorry if this is out of scope of the PyLucene mailing list (it's more a
JCC related discussion) - we can continue with 'private' mail if that's
preferred. 

Regards,
Thomas

AW: AW: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
Hi Andi,
thanks for your feedback and for the code cleanup.

Regarding the 'toArray'-issue I tried different versions of JArray
'typed-constructor' and it turned out that these two alternatives basically
work:
(example for int types)

1) return JArray(object)([<lucene.Integer()-object>*])
or
2) return JArray(int)([<Python-int-literal>*])

What surprised me: there are different ways to construct those typed arrays,
using either a string or a type:

 >>> lucene.JArray(int)([1,2])
 and
 >>> lucene.JArray('int')([1,2])
 both create the same type  <type 'JArray_int'>

I would have preferred using this option (#2) in toArray (for both JavaList
and JavaSet) as it does not require the wrapping into Java Integer (etc.)
objects. However this method does not work with lucene.ArrayList:

 >>> x=lucene.JArray('int')([1,2])
 JArray<int>[1, 2]
 >>> y=lucene.ArrayList(x)
 Traceback: lucene.InvalidArgsError:
  (<type 'ArrayList'>, '__init__', (JArray<int>[1, 2],))

So I decided to choose the Java-object wrapper option (#1) and implemented
toArray for primitive types (int,float,long,bool). It turned out that
wrapping strings is not needed.  That way the collections-demo runs fine and
I can init a lucene.ArrayList with the JavaSet or JavaList for the mentioned
types.
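
For readers without the attachment, the boxing step could look roughly like
the sketch below. This is not the code from collections.py: box_values and
the boxers mapping are hypothetical names, and the stand-in wrapper classes
exist only so the sketch runs without a JVM. In PyLucene the mapping would
presumably point at Java wrapper classes such as lucene.Integer, and the
boxed list would then be passed to the JArray(object) constructor.

```python
def box_values(values, boxers):
    """Wrap each value whose exact type has an entry in `boxers`.

    Values of other types (for example str) are passed through unchanged.
    The lookup uses the exact type so that bool, which is a subclass of
    int in Python, gets its own wrapper instead of the int one.
    """
    boxed = []
    for value in values:
        wrap = boxers.get(type(value))
        boxed.append(wrap(value) if wrap is not None else value)
    return boxed


# Stand-ins for Java wrapper classes (hypothetical, for illustration only).
class FakeInteger(object):
    def __init__(self, value):
        self.value = value


class FakeBoolean(object):
    def __init__(self, value):
        self.value = value


boxers = {int: FakeInteger, bool: FakeBoolean}
result = box_values([1, True, 'a'], boxers)
```

Strings are left alone here, which matches the observation that wrapping
strings is not needed.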

Attached is a revised version of collections.py and collections-demo.py
(which should run without error now).

However there's still one question/issue as you can see from the output of
collections-demo.py (and some commented 'test code' in collections-demo.py):

created JArray: JArray<object>[<Object: 0>, <Object: 1>, <Object: 2>,
<Object: 3>, ...] <type 'JArray_object'>
created ArrayList: [java.lang.Object@785d65, java.lang.Object@785d65,
java.lang.Object@785d65, java.lang.Object@785d65....,] <type 'ArrayList'>

It looks as if all the objects passed from JavaSet to lucene.ArrayList end up
as the same object (that would also explain why indexOf behaves strangely).
It could be a bug in my test code, but lucene.HashSet(JavaSet), for example,
does not show this problem, so I'm really curious what's going on here...

If you have any ideas, please let me know. I will also look into it again
when I have some time, but I'll be busy for most of the week and out of
office next week.

regards,
Thomas

-----Original Message-----
From: Andi Vajda [mailto:vajda@apache.org] 
Sent: Monday, 12 March 2012 03:34
To: pylucene-dev@lucene.apache.org
Cc: Thomas Koch
Subject: Re: AW: AW: AW: Setting Stopword Set in PyLucene (or using Set in
general)


  Hi Thomas,

On Fri, 2 Mar 2012, Thomas Koch wrote:

> thanks for the feedback! I revised the code and send you attached a 
> new patch.

Sorry for the delay in getting back to you.

I integrated your patch and fixed a bunch of formatting and bugs in it.
The collections-demo.py is not fully functional yet so I attach it here too,
somewhat fixed up as well.

There is a bug somewhere with constructing an ArrayList from a python
collection like JavaSet or JavaList. At some point, toArray() gets called,
the right array is returned (almost, see below) but the ArrayList looks as if
it were built from an array of empty objects.

> I also attach a short demo script that shows the problems I mentioned 
> earlier when trying to initialize an ArrayList with a JavaSet (or 
> JavaList) containing integers.

For that, the toArray() methods in collections.py must create the correct
array type using int, float, etc. instead of object, based on what's in the
python object.
Alternatively, these methods need to box the int values by
wrapping them into a Java Integer object (for example, lucene.Integer(5)).
I leave that to you to continue with, I'm out of time for right now :-)

> Finally I'd suggest to rename collections.py because there's one 
> defined on Python lib already:
> http://docs.python.org/library/collections.html

Until this happens, you can use:
  from lucene import collections
as the collections.py file gets installed in the lucene package.

Throwing Java exceptions from Python is done by raising JavaError with the
desired Java exception object (I added a few to the jcc call in PyLucene's
Makefile), for example:
   raise JavaError, NoSuchElementException(str(index))

It's been like that for a very long time, I just forgot.
This is implemented by throwPythonError() in jcc's functions.cpp: if the
error is JavaError, then the Java exception instance used as argument to it
is raised to the JVM.

I attached the not-checked-in diffs as patches. The new Makefile is checked
into the pylucene-3.x branch.

> Below are some comments to your comments...

More responses inline below.

> Ok, I was unsure on how to properly throw a Java Exception in Python 
> code - and couldn't find an example.
> Also I thought a Java Exception type should be exported in lucene - 
> this is not the case however:
>>>> lucene.NoSuchElementException
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'NoSuchElementException'
>
> I imagine I could
> - add the java.util.NoSuchElementException to the Makefile to get it 
> generated by JCC and throw it via raise?
> - use lucene.JavaError and pass  'java.util.NoSuchElementException' 
> name in the constructor?

Yes, you guessed it right, this is how it works as outlined above.

You had various bugs in next()/nextIndex(), previous()/previousIndex() that
I hopefully fixed. Also, listIterator() can't be overridden in Python, I
fixed it in PythonList and in collections.py.

Andi..

Re: AW: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Andi Vajda <va...@apache.org>.
  Hi Thomas,

On Fri, 2 Mar 2012, Thomas Koch wrote:

> thanks for the feedback! I revised the code and send you attached a new
> patch.

Sorry for the delay in getting back to you.

I integrated your patch and fixed a bunch of formatting and bugs in it.
The collections-demo.py is not fully functional yet so I attach it here too, 
somewhat fixed up as well.

There is a bug somewhere with constructing an ArrayList from a python 
collection like JavaSet or JavaList. At some point, toArray() gets called, 
the right array is returned (almost, see below) but the ArrayList looks as if 
it were built from an array of empty objects.

> I also attach a short demo script that shows the problems I mentioned
> earlier when trying to initialize an ArrayList with a JavaSet (or JavaList)
> containing integers.

For that, the toArray() methods in collections.py must create the correct
array type using int, float, etc. instead of object, based on what's in the 
python object.
Alternatively, these methods need to box the int values by 
wrapping them into a Java Integer object (for example, lucene.Integer(5)).
I leave that to you to continue with, I'm out of time for right now :-)

> Finally I'd suggest to rename collections.py because there's one defined on
> Python lib already:
> http://docs.python.org/library/collections.html

Until this happens, you can use:
  from lucene import collections
as the collections.py file gets installed in the lucene package.

Throwing Java exceptions from Python is done by raising JavaError with the 
desired Java exception object (I added a few to the jcc call in PyLucene's 
Makefile), for example:
   raise JavaError, NoSuchElementException(str(index))

It's been like that for a very long time, I just forgot.
This is implemented by throwPythonError() in jcc's functions.cpp: if the 
error is JavaError, then the Java exception instance used as argument to it 
is raised to the JVM.

I attached the not-checked-in diffs as patches. The new Makefile is checked 
into the pylucene-3.x branch.

> Below are some comments to your comments...

More responses inline below.

> Ok, I was unsure on how to properly throw a Java Exception in Python code -
> and couldn't find an example.
> Also I thought a Java Exception type should be exported in lucene - this is
> not the case however:
>>>> lucene.NoSuchElementException
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'NoSuchElementException'
>
> I imagine I could
> - add the java.util.NoSuchElementException to the Makefile to get it
> generated by JCC and throw it via raise?
> - use lucene.JavaError and pass  'java.util.NoSuchElementException' name in
> the constructor?

Yes, you guessed it right, this is how it works as outlined above.

You had various bugs in next()/nextIndex(), previous()/previousIndex() that 
I hopefully fixed. Also, listIterator() can't be overridden in Python, I 
fixed it in PythonList and in collections.py.

Andi..

AW: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
Hi Andi,
thanks for the feedback! I revised the code and send you attached a new
patch.

I also attach a short demo script that shows the problems I mentioned
earlier when trying to initialize an ArrayList with a JavaSet (or JavaList)
containing integers.

Finally I'd suggest to rename collections.py because there's one defined on
Python lib already:
http://docs.python.org/library/collections.html

Below are some comments to your comments...

Regards,
Thomas

> -----Original Message-----
> From: Andi Vajda [mailto:vajda@apache.org]
> Sent: Sunday, 26 February 2012 23:29
> To: pylucene-dev@lucene.apache.org
> Subject: Re: AW: AW: Setting Stopword Set in PyLucene (or using Set in
> general)
> 
> ...
> According to the javadocs, this method is supposed to throw
> NoSuchElementException. Raising StopIteration is not going to do the trick.
> Same comment on the previous method too.

Ok, I was unsure on how to properly throw a Java Exception in Python code -
and couldn't find an example. 
Also I thought a Java Exception type should be exported in lucene - this is
not the case however:
>>> lucene.NoSuchElementException
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'NoSuchElementException'

I imagine I could
- add the java.util.NoSuchElementException to the Makefile to get it
generated by JCC and throw it via raise?
- use lucene.JavaError and pass the 'java.util.NoSuchElementException' name in
the constructor?
- extend / use PythonException?
- define a Python Exception class
NoSuchElementException(exceptions.Exception) and raise that one?
- raise RuntimeError, 'NoSuchElementException'?
- define some helper methods for 'native' Java Exceptions in PythonList.java
and call them?

Which one does the trick? Unless I learn of a better way, I'll go with the
last one...

(I understand PythonException is used by JCC to wrap errors that escape from
Python to Java and JavaError is used by JCC for Java Exceptions that escape
from Java to Python - but how do you 'fake' a Java Exception within Python?)

Same problem for IndexOutOfBoundsException in get()

> Why not also implement remove() and set() ?

Because they are optional ... I've implemented them now.

+    def lastIndexOf(obj):
> Wouldn't it be more efficient to iterate backwards until the element is
> found instead of copying the list (self._lst[::-1]) and iterate forwards ?
Done.

+    def remove(self, obj_or_index):
+        if type(obj_or_index) is type(1):
+            return removeAt(int(obj_or_index))
+        return removeElement(obj_or_index)

> It's better to do this at the Java level. 
> Declare differently named native methods for each overload of remove() and
> implement remove(int) in Java to call removeInt(int) and remove(Object) to
> call removeObject(Object)

Done. The different methods are declared private now.

+
+    def subList(fromIndex, toIndex):
+        sublst = self._lst[fromIndex:toIndex]
+        return JavaList(sublst)

> The javadoc expects this method to throw IndexOutOfBoundsException
> instead of behaving nicely like a Python slice.

This check (and the exception handling) is done at the Java level now.


+public class PythonListIterator extends PythonIterator implements
ListIterator {
+
+    // private long pythonObject;
+
+    public PythonListIterator()
+    {
+    }
+ 
+    /* defined in super class PythonIterator:
...
> If this works, you don't need pythonObject to be protected anymore in the
> superclass then ?

True - just reverted the changes in PythonIterator. 



Re: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Andi Vajda <va...@apache.org>.
   Hi Thomas,

Here are comments inline on your patch, quoting just the relevant fragments.
Thank you !

Andi..

Index: python/collections.py
===================================================================
--- python/collections.py	(revision 1292224)
+++ python/collections.py	(working copy)

      def remove(self, obj):
          try:
              self._set.remove(obj)
@@ -104,7 +104,7 @@
      def retainAll(self, collection):
          result = False
          for obj in list(self._set):
-            if obj not in c:
+            if obj not in collection:
                  self._set.remove(obj)
                  result = True
          return result

Whoops. Thank you. Integrated.
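
As a side note on the hunk above: the loop iterates over list(self._set)
rather than over self._set itself because removing elements from a set while
iterating over it can raise a RuntimeError in Python. A minimal standalone
sketch of the same pattern follows; the function name retain_all is made up
for this illustration and is not part of collections.py.

```python
def retain_all(target, collection):
    """Remove from `target` every element that is not in `collection`.

    Returns True if `target` changed, mirroring java.util.Set.retainAll().
    """
    changed = False
    # Iterate over a snapshot so that removing elements is safe.
    for obj in list(target):
        if obj not in collection:
            target.remove(obj)
            changed = True
    return changed


words = set(['der', 'die', 'das'])
changed = retain_all(words, ['die', 'the'])
```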

+class JavaListIterator(PythonListIterator):
+    """
+    This class implements java.util.ListIterator for a Python list instance it wraps.
+    (simple  bidirectional iterator)
+    """
+    def __init__(self, _lst, index=0):
+        super(JavaListIterator, self).__init__()
+        self._lst = _lst
+        # TODO: raise JavaError for IndexOutOfBoundsException!?
+        assert (index>=0 and index<len(_lst)), "index is not out of range"
+        self.index = index

I'd not check index here and raise StopIteration later as needed.
That way, if _lst changes and index "becomes" valid, things still work.
Or conversely, if _lst changes and index becomes invalid, you're not depending
on this check.

+    def next(self):
+        try:
+            result = self._lst[self.index]
+            self.index += 1
+        except IndexError:
+            # TODO: raise JavaError for NoSuchElementException!?
+            raise StopIteration
+        return result

Why not just check for self.index to be in range here and raise StopIteration
instead of relying on the exception ? It's cheaper and that's how you do it
below, it's good to be consistent.

+    def previous(self):
+        self.index -= 1
+        if self.index < 0:
+            # TODO: raise JavaError for NoSuchElementException!?
+            raise StopIteration
+        return self.collection[self.index]

According to the javadocs, this method is supposed to throw
NoSuchElementException. Raising StopIteration is not going to do the trick.
Same comment on the previous method too.

+    def hasPrevious(self):
+        return self.index>0
+
+    def hasNext(self):
+        return self.index<len(self._lst)
+
+    def nextIndex(self):
+        return min(self.index,len(self._lst))
+
+    def previousIndex(self):
+        return max(self.index,-1)

Please, be consistent and use spaces between operators and after commas.
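
To make the expected behaviour concrete, here is a plain-Python sketch of the
cursor semantics that java.util.ListIterator describes: the cursor sits
between elements, nextIndex() is the cursor itself and previousIndex() is the
cursor minus one. It runs without a JVM, so NoSuchElement is only a stand-in
exception and ListCursor is a hypothetical name; in PyLucene the class would
extend PythonListIterator and raise JavaError with a Java
NoSuchElementException instead.

```python
class NoSuchElement(Exception):
    """Stand-in for java.util.NoSuchElementException (hypothetical)."""


class ListCursor(object):
    """Bidirectional cursor over a Python list.

    next() returns lst[index] and advances the cursor; previous() steps
    the cursor back and returns the element it passed.
    """

    def __init__(self, lst, index=0):
        self._lst = lst
        self.index = index  # deliberately not validated here

    def hasNext(self):
        return 0 <= self.index < len(self._lst)

    def hasPrevious(self):
        return 0 < self.index <= len(self._lst)

    def next(self):
        # Range check up front instead of catching IndexError.
        if not self.hasNext():
            raise NoSuchElement(str(self.index))
        result = self._lst[self.index]
        self.index += 1
        return result

    def previous(self):
        if not self.hasPrevious():
            raise NoSuchElement(str(self.index - 1))
        self.index -= 1
        return self._lst[self.index]

    def nextIndex(self):
        return self.index

    def previousIndex(self):
        return self.index - 1


it = ListCursor(['a', 'b'])
seen = [it.next(), it.next()]
back = it.previous()
```

Because the index is checked on every call rather than in __init__, the
cursor keeps working if the underlying list grows or shrinks in between.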

+    def __iter__(self):
+        return self

Why not also implement remove() and set() ?


+
+class JavaList(PythonList):
+    """
+    This class implements java.util.List around a Python list instance it wraps.
+    """
+
+    def __init__(self, _lst):
+        super(JavaList, self).__init__()
+        self._lst = _lst
+
+    def __contains__(self, obj):
+        return obj in self._lst
+
+    def __len__(self):
+        return len(self._lst)
+
+    def __iter__(self):
+        return iter(self._lst)
+
+    def add(self, index, obj):
+        self._lst.insert(index, obj)
+
+    def addAll(self, collection):
+        size = len(self._lst)
+        self._lst.extend(collection)
+        return len(self._lst) > size
+
+    def addAll(self, index, collection):
+        size = len(self._lst)
+        self._lst[index:index]=collection
+        return len(self._lst) > size
+
+    def clear(self):
+        del self._lst
+        self._lst = []

Why not clear the list in place with del self._lst[:] ?
Changing the _lst reference is going to trip over users who assume that _lst
is always what they put in to begin with.
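
The difference between the two ways of clearing can be seen in a small
standalone sketch (the class names Rebinding and InPlace are made up for the
illustration and have nothing to do with PyLucene):

```python
class Rebinding(object):
    """clear() points the wrapper at a brand new list."""

    def __init__(self, lst):
        self._lst = lst

    def clear(self):
        self._lst = []


class InPlace(object):
    """clear() empties the list the caller handed in."""

    def __init__(self, lst):
        self._lst = lst

    def clear(self):
        del self._lst[:]


a = ['x', 'y']
rebinding = Rebinding(a)
rebinding.clear()   # `a` still holds its elements, wrapper and caller diverge

b = ['x', 'y']
in_place = InPlace(b)
in_place.clear()    # `b` is now empty, wrapper and caller still share one list
```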

+    def contains(self, obj):
+        return obj in self._lst
+
+    def containsAll(self, collection):
+        for obj in collection:
+            if obj not in self._lst:
+                return False
+        return True
+
+    def equals(self, collection):
+        if type(self) is type(collection):
+            return self._lst == collection._lst
+        return False
+
+    def get(index):
+        return self._lst[index]

What if index is out of range ?
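
One pitfall worth spelling out: a Python list accepts negative indexes, so
self._lst[-1] quietly returns the last element where java.util.List.get()
is documented to throw IndexOutOfBoundsException. A sketch of an explicit
check follows; checked_get is a hypothetical helper, and IndexError stands in
for the Java exception so that the example runs without a JVM.

```python
def checked_get(lst, index):
    """Return lst[index], rejecting negative and too-large indexes."""
    if index < 0 or index >= len(lst):
        # Stand-in: a Java caller would expect IndexOutOfBoundsException.
        raise IndexError("index out of range: %d" % index)
    return lst[index]


values = ['a', 'b', 'c']
first = checked_get(values, 0)
```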

+
+    def isEmpty(self):
+        return len(self._lst) == 0
+
+    def iterator(self):
+        class _iterator(PythonIterator):
+            def __init__(_self):
+                super(_iterator, _self).__init__()
+                _self._iterator = iter(self._lst)
+            def hasNext(_self):
+                if hasattr(_self, '_next'):
+                    return True
+                try:
+                    _self._next = _self._iterator.next()
+                    return True
+                except StopIteration:
+                    return False
+            def next(_self):
+                if hasattr(_self, '_next'):
+                    next = _self._next
+                    del _self._next
+                else:
+                    next = _self._iterator.next()
+                return next
+        return _iterator()
+
+    def lastIndexOf(obj):
+        try:
+            return len(self._lst)-1-self._lst[::-1].index(obj)
+        except ValueError:
+            return -1

Please use spaces between operators.
Wouldn't it be more efficient to iterate backwards until the element is found
instead of copying the list (self._lst[::-1]) and iterate forwards ?
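
A backward scan without the copy could look like this sketch (written as a
free function with the hypothetical name last_index_of so it can be tried
outside of PyLucene):

```python
def last_index_of(lst, obj):
    """Return the index of the last occurrence of obj in lst, or -1."""
    # Walk from the end towards the start and stop at the first match.
    for i in range(len(lst) - 1, -1, -1):
        if lst[i] == obj:
            return i
    return -1


numbers = [1, 2, 1, 3]
last_one = last_index_of(numbers, 1)
```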

+
+    def listIterator(self):
+        return JavaListIterator(self._lst)
+
+    def listIterator(self, index):
+        return JavaListIterator(self._lst, index)
+
+    def remove(self, obj_or_index):
+        if type(obj_or_index) is type(1):
+            return removeAt(int(obj_or_index))
+        return removeElement(obj_or_index)

It's better to do this at the Java level. 
Declare differently named native methods for each overload of remove() and
implement remove(int) in Java to call removeInt(int)
and remove(Object) to call removeObject(Object)  (or whatever you name them)

+    def removeAt(self, pos):
+        """Removes the element at the specified position in this list
+        """
+        try:
+            el = self._lst[pos]
+            del self._lst[pos]
+            return el
+        except IndexError:
+            # TODO: raise JavaError for IndexOutOfBoundsException!?
+            return None

Why not check the index to be in range instead of try/except ?

+    def removeElement(self, obj):
+        """Removes the first occurrence of the specified element from this list, if it is present
+        """
+        try:
+            self._lst.remove(obj)
+            return True
+        except ValueError:
+            return False
+
+    def removeAll(self, collection):
+        result = False
+        for obj in collection:
+            if self.removeElement(obj):
+                result = True
+        return result
+
+    def retainAll(self, collection):
+        result = False
+        for obj in self._lst:
+            if (obj not in collection
+                and self.removeElement(obj)):
+                result = True
+        return result
+
+    def size(self):
+        return len(self._lst)
+
+    def toArray(self):
+        return self._lst
+
+    def subList(fromIndex, toIndex):
+        sublst = self._lst[fromIndex:toIndex]
+        return JavaList(sublst)

The javadoc expects this method to throw IndexOutOfBoundsException instead of
behaving nicely like a Python slice.
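
For reference, the List.subList() javadoc lists the illegal cases as
fromIndex < 0, toIndex > size and fromIndex > toIndex. A sketch of that check
in plain Python follows; checked_sublist is a hypothetical name and
IndexError stands in for IndexOutOfBoundsException. Note that the slice is a
copy, whereas Java's subList() returns a view backed by the original list.

```python
def checked_sublist(lst, from_index, to_index):
    """Return lst[from_index:to_index] after Java-style bounds checks."""
    if from_index < 0 or to_index > len(lst) or from_index > to_index:
        # Stand-in: a Java caller would expect IndexOutOfBoundsException.
        raise IndexError("bad range: %d..%d" % (from_index, to_index))
    return lst[from_index:to_index]


letters = ['a', 'b', 'c', 'd']
middle = checked_sublist(letters, 1, 3)
```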

+    def set(index, obj):
+        try:
+            self._lst[index]=obj
+        except IndexError:
+            raise
+        #TODO raise JavaError for IndexOutOfBoundsException instead?!

Please use spaces between operators.


Index: java/org/apache/pylucene/util/PythonListIterator.java
===================================================================

+public class PythonListIterator extends PythonIterator implements ListIterator {
+
+    // private long pythonObject;
+
+    public PythonListIterator()
+    {
+    }
+ 
+    /* defined in super class PythonIterator:
+    public void pythonExtension(long pythonObject)
+    {
+        this.pythonObject = pythonObject;
+    }
+    public long pythonExtension()
+    {
+        return this.pythonObject;
+    }
+    */

If this works, you don't need pythonObject to be protected anymore in the
superclass then ?

+    public native boolean hasPrevious();
+    public native Object previous();
+ 
+    public native int nextIndex();
+    public native int previousIndex();
+ 
+    public void	set(Object obj) {
+        throw new UnsupportedOperationException();
+    }

What about remove () ?

+    public void add(Object obj) {
+        throw new UnsupportedOperationException();
+    }

Why not support them ?

+ 
+}
Index: java/org/apache/pylucene/util/PythonIterator.java
===================================================================
--- java/org/apache/pylucene/util/PythonIterator.java	(revision 1292224)
+++ java/org/apache/pylucene/util/PythonIterator.java	(working copy)
@@ -20,7 +20,7 @@

  public class PythonIterator implements Iterator {

-    private long pythonObject;
+    protected long pythonObject;

This shouldn't be needed anymore according to the code commented out above.

Index: java/org/apache/pylucene/util/PythonList.java
===================================================================

+    public native boolean add(Object obj);
+    public native void add(int index, Object obj);
+    public native boolean addAll(Collection c);
+    public native boolean addAll(int index, Collection c);
+    public native void clear();
+    public native boolean contains(Object obj);
+    public native boolean containsAll(Collection c);
+    public native boolean equals(Object obj);
+    public native Object get(int index);
+    // public native int hashCode();
+    public native int indexOf(Object obj);
+    public native boolean isEmpty();
+    public native Iterator iterator();
+    public native int lastIndexOf(Object obj);
+    public native ListIterator listIterator();
+    public native ListIterator listIterator(int index);

+    public native Object remove(int index);
+    public native boolean remove(Object obj);

Here you should declare different names for the remove() overloads and have
the remove() overloads invoke each native method accordingly.
For example:
     public native Object removeAt(int index);
     public Object remove(int index)
     {
         return removeAt(index);
     }

     public native boolean removeObject(Object obj);
     public boolean remove(Object obj)
     {
         return removeObject(obj);
     }


AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.

> Attached is a patch against
> http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_5/
> (revision 1292224)
> 
Should be attached NOW ,-)

regards
Thomas


Re: AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Andi Vajda <va...@apache.org>.
 Hi Thomas,

Message received. I should get to it in more detail in the next few days.

Andi..

On Feb 22, 2012, at 4:55, "Thomas Koch" <ko...@orbiteam.de> wrote:

> Hi Andi,
> 
>> ...
>> And the very same could be done for java.util.ArrayList. It should be easy
>> enough by following the JavaSet/PythonSet example.
>> 
>> If you send in a patch that implements this, I'd be glad to integrate it !
> 
> I've now implemented the PythonList as suggested - sorry for late reply, was
> busy with other things and actually didn't need it myself, but think it's
> useful anyway for JCC/PyLucene. 
> 
> I needed to add a PythonListIterator "wrapper" as well and needed to change
> the pythonObject reference in PythonIterator from private to
>    protected long pythonObject;
> 
> Hope that doesn't break anything... BTW, are there any tests for the
> Java/PythonXXX classes yet? The build did run fine and I was able to
> instantiate both JavaSet and JavaList in a Python shell (see example below).
> There is one problem/BUG left though: when creating a JavaList instance from
> a python list of ints a TypeError occurs. It does work for a list of str
> (see below).
> 
> The toArray() method (implemented as in the JavaSet class) seems to be the
> cause of the problem:
> 
>>>> l=range(3)
> [0, 1, 2]
>>>> pl= collections.JavaList(l)
> <JavaList: org.apache.pylucene.util.PythonList@12d96f2>
>>>> pl.toArray()
> [0, 1, 2]
>>>> jl = lucene.ArrayList(pl)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> lucene.JavaError: org.apache.jcc.PythonException: ('while calling',
> 'toArray', [0, 1, 2])
> TypeError: ('while calling', 'toArray', [0, 1, 2])  Java stacktrace:
> org.apache.jcc.PythonException: ('while calling', 'toArray', [0, 1, 2])
> TypeError: ('while calling', 'toArray', [0, 1, 2])
>        at org.apache.pylucene.util.PythonList.toArray(Native Method)
>        at java.util.ArrayList.<init>(ArrayList.java:131)
> 
> I guess the ints need to be casted to Objects somehow. Interestingly this is
> done in the wrapped Java Classes like HashSet already:
> 
>>>> s = set(l)
>>>> ps = collections.JavaSet(s)
> <JavaSet: org.apache.pylucene.util.PythonSet@af993e>
>>>> ps.toArray()
> [0, 1, 2]
>>>> js = lucene.HashSet(ps)
> <HashSet: [0, 1, 2]>
>>>> js.toArray()
> JArray<object>[<Object: 0>, <Object: 1>, <Object: 2>]
> 
> I'm not sure how to fix this and would welcome suggestions.
> Is there some helper method for type-safe 'Python2Java casting' that should
> be used?
> 
> Some further (minor) remarks: I was wondering about "compatibility" with
> Java interfaces, i.e.
> - do we need to implement this method?
>    public native int hashCode();
> (currently not implemented by PythonList and PythonSet)
> 
> - do we need to replace Python Exception with their Java pendant?
>  e.g. IndexError -> IndexOutOfBoundsException
>  I've added some comments in the code where this could (should?) be done:
>  e.g.
>    # TODO: raise JavaError for IndexOutOfBoundsException!?
>    # TODO: raise JavaError for NoSuchElementException - if the
> iteration has no next element!?
>    
> - how to handle/implement methods with same signature (i.e. number of args)
> in Python?
> e.g.
>  public native Object remove(int index);
>  public native boolean remove(Object obj);
> 
> In this particular case I've implemented one Python method and did a type
> check, not sure if that's optimal or needed at all (does JCC handle this
> already?).
> 
> TODO: 
> - resolve TypeError for JavaList when created from python list of ints
> - remove comments (or change Exceptions)
> - write a test case (note: there was a bug/typo in JavaSet.retainAll() -
> fixed)
> - merge with trunk
> 
> Attached is a patch against
> http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_5/
> (revision 1292224)
> 
> I wouldn't recommend integrating the code until the mentioned bug is
> resolved.
> Of course I'm willing to finalize this properly but need some help at this
> point.
> 
> kind regards
> 
> Thomas 
> --
> OrbiTeam Software GmbH & Co. KG
> 53121 Bonn - Germany
> http://www.orbiteam.de
> --
> P.S.  And here is an example for those of you who ask "what are they talking
> about?"  
> 
>>>> import lucene
>>>> lucene.initVM()
> <jcc.JCCEnv object at 0x01390AF0>
>>>> l = ['a','b','c']
>>>> import collections
>>>> pl = collections.JavaList(l)
>>>> pl.size()
> 3
>>>> jl = lucene.ArrayList(pl)
>>>> jl
> <ArrayList: [a, b, c]>
>>>> 
> 
> now we have created an instance of a java.util.ArrayList with a python
> "native" list (l) wrapped by the collections.JavaList as constructor
> argument 
> 
>>>> s = set(l)
>>>> ps = collections.JavaSet(s)
>>>> ps
> <JavaSet: org.apache.pylucene.util.PythonSet@160a26f>
>>>> ps.size()
> 3
>>>> js = lucene.HashSet(ps)
>>>> js
> <HashSet: [b, c, a]>
> 
> 
> now we have created an instance of a java.util.HashSet with a python "native"
> set (s) wrapped by the collections.JavaSet as constructor argument 
> 
> (BTW, I found it difficult to understand why one class implemented in Java
> is called PythonSet whereas the one implemented in Python - and wraps the
> Java counterpart - is called JavaSet, but that's just a comment and depends on
> the point of view)
> 
>> -----Original Message-----
>> From: Andi Vajda [mailto:vajda@apache.org]
>> Sent: Wednesday, 1 February 2012 19:08
>> To: pylucene-dev@lucene.apache.org
>> Subject: Re: AW: Setting Stopword Set in PyLucene (or using Set in general)
>> 
>> 
>>  Hi Thomas,
>> 
>> On Wed, 1 Feb 2012, Thomas Koch wrote:
>> 
>>> OK, I found a solution (obviously not the best one...): lucene.Set is
>>> representing a java.util *interface* Set<E> which of course cannot be
>>> instantiated. HashSet is an implementing class, and can be
>>> instantiated. You can add elements via the add() method to the set then.
>> Example:
>>> 
>>> def get_lucene_set(python_list):
>>>   """convert python list into lucene.Set (Java.util.set interface)
>>>         using the HashSet class (java.util) wrapped in lucene.HashSet
>>>   """
>>>   hs = lucene.HashSet()
>>>   for el in python_list:
>>>       hs.add(el)
>>>   return hs
>>> 
>>> However I'm still looking for a more elegant constructor that would
>>> allow to create a HashSet from a python set (or list). Is that
>> available/possible?
>> 
>> In pylucene's python directory, there is a file called collections.py that
>> has what you're looking for, I think.
>> 
>> It's a Python class called JavaSet, that extends a PythonSet class which is
>> an extension point for the java.util.Set interface. PythonSet implements all
>> the java.util.Set methods by calling the corresponding python methods on the
>> JavaSet python class. PythonSet itself is defined in
>> java/org/apache/pylucene/util/PythonSet.java.
>> 
>> With this pair of classes you have a Python-backed set object being
>> integrated with Java via a java.util.Set implementation.
>> 
>>> The same holds for lists like the ArrayList (from java.util too) which
>>> implements the Collection interface:
>> 
>> And the very same could be done for java.util.ArrayList. It should be easy
>> enough by following the JavaSet/PythonSet example.
>> 
>> If you send in a patch that implements this, I'd be glad to integrate it !
>> 
>> Andi..
>> 
> 
> 

AW: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
Hi Andi,

> ...
> And the very same could be done for java.util.ArrayList. It should be easy
> enough by following the JavaSet/PythonSet example.
> 
> If you send in a patch that implements this, I'd be glad to integrate it !

I've now implemented the PythonList as suggested - sorry for late reply, was
busy with other things and actually didn't need it myself, but think it's
useful anyway for JCC/PyLucene. 

I needed to add a PythonListIterator "wrapper" as well and needed to change
the pythonObject reference in PythonIterator from private to
    protected long pythonObject;

Hope that doesn't break anything... BTW, are there any tests for the
Java/PythonXXX classes yet? The build did run fine and I was able to
instantiate both JavaSet and JavaList in a Python shell (see example below).
There is one problem/BUG left though: when creating a JavaList instance from
a python list of ints a TypeError occurs. It does work for a list of str
(see below).

The toArray() method (implemented as in the JavaSet class) seems to be the
cause of the problem:

>>> l=range(3)
 [0, 1, 2]
>>> pl= collections.JavaList(l)
<JavaList: org.apache.pylucene.util.PythonList@12d96f2>
>>> pl.toArray()
[0, 1, 2]
>>> jl = lucene.ArrayList(pl)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
lucene.JavaError: org.apache.jcc.PythonException: ('while calling',
'toArray', [0, 1, 2])
TypeError: ('while calling', 'toArray', [0, 1, 2])  Java stacktrace:
org.apache.jcc.PythonException: ('while calling', 'toArray', [0, 1, 2])
TypeError: ('while calling', 'toArray', [0, 1, 2])
        at org.apache.pylucene.util.PythonList.toArray(Native Method)
        at java.util.ArrayList.<init>(ArrayList.java:131)

I guess the ints need to be cast to Objects somehow. Interestingly, this is
done in the wrapped Java classes like HashSet already:

>>> s = set(l)
>>> ps = collections.JavaSet(s)
<JavaSet: org.apache.pylucene.util.PythonSet@af993e>
>>> ps.toArray()
[0, 1, 2]
>>> js = lucene.HashSet(ps)
<HashSet: [0, 1, 2]>
>>> js.toArray()
JArray<object>[<Object: 0>, <Object: 1>, <Object: 2>]

I'm not sure how to fix this and would welcome suggestions.
Is there some helper method for type-safe 'Python2Java casting' that should
be used?

Some further (minor) remarks: I was wondering about "compatibility" with
Java interfaces, i.e.
 - do we need to implement this method?
    public native int hashCode();
 (currently not implemented by PythonList and PythonSet)

 - do we need to replace Python exceptions with their Java counterparts?
  e.g. IndexError -> IndexOutOfBoundsException
  I've added some comments in the code where this could (should?) be done:
  e.g.
	# TODO: raise JavaError for IndexOutOfBoundsException!?
	# TODO: raise JavaError for NoSuchElementException - if the iteration has no next element!?
	
 - how to handle/implement methods with same signature (i.e. number of args)
in Python?
 e.g.
  public native Object remove(int index);
  public native boolean remove(Object obj);
 
In this particular case I've implemented one Python method and did a type
check, not sure if that's optimal or needed at all (does JCC handle this
already?).
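
The type-check approach for the two remove() overloads could look roughly
like the following. This is a minimal pure-Python sketch, not the code from
the patch: ListAdapter is a hypothetical stand-in that does not extend
PythonList, and it only illustrates dispatching on the argument type. Note
that for a list of ints the two overloads cannot be told apart this way; the
sketch then always treats the argument as an index.

```python
class ListAdapter(object):
    """Hypothetical stand-in for a Python-backed java.util.List."""

    def __init__(self, items=()):
        self._list = list(items)

    def remove(self, arg):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(arg, int) and not isinstance(arg, bool):
            # Corresponds to remove(int index): return the removed element.
            # An out-of-range index raises IndexError here.
            return self._list.pop(arg)
        # Corresponds to remove(Object obj): report whether the list changed.
        try:
            self._list.remove(arg)
        except ValueError:
            return False
        return True


adapter = ListAdapter(['a', 'b', 'c'])
first = adapter.remove(0)      # index form
changed = adapter.remove('c')  # object form
```

Whether JCC needs such a check at all, or resolves the overload before
calling into Python, is exactly the open question above.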

TODO: 
- resolve TypeError for JavaList when created from python list of ints
- remove comments (or change Exceptions)
- write a test case (note: there was a bug/typo in JavaSet.retainAll() -
fixed)
- merge with trunk

Attached is a patch against
http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_5/
(revision 1292224)

I wouldn't recommend integrating the code until the mentioned bug is
resolved.
Of course I'm willing to finalize this properly but need some help at this
point.

kind regards

Thomas 
--
OrbiTeam Software GmbH & Co. KG
53121 Bonn - Germany
http://www.orbiteam.de
--
P.S.  And here is an example for those of you who ask "what are they talking
about?"  

>>> import lucene
>>> lucene.initVM()
<jcc.JCCEnv object at 0x01390AF0>
>>> l = ['a','b','c']
>>> import collections
>>> pl = collections.JavaList(l)
>>> pl.size()
3
>>> jl = lucene.ArrayList(pl)
>>> jl
<ArrayList: [a, b, c]>
>>>

now we have created an instance of a java.util.ArrayList with a python
"native" list (l) wrapped by the collections.JavaList as constructor
argument 

>>> s = set(l)
>>> ps = collections.JavaSet(s)
>>> ps
<JavaSet: org.apache.pylucene.util.PythonSet@160a26f>
>>> ps.size()
3
>>> js = lucene.HashSet(ps)
>>> js
<HashSet: [b, c, a]>


now we have created an instance of a java.util.HashSet with a python "native"
set (s) wrapped by the collections.JavaSet as constructor argument 

(BTW, I found it difficult to understand why the class implemented in Java
is called PythonSet whereas the one implemented in Python, which wraps its
Java counterpart, is called JavaSet, but that's just a comment and depends on
the point of view)

> -----Original Message-----
> From: Andi Vajda [mailto:vajda@apache.org]
> Sent: Wednesday, 1 February 2012 19:08
> To: pylucene-dev@lucene.apache.org
> Subject: Re: AW: Setting Stopword Set in PyLucene (or using Set in general)
> 
> 
>   Hi Thomas,
> 
> On Wed, 1 Feb 2012, Thomas Koch wrote:
> 
> > OK, I found a solution (obviously not the best one...): lucene.Set is
> > representing a java.util *interface* Set<E> which of course cannot be
> > instantiated. HashSet is an implementing class, and can be
> > instantiated. You can add elements via the add() method to the set then.
> > Example:
> >
> > def get_lucene_set(python_list):
> >    """convert python list into lucene.Set (Java.util.set interface)
> >          using the HashSet class (java.util) wrapped in lucene.HashSet
> >    """
> >    hs = lucene.HashSet()
> >    for el in python_list:
> >        hs.add(el)
> >    return hs
> >
> > However I'm still looking for a more elegant constructor that would
> > allow to create a HashSet from a python set (or list). Is that
> > available/possible?
> 
> In pylucene's python directory, there is a file called collections.py that
> has what you're looking for, I think.
> 
> It's a Python class called JavaSet, that extends a PythonSet class which is
> an extension point for the java.util.Set interface. PythonSet implements all
> the java.util.Set methods by calling the corresponding python methods on the
> JavaSet python class. PythonSet itself is defined in
> java/org/apache/pylucene/util/PythonSet.java.
> 
> With this pair of classes you have a Python-backed set object being
> integrated with Java via a java.util.Set implementation.
> 
> > The same holds for lists like the ArrayList (from java.util too) which
> > implements the Collection interface:
> 
> And the very same could be done for java.util.ArrayList. It should be easy
> enough by following the JavaSet/PythonSet example.
> 
> If you send in a patch that implements this, I'd be glad to integrate it !
> 
> Andi..
> 



Re: AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Andi Vajda <va...@apache.org>.
  Hi Thomas,

On Wed, 1 Feb 2012, Thomas Koch wrote:

> OK, I found a solution (obviously not the best one...): lucene.Set is
> representing a java.util *interface* Set<E> which of course cannot be
> instantiated. HashSet is an implementing class, and can be instantiated. You
> can add elements via the add() method to the set then. Example:
>
> def get_lucene_set(python_list):
>    """convert python list into lucene.Set (Java.util.set interface)
>          using the HashSet class (java.util) wrapped in lucene.HashSet
>    """
>    hs = lucene.HashSet()
>    for el in python_list:
>        hs.add(el)
>    return hs
>
> However I'm still looking for a more elegant constructor that would allow to
> create a HashSet from a python set (or list). Is that available/possible?

In pylucene's python directory, there is a file called collections.py that 
has what you're looking for, I think.

It's a Python class called JavaSet, that extends a PythonSet class which is 
an extension point for the java.util.Set interface. PythonSet implements all 
the java.util.Set methods by calling the corresponding python methods on the 
JavaSet python class. PythonSet itself is defined in 
java/org/apache/pylucene/util/PythonSet.java.

With this pair of classes you have a Python-backed set object being 
integrated with Java via a java.util.Set implementation.
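
To illustrate the delegation idea only, here is a simplified pure-Python
sketch. It is not the code in collections.py: SetAdapter is a hypothetical
name, it does not extend PythonSet, and it covers just a few of the
java.util.Set methods. It shows Java-style method names being answered by an
ordinary Python set.

```python
class SetAdapter(object):
    """Hypothetical stand-in: Java-style Set methods backed by a Python set."""

    def __init__(self, items=()):
        self._set = set(items)

    def add(self, obj):
        # java.util.Set.add returns true only if the set actually changed.
        if obj in self._set:
            return False
        self._set.add(obj)
        return True

    def contains(self, obj):
        return obj in self._set

    def remove(self, obj):
        # Returns true only if the element was present.
        if obj in self._set:
            self._set.remove(obj)
            return True
        return False

    def size(self):
        return len(self._set)

    def isEmpty(self):
        return not self._set

    def toArray(self):
        # Order is arbitrary, as with a hash-based Java set.
        return list(self._set)
```

In the real pair of classes, the Java side (PythonSet) is what Java code
sees, and each of its methods calls back into the matching Python method.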

> The same holds for lists like the ArrayList (from java.util too) which
> implements the Collection interface:

And the very same could be done for java.util.ArrayList. It should be easy 
enough by following the JavaSet/PythonSet example.

If you send in a patch that implements this, I'd be glad to integrate it !

Andi..

>
> Example:
>>>> l =range(3)
>>>> l
> [0, 1, 2]
>>>> a = lucene.ArrayList(l)
>  Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  lucene.InvalidArgsError: (<type 'ArrayList'>, '__init__', ([0, 1, 2],))
>
> using the for-in-do obj.add "trick" allows to generate a 'filled' instance
> here as well : <ArrayList: [0, 1, 2]>
> but wouldn't it be nice to be able to create an instance more "pythonic"?
>
> I'm not a Java expert (nor do I know much about the Collections API), so
> maybe it's even impossible in Java to create an instance of a
> List,Vector,HashSet (whatever) and passing some literals (like Strings) -
> who knows...  So if anyone has a better idea how to do this in PyLucene
> please let me know ,-)
>
> regards,
> Thomas
>
>

AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
> Arrays.asList converts java arrays to java lists, and you can pass a python
> sequence to it.  From there, all of the collection constructors can be passed
> other collections.
> 
Thanks, Aric - that helped a lot.
Will also look at the hints Andi sent earlier today.

Regards,
Thomas



Re: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Aric Coady <ar...@gmail.com>.
On 2012 Feb 1, at 3:07 AM, Thomas Koch wrote:
> OK, I found a solution (obviously not the best one...): lucene.Set is
> representing a java.util *interface* Set<E> which of course cannot be
> instantiated. HashSet is an implementing class, and can be instantiated. You
> can add elements via the add() method to the set then. Example:
> 
> def get_lucene_set(python_list):
>    """convert python list into lucene.Set (Java.util.set interface)
>          using the HashSet class (java.util) wrapped in lucene.HashSet
>    """
>    hs = lucene.HashSet()
>    for el in python_list:
>        hs.add(el)
>    return hs
> 
> However I'm still looking for a more elegant constructor that would allow to
> create a HashSet from a python set (or list). Is that available/possible?

Arrays.asList converts java arrays to java lists, and you can pass a python sequence to it.  From there, all of the collection constructors can be passed other collections.

>>> lucene.Arrays.asList('abc')
<List: [a, b, c]>
>>> lucene.HashSet(lucene.Arrays.asList('abc'))
<HashSet: [b, c, a]>

> The same holds for lists like the ArrayList (from java.util too) which
> implements the Collection interface:
> 
> Example:
>>>> l =range(3)
>>>> l
> [0, 1, 2]
>>>> a = lucene.ArrayList(l)
>  Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  lucene.InvalidArgsError: (<type 'ArrayList'>, '__init__', ([0, 1, 2],))
> 
> using the for-in-do obj.add "trick" allows to generate a 'filled' instance
> here as well : <ArrayList: [0, 1, 2]>
> but wouldn't it be nice to be able to create an instance more "pythonic"?
> 
> I'm not a Java expert (nor do I know much about the Collections API), so
> maybe it's even impossible in Java to create an instance of a
> List,Vector,HashSet (whatever) and passing some literals (like Strings) -
> who knows...  So if anyone has a better idea how to do this in PyLucene
> please let me know ,-)
> 
> regards,
> Thomas
> 
> 


AW: Setting Stopword Set in PyLucene (or using Set in general)

Posted by Thomas Koch <ko...@orbiteam.de>.
OK, I found a solution (obviously not the best one...): lucene.Set is
representing a java.util *interface* Set<E> which of course cannot be
instantiated. HashSet is an implementing class, and can be instantiated. You
can add elements via the add() method to the set then. Example:

def get_lucene_set(python_list):
    """Convert a Python list into a lucene.Set (java.util.Set interface)
          using the HashSet class (java.util) wrapped in lucene.HashSet
    """
    hs = lucene.HashSet()
    for el in python_list:
        hs.add(el)
    return hs

However, I'm still looking for a more elegant constructor that would allow
creating a HashSet from a python set (or list). Is that available/possible?

The same holds for lists like the ArrayList (from java.util too) which
implements the Collection interface:

Example:
>>> l =range(3)
>>> l
[0, 1, 2]
>>> a = lucene.ArrayList(l)
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  lucene.InvalidArgsError: (<type 'ArrayList'>, '__init__', ([0, 1, 2],))
  
Using the for-in-do obj.add "trick" makes it possible to create a 'filled'
instance here as well: <ArrayList: [0, 1, 2]>
but wouldn't it be nice to be able to create an instance more "pythonic"?
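
Since the add() loop is identical for HashSet and ArrayList, it can be
written once. The following is only a sketch: fill_collection is a
hypothetical helper name, and it assumes nothing more than a class that can
be instantiated without arguments and has an add() method, as shown above
for lucene.HashSet and lucene.ArrayList.

```python
def fill_collection(factory, items):
    """Create an empty collection via factory() and add every item to it.

    factory is any zero-argument callable returning an object with an
    add() method, e.g. lucene.HashSet or lucene.ArrayList after initVM().
    """
    coll = factory()
    for item in items:
        coll.add(item)
    return coll
```

In a PyLucene session this would be called as
fill_collection(lucene.HashSet, ['der', 'die', 'das']) or
fill_collection(lucene.ArrayList, range(3)).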

I'm not a Java expert (nor do I know much about the Collections API), so
maybe it's even impossible in Java to create an instance of a
List, Vector, HashSet (whatever) by passing some literals (like Strings) -
who knows...  So if anyone has a better idea how to do this in PyLucene
please let me know ,-)

regards,
Thomas