Posted to pylucene-dev@lucene.apache.org by Valery Khamenya <kh...@gmail.com> on 2009/08/14 12:33:26 UTC

What is the best way to call Lucene from Python?

Hi
what would be the best way to call Lucene from a Python application?

Is PyLucene really a good way for it?

In particular:

What about PyLucene's scalability?

What about PyLucene vs Lucene performance?
(this post is quite old: http://markmail.org/message/5pjbs7mdh4fpvsjb)

best regards
--
Valery A.Khamenya

Re: What is the best way to call Lucene from Python?

Posted by Valery Khamenya <kh...@gmail.com>.

OK, Andi, I see, thank you!
best regards
--
Valery A.Khamenya


On Sat, Aug 15, 2009 at 5:18 PM, Andi Vajda <va...@apache.org> wrote:

> It looks like a tight loop crossing the python/java barrier for every token
> in the input text. Sure, that can be slow.
>
> If that's what your app needs to do, it's better to write that loop in
> Java, generate Python wrapper access code to it with JCC and invoke it that
> way.
>
> Andi..

Re: What is the best way to call Lucene from Python?

Posted by Andi Vajda <va...@apache.org>.

It looks like a tight loop crossing the python/java barrier for every  
token in the input text. Sure, that can be slow.

If that's what your app needs to do, it's better to write that loop in  
Java, generate Python wrapper access code to it with JCC and invoke it  
that way.

Andi..
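
A minimal sketch of the approach described above, assuming a hypothetical helper class `TokenLoop` (the class and method names are not from this thread). The whole per-token loop runs inside the JVM and only the final count is returned. The letter/digit scanner is just a stand-in so the sketch compiles without Lucene on the classpath; in a real helper the loop body would be the `ts.next(token)` loop from the benchmark, and the class would be wrapped for Python with JCC.

```java
// Hypothetical helper: keeps the per-token loop on the Java side.
public class TokenLoop {

    // Counts runs of letters/digits in the text. This simple scanner is a
    // stand-in for draining a Lucene TokenStream; it is not Lucene's
    // StandardAnalyzer and will not produce the same token boundaries.
    public static int countTokens(String text) {
        int count = 0;
        boolean inToken = false;
        for (int i = 0; i < text.length(); i++) {
            boolean tokenChar = Character.isLetterOrDigit(text.charAt(i));
            if (tokenChar && !inToken) {
                count++; // a new token starts here
            }
            inToken = tokenChar;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("It looks like a tight loop"));
    }
}
```

With a helper of this shape, the Python side would make one call per document instead of one or more calls per token, so the Python/Java boundary is crossed a constant number of times regardless of the input size.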


Re: What is the best way to call Lucene from Python?

Posted by Valery Khamenya <kh...@gmail.com>.

Hi Andi,

 > If you have any questions, feel free to ask this list.
and here we go! :)

I've benchmarked the StandardAnalyzer against a 3 MB text file. This test (see
below) was executed both in PyLucene and in *plain* Lucene, i.e. in Java.

The execution time in Java was 1.7 sec, whereas the PyLucene test below on
the same machine took 37 sec.
20 times slower is quite a lot. Does this mean that one should rather move
the "explain" function out of the Python scope to avoid the wrapping
overhead? Or maybe I'm doing something wrong here? Hints are welcome :)
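
One detail in the listings that follow affects the comparison: the Python test runs 10 passes over the file (`range(10)`), while the Java test runs 100 (`i < 100`). Assuming each reported time covers the whole test run as posted, a per-pass comparison can be sketched as follows; the class and method names are hypothetical.

```java
// Hypothetical helper for comparing the two reported timings per pass.
public class PassTiming {

    // Seconds spent on a single pass over the input file.
    public static double perPass(double totalSeconds, int passes) {
        return totalSeconds / passes;
    }

    public static void main(String[] args) {
        double pySeconds = perPass(37.0, 10);    // PyLucene: 37 sec, range(10)
        double javaSeconds = perPass(1.7, 100);  // Java: 1.7 sec, i < 100
        System.out.printf("PyLucene: %.3f s/pass%n", pySeconds);
        System.out.printf("Java:     %.3f s/pass%n", javaSeconds);
        System.out.printf("Ratio:    %.0fx%n", pySeconds / javaSeconds);
    }
}
```

Under that assumption the per-pass gap is on the order of 200 times rather than 20, which would make the cost of the per-token loop in Python even more visible.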

##### PyLucene version of the test
from unittest import TestCase, main
import codecs
from lucene import *

class MyFirstTest(TestCase) :
    """
    """

    def tokenizeContent(self, content):
        analyzer = StandardAnalyzer()
        tokenStream = analyzer.tokenStream("dummy", StringReader(content))
        self.explain(tokenStream)

    def testMy1(self):
        f = codecs.open("./3Mb-monolith.txt", 'r', "utf-8")
        content = f.read()
        f.close()
        for i in range(10):
            self.tokenizeContent(content)

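    def explain(self, ts):
        # Each iteration crosses the Python/Java boundary: once to fetch
        # the next token and again for the termText() call.
        for t in ts:
            t.termText()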


if __name__ == "__main__":
    import sys, lucene
    lucene.initVM(lucene.CLASSPATH)
    if '-loop' in sys.argv:
        sys.argv.remove('-loop')
        while True:
            try:
                main()
            except:
                pass
    else:
        main()

//////////////////////////////////////////////////////////////
// Java Lucene version of the same test

package my.test;

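import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;

// Assumed sources for IOUtils and @Test: Apache Commons IO and JUnit 4.
import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.junit.Test;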

public class MyFirstTest {

 @Test
 public void readAAJob() throws IOException {
  InputStream resourceAsStream = getClass().getResourceAsStream(
    "/3Mb-monolith.txt");
  String content = IOUtils.toString(resourceAsStream);

  for (int i = 0; i < 100; i++)
   tokenizeContent(content);

 }

 private void tokenizeContent(String content) throws IOException {
  StandardAnalyzer analyzer = new StandardAnalyzer();

  TokenStream tokenStream = analyzer.tokenStream("dummy",
    new StringReader(content));

  explain(tokenStream);
 }

 public void explain(TokenStream ts) throws IOException {
  Token token = new Token();
  int i = 0;
  while ((token = ts.next(token)) != null) {
   // System.out.println("Token[ " + i + " ] = " + token.term());
   i++;
  }
 }
}


best regards
--
Valery A.Khamenya

Re: What is the best way to call Lucene from Python?

Posted by Andi Vajda <va...@apache.org>.

On Aug 15, 2009, at 11:33, Valery Khamenya <kh...@gmail.com> wrote:

> Hi Andi,
> thanks for the reply. I just got my hands on the "Lucene in Action" book and
> some questions disappeared.
>
> A great thing you did, Andi, PyLucene is a wonderful piece of work, really,
> thank you.

You're very welcome!

> Well, I plan to use PyLucene mostly for top-level API calls. That
> is, not that much data throughput, no huge number of calls. Extensions, if
> any, would be written in Java.
>
> So, now I am reading about scalability approaches in Lucene and, I hope, they
> will work in PyLucene too.

If you have any questions, feel free to ask this list. Lucene usage  
questions are best answered on the java-user@lucene.apache.org list,  
of course, since it has a larger audience.

Kind regards.

Andi..


Re: What is the best way to call Lucene from Python?

Posted by Valery Khamenya <kh...@gmail.com>.

Hi Andi,
thanks for the reply. I just got my hands on the "Lucene in Action" book and
some questions disappeared.

A great thing you did, Andi, PyLucene is a wonderful piece of work, really,
thank you.

Well, I plan to use PyLucene mostly for top-level API calls. That
is, not that much data throughput, no huge number of calls. Extensions, if
any, would be written in Java.

So, now I am reading about scalability approaches in Lucene and, I hope, they
will work in PyLucene too.

best regards
--
Valery A.Khamenya

On Fri, Aug 14, 2009 at 10:45 PM, Andi Vajda <va...@apache.org> wrote:

> PyLucene is a Python wrapper around Java Lucene. A Java VM is embedded in
> the Python VM's process. Its performance and scalability are very similar to
> regular Java Lucene except when crossing the Python/Java VM boundary when
> writing Lucene extensions in Python.
>
> Andi..

Re: What is the best way to call Lucene from Python?

Posted by Andi Vajda <va...@apache.org>.

On Aug 14, 2009, at 12:33, Valery Khamenya <kh...@gmail.com> wrote:

> Hi
> what would be the best way to call Lucene from a Python application?
>
> Is PyLucene really a good way for it?
>
> In particular:
>
> What about PyLucene's scalability?
>
> What about PyLucene vs Lucene performance?
> (this post is quite old: http://markmail.org/message/5pjbs7mdh4fpvsjb)

PyLucene is a Python wrapper around Java Lucene. A Java VM is embedded  
in the Python VM's process. Its performance and scalability are very  
similar to regular Java Lucene except when crossing the Python/Java VM  
boundary when writing Lucene extensions in Python.

Andi..
