Posted to pylucene-dev@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/04/28 07:28:58 UTC
SpanScorer Not implemented Error
Hello,
I am trying to get a list of all terms that matched a document. So, if I
search for "John Smith", I want to know if I found "John Smith" specifically
in the document. I can use the lucene results but I need to do more
processing based on exactly what was found. I am using a highlighter and
formatter for this, but if I use the QueryScorer it breaks up the phrase
into "John" and "Smith", but only if the whole name was found. I have
uncovered that maybe the SpanScorer would preserve the whole phrase, but
when I try to use it I get NotImplementedError. Has it not been interfaced
yet? Is it a difficult thing to do?
Thanks,
Max
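The difference between per-term and whole-phrase matching described above can be sketched without Lucene. This is only an illustration under stated assumptions: tokenize, term_matches and phrase_matches are hypothetical helpers, the regex tokenizer is a stand-in for a real analyzer, and none of this is how QueryScorer or SpanScorer is actually implemented.

```python
import re


def tokenize(text):
    # Lowercased word tokens; a crude stand-in for a real analyzer.
    return re.findall(r"\w+", text.lower())


def term_matches(text, query):
    """Positions of tokens that match any single query term."""
    terms = set(tokenize(query))
    return [i for i, tok in enumerate(tokenize(text)) if tok in terms]


def phrase_matches(text, query):
    """Start positions where all query terms occur consecutively, in order."""
    tokens = tokenize(text)
    phrase = tokenize(query)
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


text = "John Smith met Jane Smith; later John left."
per_term = term_matches(text, "John Smith")      # every "john" or "smith"
whole_phrase = phrase_matches(text, "John Smith")  # only the adjacent pair
```

Per-term matching flags every occurrence of either word, while phrase matching reports only the position where the two words appear together, which is the information the original question is after.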
Re: SpanScorer Not implemented Error
Posted by Andi Vajda <va...@apache.org>.
On Tue, 28 Apr 2009, Max Lynch wrote:
> Here is an example that fails. Running dir(SpanScorer) after your console
> commands shows that it has no attributes beyond the base Java object ones,
> and that it is the spans version of the class.
Indeed, the name clash between org.apache.lucene.search.spans.SpanScorer and
org.apache.lucene.search.highlight.SpanScorer was the problem.
I added support for a new command line argument to JCC, called --rename,
that makes it possible to list one (or more, comma-separated) classes to
rename for the Python wrapper.
For example, I added this to the PyLucene build:
--rename org.apache.lucene.search.highlight.SpanScorer=HighlighterSpanScorer
to rename org.apache.lucene.search.highlight.SpanScorer to
HighlighterSpanScorer.
Your example code then ran (after modifying it to use the new class name).
Andi..
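The shape of the --rename value described above (one or more comma-separated JavaClass=PythonName pairs) can be sketched as follows. This is a minimal illustration, not JCC's actual implementation: parse_renames and wrapper_name are hypothetical helper names, and the fallback to the simple class name only models the package flattening discussed in this thread.

```python
def parse_renames(spec):
    """Parse 'pkg.Class=NewName,pkg2.Other=NewOther' into a dict."""
    renames = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        java_name, sep, new_name = item.partition("=")
        java_name, new_name = java_name.strip(), new_name.strip()
        if not sep or not java_name or not new_name:
            raise ValueError("expected 'java.Class=PythonName', got %r" % item)
        renames[java_name] = new_name
    return renames


def wrapper_name(java_class, renames):
    """Renamed name if listed, otherwise the simple (package-less) class name."""
    return renames.get(java_class, java_class.rsplit(".", 1)[-1])


renames = parse_renames(
    "org.apache.lucene.search.highlight.SpanScorer=HighlighterSpanScorer"
)
highlight_name = wrapper_name("org.apache.lucene.search.highlight.SpanScorer", renames)
spans_name = wrapper_name("org.apache.lucene.search.spans.SpanScorer", renames)
```

With the mapping from the PyLucene build, the highlighter class resolves to HighlighterSpanScorer while the spans class keeps the plain SpanScorer name, so the two no longer collide.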
Re: SpanScorer Not implemented Error
Posted by Max Lynch <ih...@gmail.com>.
Here is an example that fails. Running dir(SpanScorer) after your console
commands shows that it has no attributes beyond the base Java object ones,
and that it is the spans version of the class.
# ====================================================================
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ====================================================================

from unittest import TestCase, main
from lucene import *


class TestFormatter(PythonFormatter):

    def __init__(self, testCase):
        super(TestFormatter, self).__init__()
        self.testCase = testCase

    def highlightTerm(self, originalText, group):
        if group.getTotalScore() <= 0:
            return originalText;

        self.testCase.countHighlightTerm(originalText)
        return "<b>" + originalText + "</b>"


class HighlighterTestCase(TestCase):
    """
    Unit tests ported from Java Lucene.
    2004 by Yura Smolsky ;)
    """

    FIELD_NAME = "contents"
    texts = [ "A wicked problem is one for which each attempt to create a solution changes the understanding of the problem. Wicked problems cannot be solved in a traditional linear fashion, because the problem definition evolves as new possible solutions are considered and/or implemented."
              "Wicked problems always occur in a social context -- the wickedness of the problem reflects the diversity among the stakeholders in the problem."
              "From http://cognexus.org/id42.htm"
              "Most projects in organizations -- and virtually all technology-related projects these days -- are about wicked problems. Indeed, it is the social complexity of these problems, not their technical complexity, that overwhelms most current problem solving and project management approaches."
              "This text has a typo in referring to whicked problems" ];

    def __init__(self, *args):
        super(HighlighterTestCase, self).__init__(*args)
        self.foundList = []
        self.parser = QueryParser(self.FIELD_NAME, StandardAnalyzer())

    def testSimpleHighlighter(self):
        self.doSearching("wicked")
        formatter = TestFormatter(self)
        for i in range(0, self.hits.length()):
            self.foundList = []
            text = self.hits.doc(i).get(self.FIELD_NAME)
            tokenStream = self.analyzer.tokenStream(self.FIELD_NAME,
                                                    StringReader(text))
            ctokenStream = CachingTokenFilter(tokenStream)
            highlighter = Highlighter(formatter, SpanScorer(self.query,
                                                            self.FIELD_NAME, ctokenStream))
            result = highlighter.getBestFragments(tokenStream, text,
                                                  maxNumFragmentsRequired,
                                                  "...")
            print "\t", result
            print "Found: ",
            print self.foundList

    def doSearching(self, queryString):
        searcher = IndexSearcher(self.ramDir)
        self.query = self.parser.parse(queryString)
        #self.query = PhraseQuery()
        #self.query = queries[0]
        #self.query.combine(queries)
        # for any multi-term queries to work (prefix, wildcard, range,
        # fuzzy etc) you must use a rewritten query!
        self.query = self.query.rewrite(self.reader)
        print "Searching for:", self.query.toString(self.FIELD_NAME)
        self.hits = searcher.search(self.query)
        self.numHighlights = 0

    def countHighlightTerm(self, found):
        self.foundList.append(found)
        self.numHighlights += 1 # update stats used in assertions

    def setUp(self):
        self.analyzer=StandardAnalyzer()
        self.ramDir = RAMDirectory()
        writer = IndexWriter(self.ramDir, self.analyzer, True)
        for text in self.texts:
            self.addDoc(writer, text)
        writer.optimize()
        writer.close()
        self.reader = IndexReader.open(self.ramDir)
        self.numHighlights = 0;

    def addDoc(self, writer, text):
        d = Document()
        f = Field(self.FIELD_NAME, text,
                  Field.Store.YES, Field.Index.TOKENIZED,
                  Field.TermVector.YES)
        d.add(f)
        writer.addDocument(d)


if __name__ == "__main__":
    import sys, lucene
    lucene.initVM(lucene.CLASSPATH)
    if '-loop' in sys.argv:
        sys.argv.remove('-loop')
        while True:
            try:
                main()
            except:
                pass
    else:
        main()
Re: SpanScorer Not implemented Error
Posted by Andi Vajda <va...@apache.org>.
On Tue, 28 Apr 2009, Max Lynch wrote:
>>
>> I noticed that in the lucene-java-2.4.1 svn release that was downloaded
>> when I built pylucene, there is no highlight folder anymore:
>> http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/
>>
>> From what I can tell though (From the build headers), pylucene thinks
>> SpanScorer is in spans and highlighter
>>
>>
>>
> Oh, I see that the highlighter package is a separate jar.
>
> Do you think there is an easy fix for the short term? Or should I try a
> different method in pylucene?
I can only repeat what I just said:
- please send me a piece of code to reproduce the problem
- if it's indeed a name clash, no, I don't think it'd be too hard to
enhance JCC to make it possible to rename name-clashing classes
Thanks !
Andi..
Re: SpanScorer Not implemented Error
Posted by Max Lynch <ih...@gmail.com>.
>
> I noticed that in the lucene-java-2.4.1 svn release that was downloaded
> when I built pylucene, there is no highlight folder anymore:
> http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/
>
> From what I can tell though (From the build headers), pylucene thinks
> SpanScorer is in spans and highlighter
>
>
>
Oh, I see that the highlighter package is a separate jar.
Do you think there is an easy fix for the short term? Or should I try a
different method in pylucene?
Thanks,
Max
Re: SpanScorer Not implemented Error
Posted by Max Lynch <ih...@gmail.com>.
> If you are trying to use the highlighter package's SpanScorer class, there
> may be a problem with it clashing (by name) with the
> org.apache.lucene.search.spans.SpanScorer class:
>
> >>> import lucene
> >>> lucene.initVM(lucene.CLASSPATH)
> >>> lucene.SpanScorer.class_
> <Class: class org.apache.lucene.search.spans.SpanScorer>
I noticed that in the lucene-java-2.4.1 svn release that was downloaded when
I built pylucene, there is no highlight folder anymore:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/
From what I can tell though (from the build headers), pylucene thinks
SpanScorer is in both spans and highlighter.
>
>
> But without a specific example of what you're trying to do, it's mostly
> just guesswork here.
>
> Could you please include a piece of code that reproduces the problem ?
> Thanks !
>
for i in range(0, self.hits.length()):
    text = self.hits.doc(i).get(self.FIELD_NAME)
    tokenStream = self.analyzer.tokenStream(self.FIELD_NAME,
                                            StringReader(text))
    ctokenStream = CachingTokenFilter(tokenStream)
    highlighter = Highlighter(formatter, SpanScorer(self.query,
                                                    self.FIELD_NAME, ctokenStream))
    result = highlighter.getBestFragments(tokenStream, text,
                                          maxNumFragmentsRequired,
                                          "...")
    print "\t", result
    print "Found: ",
It's the test_Highlighter.py code but I've changed it to try to test the
SpanScorer class.
Thanks,
Max
Re: SpanScorer Not implemented Error
Posted by Andi Vajda <va...@apache.org>.
On Tue, 28 Apr 2009, Max Lynch wrote:
> I am trying to get a list of all terms that matched a document. So, if I
> search for "John Smith", I want to know if I found "John Smith" specifically
> in the document. I can use the lucene results but I need to do more
> processing based on exactly what was found. I am using a highlighter and
> formatter for this, but if I use the QueryScorer it breaks up the phrase
> into "John" and "Smith", but only if the whole name was found. I have
> uncovered that maybe the SpanScorer would preserve the whole phrase, but
> when I try to use it I get NotImplementedError. Has it not been interfaced
> yet? Is it a difficult thing to do?
If you are trying to use the highlighter package's SpanScorer class, there
may be a problem with it clashing (by name) with the
org.apache.lucene.search.spans.SpanScorer class:
>>> import lucene
>>> lucene.initVM(lucene.CLASSPATH)
>>> lucene.SpanScorer.class_
<Class: class org.apache.lucene.search.spans.SpanScorer>
But without a specific example of what you're trying to do, it's mostly
just guesswork here.
If I guessed this right, enhancing JCC so that specific classes involved in
a name clash can be renamed in Python (because java packages are flattened
out in Python, yet not in the underlying generated C++) shouldn't be too
hard.
Could you please include a piece of code that reproduces the problem ?
Thanks !
Andi..
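Why flattening Java packages into one Python namespace produces a clash can be shown with a small, self-contained sketch. It assumes nothing about JCC's internals: find_name_clashes is a hypothetical helper that groups fully qualified class names by their simple name and reports any simple name claimed by more than one class.

```python
from collections import defaultdict


def find_name_clashes(java_classes):
    """Map each simple class name to its fully qualified names, keeping only clashes."""
    by_simple = defaultdict(list)
    for fqn in java_classes:
        # The simple name is everything after the last dot.
        by_simple[fqn.rsplit(".", 1)[-1]].append(fqn)
    return dict((name, fqns) for name, fqns in by_simple.items() if len(fqns) > 1)


classes = [
    "org.apache.lucene.search.spans.SpanScorer",
    "org.apache.lucene.search.highlight.SpanScorer",
    "org.apache.lucene.search.highlight.Highlighter",
]
clashes = find_name_clashes(classes)
```

Here only SpanScorer is reported, since both the spans and the highlight packages define a class with that simple name; in a flat namespace just one of them can be reachable under it.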