Posted to pylucene-dev@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/04/28 07:28:58 UTC

SpanScorer Not implemented Error

Hello,

I am trying to get a list of all terms that matched a document.  So, if I
search for "John Smith", I want to know if I found "John Smith" specifically
in the document.  I can use the lucene results but I need to do more
processing based on exactly what was found.  I am using a highlighter and
formatter for this, but if I use the QueryScorer it breaks up the phrase
into "John" and "Smith", but only if the whole name was found.  I have
uncovered that maybe the SpanScorer would preserve the whole phrase, but
when I try to use it I get NotImplementedError.  Has it not been interfaced
yet?  Is it a difficult thing to do?

Thanks,
Max

Re: SpanScorer Not implemented Error

Posted by Andi Vajda <va...@apache.org>.

On Tue, 28 Apr 2009, Max Lynch wrote:

> Here is an example that is failing.  However, just running dir(SpanScorer)
> with your console commands shows that it has no extra attributes other than
> the base java object ones, and that it is using the spans version of the code.

Indeed, the name clash between org.apache.lucene.search.spans.SpanScorer and
org.apache.lucene.search.highlight.SpanScorer was the problem.

I added support for a new command line argument to JCC, called --rename,
that makes it possible to list one (or more, comma-separated) classes to
rename for the Python wrapper.

For example, I added this to the PyLucene build:
   --rename org.apache.lucene.search.highlight.SpanScorer=HighlighterSpanScorer

to rename org.apache.lucene.search.highlight.SpanScorer to 
HighlighterSpanScorer.

Your example code then ran (after modifying it to use the new class name).
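
To illustrate the renaming rule described above, here is a minimal sketch in
plain Python. It is not JCC's actual code: parse_rename_arg and python_name
are hypothetical helpers, and the only things taken from this thread are the
"fully.qualified.Class=NewName" form, the comma separator, and the two
SpanScorer class names.

```python
def parse_rename_arg(value):
    """Parse 'pkg.Class=NewName[,pkg.Other=OtherName]' into a dict.

    Hypothetical helper: it mirrors the --rename value format shown in
    this thread, not JCC's real argument handling.
    """
    renames = {}
    for item in value.split(','):
        item = item.strip()
        if not item:
            continue
        java_name, sep, new_name = item.partition('=')
        if not sep or not java_name.strip() or not new_name.strip():
            raise ValueError("expected fully.qualified.Class=NewName, got %r" % item)
        renames[java_name.strip()] = new_name.strip()
    return renames


def python_name(java_class, renames):
    """Name a class would get in the flat Python namespace.

    Assumption: without a rename entry, the last component of the fully
    qualified Java name is used.
    """
    return renames.get(java_class, java_class.rsplit('.', 1)[-1])


renames = parse_rename_arg(
    "org.apache.lucene.search.highlight.SpanScorer=HighlighterSpanScorer")
spans_name = python_name("org.apache.lucene.search.spans.SpanScorer", renames)
highlight_name = python_name("org.apache.lucene.search.highlight.SpanScorer", renames)
```

With the rename in place, the two classes no longer share a Python name, so
the example would construct HighlighterSpanScorer(...) where it previously
used SpanScorer(...).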

Andi..


Re: SpanScorer Not implemented Error

Posted by Max Lynch <ih...@gmail.com>.

Here is an example that is failing.  However, just running
dir(SpanScorer) with your console commands shows that it has no extra
attributes other than the base java object ones, and that it is using
the spans version of the code.




# ====================================================================
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
# ====================================================================

from unittest import TestCase, main
from lucene import *


class TestFormatter(PythonFormatter):

	def __init__(self, testCase):
		super(TestFormatter, self).__init__()
		self.testCase = testCase

	def highlightTerm(self, originalText, group):
		if group.getTotalScore() <= 0:
			return originalText

		self.testCase.countHighlightTerm(originalText)

		return "<b>" + originalText + "</b>"


class HighlighterTestCase(TestCase):
	"""
	Unit tests ported from Java Lucene.
	2004 by Yura Smolsky ;)
	"""

	FIELD_NAME = "contents"
	# Each entry is one document; adjacent string literals are joined by
	# Python, and the commas keep the five texts separate.
	texts = [
		"A wicked problem is one for which each attempt to create a "
		"solution changes the understanding of the problem.  Wicked problems "
		"cannot be solved in a traditional linear fashion, because the problem "
		"definition evolves as new possible solutions are considered and/or "
		"implemented.",
		"Wicked problems always occur in a social context -- the wickedness "
		"of the problem reflects the diversity among the stakeholders in the "
		"problem.",
		"From http://cognexus.org/id42.htm",
		"Most projects in organizations -- and virtually all "
		"technology-related projects these days -- are about wicked problems.  "
		"Indeed, it is the social complexity of these problems, not their "
		"technical complexity, that overwhelms most current problem solving and "
		"project management approaches.",
		"This text has a typo in referring to whicked problems" ]


	def __init__(self, *args):

		super(HighlighterTestCase, self).__init__(*args)
		self.foundList = []
		self.parser = QueryParser(self.FIELD_NAME, StandardAnalyzer())

	def testSimpleHighlighter(self):

		self.doSearching("wicked")
		formatter = TestFormatter(self)

		for i in range(0, self.hits.length()):
			self.foundList = []
			text = self.hits.doc(i).get(self.FIELD_NAME)
			tokenStream = self.analyzer.tokenStream(self.FIELD_NAME,
					StringReader(text))

			ctokenStream = CachingTokenFilter(tokenStream)
			highlighter = Highlighter(formatter,
					SpanScorer(self.query, self.FIELD_NAME, ctokenStream))
			result = highlighter.getBestFragments(tokenStream, text,
					maxNumFragmentsRequired,
					"...")
			print "\t", result
			print "Found: ",
			print self.foundList

		

	def doSearching(self, queryString):

		searcher = IndexSearcher(self.ramDir)
		self.query = self.parser.parse(queryString)
		#self.query = PhraseQuery()
		#self.query = queries[0]
		#self.query.combine(queries)
		# for any multi-term queries to work (prefix, wildcard, range,
		# fuzzy etc) you must use a rewritten query!
		self.query = self.query.rewrite(self.reader)

		print "Searching for:", self.query.toString(self.FIELD_NAME)
		self.hits = searcher.search(self.query)
		self.numHighlights = 0

	def countHighlightTerm(self, found):

		self.foundList.append(found)

		self.numHighlights += 1 # update stats used in assertions

	def setUp(self):

		self.analyzer=StandardAnalyzer()
		self.ramDir = RAMDirectory()
		writer = IndexWriter(self.ramDir, self.analyzer, True)
		for text in self.texts:
			self.addDoc(writer, text)

		writer.optimize()
		writer.close()
		self.reader = IndexReader.open(self.ramDir)
		self.numHighlights = 0

	def addDoc(self, writer, text):

		d = Document()
		f = Field(self.FIELD_NAME, text,
				Field.Store.YES, Field.Index.TOKENIZED,
				Field.TermVector.YES)
		d.add(f)
		writer.addDocument(d)


if __name__ == "__main__":
	import sys, lucene
	lucene.initVM(lucene.CLASSPATH)
	if '-loop' in sys.argv:
		sys.argv.remove('-loop')
		while True:
			try:
				main()
			except:
				pass
	else:
		main()

Re: SpanScorer Not implemented Error

Posted by Andi Vajda <va...@apache.org>.

On Tue, 28 Apr 2009, Max Lynch wrote:

>>
>> I noticed that in the lucene-java-2.4.1 svn release that was downloaded
>> when I built pylucene, there is no highlight folder anymore:
>> http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/
>>
>> From what I can tell though (From the build headers), pylucene thinks
>> SpanScorer is in spans and highlighter
>>
>>
>>
> Oh, I see that the highlighter package is a separate jar.
>
> Do you think there is an easy fix for the short term?  Or should I try a
> different method in pylucene?

I can only repeat what I just said:
   - please send me a piece of code to reproduce the problem
   - if it's indeed a name clash, no, I don't think it'd be too hard to
     enhance JCC to make it possible to rename name-clashing classes

Thanks !

Andi..

Re: SpanScorer Not implemented Error

Posted by Max Lynch <ih...@gmail.com>.

>
> I noticed that in the lucene-java-2.4.1 svn release that was downloaded
> when I built pylucene, there is no highlight folder anymore:
> http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/
>
> From what I can tell though (From the build headers), pylucene thinks
> SpanScorer is in spans and highlighter
>
>
>
Oh, I see that the highlighter package is a separate jar.

Do you think there is an easy fix for the short term?  Or should I try a
different method in pylucene?

Thanks,
Max

Re: SpanScorer Not implemented Error

Posted by Max Lynch <ih...@gmail.com>.

> If you are trying to use the highlighter package's SpanScorer class, there
> may be a problem with it clashing (by name) with the
> org.apache.lucene.search.spans.SpanScorer class:
>
>  >>> import lucene
>  >>> lucene.initVM(lucene.CLASSPATH)
>  >>> lucene.SpanScorer.class_
>  <Class: class org.apache.lucene.search.spans.SpanScorer>


I noticed that in the lucene-java-2.4.1 svn release that was downloaded when
I built pylucene, there is no highlight folder anymore:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/search/

From what I can tell though (from the build headers), pylucene thinks
SpanScorer is in both spans and highlighter.




>
>
> But without a specific example of what you're trying to do, it's mostly
> just guesswork here.
>
> Could you please include a piece of code that reproduces the problem ?
> Thanks !
>


        for i in range(0, self.hits.length()):
            text = self.hits.doc(i).get(self.FIELD_NAME)
            tokenStream = self.analyzer.tokenStream(self.FIELD_NAME,
                    StringReader(text))

            ctokenStream = CachingTokenFilter(tokenStream)
            highlighter = Highlighter(formatter,
                    SpanScorer(self.query, self.FIELD_NAME, ctokenStream))
            result = highlighter.getBestFragments(tokenStream, text,
                    maxNumFragmentsRequired,
                    "...")
            print "\t", result
            print "Found: ",


It's the test_Highlighter.py code but I've changed it to try to test the
SpanScorer class.

Thanks,
Max

Re: SpanScorer Not implemented Error

Posted by Andi Vajda <va...@apache.org>.

On Tue, 28 Apr 2009, Max Lynch wrote:

> I am trying to get a list of all terms that matched a document.  So, if I
> search for "John Smith", I want to know if I found "John Smith" specifically
> in the document.  I can use the lucene results but I need to do more
> processing based on exactly what was found.  I am using a highlighter and
> formatter for this, but if I use the QueryScorer it breaks up the phrase
> into "John" and "Smith", but only if the whole name was found.  I have
> uncovered that maybe the SpanScorer would preserve the whole phrase, but
> when I try to use it I get NotImplementedError.  Has it not been interfaced
> yet?  Is it a difficult thing to do?

If you are trying to use the highlighter package's SpanScorer class, there 
may be a problem with it clashing (by name) with the 
org.apache.lucene.search.spans.SpanScorer class:

   >>> import lucene
   >>> lucene.initVM(lucene.CLASSPATH)
   >>> lucene.SpanScorer.class_
   <Class: class org.apache.lucene.search.spans.SpanScorer>
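
The same check can be wrapped in a small helper. This is only a sketch:
wrapped_class_of is a hypothetical name, the class_ attribute is the one
shown in the session above, and the Fake* classes are stand-ins so that the
snippet runs without PyLucene or a JVM.

```python
def wrapped_class_of(module, name):
    """Describe the Java class behind module.<name>, or return None.

    Relies only on the class_ attribute seen in the console session;
    the exact string form under real PyLucene may differ.
    """
    wrapper = getattr(module, name, None)
    if wrapper is None:
        return None
    return str(wrapper.class_)


# Stand-ins for the real module and wrapper (hypothetical, for illustration).
class FakeSpanScorer(object):
    class_ = "class org.apache.lucene.search.spans.SpanScorer"


class FakeLucene(object):
    SpanScorer = FakeSpanScorer


found = wrapped_class_of(FakeLucene, "SpanScorer")
missing = wrapped_class_of(FakeLucene, "HighlighterSpanScorer")
```

Against a real build one would presumably call lucene.initVM(lucene.CLASSPATH)
first and then pass the lucene module itself.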

But without a specific example of what you're trying to do, it's mostly 
just guesswork here.

If I guessed this right, enhancing JCC so that specific classes involved in 
a name clash can be renamed in Python (because java packages are flattened 
out in Python, yet not in the underlying generated C++) shouldn't be too 
hard.
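
As a rough illustration of why flattening causes the clash, the sketch below
groups fully qualified Java class names by their last component. It is plain
Python and not how JCC works internally; find_name_clashes is a hypothetical
helper, and the class list holds only names that come up in this discussion.

```python
from collections import defaultdict


def find_name_clashes(java_classes):
    """Map each short name to the fully qualified names that share it.

    Only short names used by more than one class are returned, i.e. the
    names that would collide in a single flat namespace.
    """
    by_short_name = defaultdict(list)
    for fq_name in java_classes:
        by_short_name[fq_name.rsplit('.', 1)[-1]].append(fq_name)
    return dict((short, names) for short, names in by_short_name.items()
                if len(names) > 1)


classes = [
    "org.apache.lucene.search.spans.SpanScorer",
    "org.apache.lucene.search.highlight.SpanScorer",
    "org.apache.lucene.search.highlight.Highlighter",
]
clashes = find_name_clashes(classes)
```

Here only "SpanScorer" is reported, with both of its packages, which is the
situation a per-class rename would resolve.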

Could you please include a piece of code that reproduces the problem ?
Thanks !

Andi..