You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Roman Chyla <ro...@gmail.com> on 2020/08/05 22:40:53 UTC

When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Hello devs,

I wanted to create an issue but the helpful message in red letters
reminded me to ask first.

While porting from lucene 6.x to 7x I'm struggling with a change that
was introduced in LUCENE-7626
(https://issues.apache.org/jira/browse/LUCENE-7626)

It is believed that zero offset tokens are bad bad - Mike McCandles
made the change which made me automatically doubt myself. I must be
wrong, hell, I was living in sin the past 5 years!

Sadly, we have been indexing and searching large volumes of data
without any corruption in index whatsover, but also without this new
change:

https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774

With that change, our multi-token synonyms house of cards is falling.

Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Recommended way to index multi-token synonyms appears to be this:
https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr

BUT, but! We don't want to place multi-token synonym into the same
position as the other words. We want to preserve their positions! We
want to preserve informaiton about offsets!

Here is an example:

* THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program

This is how it gets indexed

[(0, []),
(1, ['acr::hubble']),
(2, ['constant']),
(3, ['summary']),
(4, []),
(5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
(6, ['acr::space', 'space']),
(7, ['acr::telescope', 'telescope']),
(8, ['program']),

Notice the position 5 - multi-token synonym `syn::hubble space
telescope` token is on the first token which started the group
(emitted by Lucene's synonym filter). hst is another synonym; we also
index the 'hubble' word there.

 if you were to search for a phrase "HST program" it will be found
because our search parser will search for ("HST ? ? program" | "Hubble
Space Telescope program")

It simply found that by looking at synonyms: HST -> Hubble Space Telescope

And because of those funny 'syn::' prefixes, we don't suffer from the
other problem that Mike described -- "hst space" phrase search will
NOT find this paper (and that is a correct behaviour)

But all of this is possible only because lucene was indexing tokens
with offsets that can be lower than the last emitted token; for
example 'hubble space telescope' wil have offset 21-45; and the next
emitted token "space" will have offset 28-33

And it just works (lucene 6.x)

Here is another proof with the appropriate verbiage ("crazy"):

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618

Zero offsets have been working wonderfully for us so far. And I
actually cannot imagine how it can work without them - i.e. without
the ability to emit a token stream with offsets that are lower than
the last seen token.

I haven't tried SynonymFlatten filter, but because of this line in the
DefaultIndexingChain - I'm convinced the flatten symbol is not going
to do what we need (as seen in the example above)

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915

What would you say? Is it a bug, is it not a bug but just some special
usecase? If it is a special usecase, what do we need to do? Plug in
our own indexing chain?

Thanks!

  -roman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

Hi Mike,

Sorry for the delay, I was away last week. Now that I get back to it
again my plan is to write a test for the WordDelimiterFilter and
pinpoint the problem.

Cheers,

  Roman

On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> No need for anyone to be falling on swords here!  This is really complicated stuff, no worries.  And I think we have a compelling plan to move forwards so that we can index multi-token synonyms AND have 100% correct positional queries at search time, thanks to Michael Gibney's cool approach on https://issues.apache.org/jira/browse/LUCENE-4312.
>
> So it looks like WordDelimiterGraphFilter is producing buggy (out of order offsets) tokens here?
>
> Or are you running SynonymGraphFilter after WordDelimiterFilter?
>
> Looking at that failing example, it should have output'd that spacetime token immediately after the space token, not after the time token.
>
> Maybe use TokenStreamToDot to visualize what the heck token graph you are getting ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Aug 18, 2020 at 9:41 PM Roman Chyla <ro...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> I'm sorry, the problem all the time is inside related to a
>> word-delimiter filter factory. This is embarrassing but I have to
>> admit publicly and self-flagellate.
>>
>> A word-delimiter filter is used to split tokens, these then are used
>> to find multi-token synonyms (hence the connection). In my desire to
>> simplify, I have omitted that detail while writing my first email.
>>
>> I went to generate the stack trace:
>>
>> ```
>> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>>         "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE
>> TELESCOPE program"));```
>>
>> stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
>> stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
>> stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM
>> offsetStart=38 offsetEnd=60
>> stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>
>> that worked, only the next one failed:
>>
>> ```assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>>         "title", "MIT and anti de sitter space-time"));```
>>
>>
>> stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
>> stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
>> stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=syn::massachusetts institute of technology pos=0
>> type=SYNONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
>> stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
>> stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
>> stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
>> stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
>> stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
>> stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33
>>
>> ```325677 ERROR
>> (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40])
>> [    ] o.a.s.h.RequestHandlerBase
>> org.apache.solr.common.SolrException: Exception writing document id
>> 605 to the index; possible analysis error: startOffset must be
>> non-negative, and endOffset must be >= startOffset, and offsets must
>> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
>> field 'title'
>> at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
>> at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
>> at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
>> at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
>> at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
>> at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
>> at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
>> at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
>> at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
>> at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
>> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
>> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
>> at org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
>> at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
>> at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
>> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>> at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
>> at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
>> at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
>> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
>> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
>> at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
>> at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
>> at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.IllegalArgumentException: startOffset must be
>> non-negative, and endOffset must be >= startOffset, and offsets must
>> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
>> field 'title'
>> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
>> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
>> at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
>> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
>> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
>> at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
>> at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
>> at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
>> at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
>> ... 61 more
>> ```
>>
>> Embarrassingly Yours,
>>
>>   Roman
>>
>>
>>
>> On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> >
>> > Hi Roman,
>> >
>> > Can you share the full exception / stack trace that IndexWriter throws on that one *'d token in your first example?  I thought IndexWriter checks 1) startOffset >= last token's startOffset, and 2) endOffset >= startOffset for the current token.
>> >
>> > But you seem to be hitting an exception due to endOffset check across tokens, which I didn't remember/realize IW was enforcing.
>> >
>> > Could you share a small standalone test case showing the first example?  Maybe attach it to the issue (http://issues.apache.org/jira/browse/LUCENE-8776)?
>> >
>> > Thanks,
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <ro...@gmail.com> wrote:
>> >>
>> >> Hi Mike,
>> >>
>> >> Thanks for the question! And sorry for the delay, I haven't managed to
>> >> get to it yesterday. I have generated better output, marked with (*)
>> >> where it currently fails the first time and also included one extra
>> >> case to illustrate the PositionLength attribute.
>> >>
>> >> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>> >>         "title", "THE HUBBLE constant: a summary of the hubble space
>> >> telescope program"));
>> >>
>> >>
>> >> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
>> >> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
>> >> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
>> >> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
>> >> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
>> >> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
>> >> offsetStart=38 offsetEnd=60
>> >> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
>> >> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
>> >> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
>> >> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>> >>
>> >> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered
>> >> (the multi-token synonym emitted as a last token) it would fail as
>> >> well, because of the check for lastToken.beginOffset <
>> >> currentToken.beginOffset. Basically, any reordering would result in a
>> >> failure (unless offsets are trimmed).
>> >>
>> >>
>> >>
>> >> The following example has additional twist because of `space-time`;
>> >> the tokenizer first splits the word and generate two new tokens --
>> >> those alternative tokens are then used to find synonyms (space ==
>> >> universe)
>> >>
>> >> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>> >>         "title", "MIT and anti de sitter space-time"));
>> >>
>> >>
>> >> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
>> >> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
>> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> >> term=syn::massachusetts institute of technology posInc=0 posLen=1
>> >> type=SYNONYM offsetStart=0 offsetEnd=3
>> >> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
>> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> >> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
>> >> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> >> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
>> >> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
>> >> offsetStart=8 offsetEnd=28
>> >> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
>> >> offsetStart=8 offsetEnd=28
>> >> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
>> >> offsetStart=8 offsetEnd=28
>> >> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
>> >> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
>> >> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
>> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
>> >> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
>> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>> >>
>> >> So far, all of these cases could be handled with the new position
>> >> length attribute. But let us look at a case where that would fail too.
>> >>
>> >> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>> >>         "title", "Massachusetts Institute of Technology and
>> >> antidesitter space-time"));
>> >>
>> >>
>> >> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
>> >> term=syn::massachusetts institute of technology posInc=0 posLen=4
>> >> type=SYNONYM offsetStart=0 offsetEnd=36
>> >> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
>> >> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
>> >> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
>> >> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
>> >> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
>> >> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> >> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
>> >> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
>> >> offsetStart=41 offsetEnd=59
>> >> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
>> >> offsetStart=41 offsetEnd=59
>> >> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
>> >> offsetStart=41 offsetEnd=59
>> >> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
>> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
>> >> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
>> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>> >>
>> >> Notice the posLen=4 of MIT; it would cover tokens `massachusetts
>> >> institute technology antidesitter` while offsets are still correct.
>> >>
>> >> This would, I think, affect not only highlighting, but also search
>> >> (which is, at least for us, more important). But I can imagine that in
>> >> more NLP-related domains, ability to identify the source of a
>> >> transformation could be more than a highlighting problem.
>> >>
>> >> Admittedly, most users would not care to notice, but it might be
>> >> important to some. Fundamentally, I think, the problem translates to
>> >> inability to reconstruct the DAG graph (under certain circumstances)
>> >> because of the lost pieces of information.
>> >>
>> >> ~roman
>> >>
>> >> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >> >
>> >> > Hi Roman,
>> >> >
>> >> > Sorry for the late reply!
>> >> >
>> >> > I think there remains substantial confusion about multi-token synonyms and IW's enforcement of offsets.  It really is worth thoroughly iterating/understanding your examples so we can get to the bottom of this.  It looks to me it is possible to emit tokens whose offsets do not go backwards and that properly model your example synonyms, so I do not yet see what the problem is.  Maybe I am being blind/tired ...
>> >> >
>> >> > What do you mean by pos=2, pos=0, etc.?  I think that is really the position increment?  Can you re-do the examples with posInc instead?  (Alternatively, you could keep "pos" but make it the absolute position, not the increment?).
>> >> >
>> >> > Could you also add posLength to each token?  This helps (me?) visualize the resulting graph, even though IW does not enforce it today.
>> >> >
>> >> > Looking at your first example, "THE HUBBLE constant: a summary of the hubble space telescope program", it looks to me like those tokens would all be accepted by IW's checks as they are?  startOffset never goes backwards, and for every token, endOffset >= startOffset.  Where in that first example does IW throw an exception?  Maybe insert a "** IW fails here" under the problematic token?  Or, maybe write a simple test case using e.g. CannedTokenStream?
>> >> >
>> >> > Your second example should also be fine, and not at all weird, but could you enumerate it into the specific tokens with posInc, posLength, start/end offset, "** IW fails here", etc., so we have a concrete example to discuss?
>> >> >
>> >> > Lucene's TokenStreams are really serializing a directed acyclic graph (DAG), in a specific order, one transition at a time.  Ironically/strangely, it is similar to the graph that git history maintains, and how "git log" then serializes that graph into an ordered series of transitions.  The simple int position in Lucene's TokenStream corresponds to git's githashes, to uniquely identify each "node", though, I do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a timestamp?
>> >> >
>> >> > Mike McCandless
>> >> >
>> >> > http://blog.mikemccandless.com
>> >> >
>> >> >
>> >> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>> >> >>
>> >> >> Hi Mike,
>> >> >>
>> >> >> Yes, they are not zero offsets - I was instinctively avoiding
>> >> >> "negative offsets"; but they are indeed backward offsets.
>> >> >>
>> >> >> Here is the token stream as produced by the analyzer chain indexing
>> >> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >> >>
>> >> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> >> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> >> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> >> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> >> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> >> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> >> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>> >> >>
>> >> >> Sometimes, we'll even have a situation when synonyms overlap: for
>> >> >> example "anti de sitter space time"
>> >> >>
>> >> >> "anti de sitter space time" -> "antidesitter space" (one token
>> >> >> spanning offsets 0-26; it gets emitted with the first token "anti"
>> >> >> right now)
>> >> >> "space time" -> "spacetime" (synonym 16-26)
>> >> >> "space" -> "universe" (25-26)
>> >> >>
>> >> >> Yes, weird, but useful if people want to search for `universe NEAR
>> >> >> anti` -- but another usecase which would be prohibited by the "new"
>> >> >> rule.
>> >> >>
>> >> >> DefaultIndexingChain checks new token offset against the last emitted
>> >> >> token, so I don't see a way to emit the multi-token synonym with
>> >> >> offsetts spanning multiple tokens if even one of these tokens was
>> >> >> already emitted. And the complement is equally true: if multi-token is
>> >> >> emitted as last of the group - it trips over `startOffset <
>> >> >> invertState.lastStartOffset`
>> >> >>
>> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >> >>
>> >> >>
>> >> >>   -roman
>> >> >>
>> >> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> >
>> >> >> > Hi Roman,
>> >> >> >
>> >> >> > Hmm, this is all very tricky!
>> >> >> >
>> >> >> > First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>> >> >> >
>> >> >> > Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>> >> >> >
>> >> >> > So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>> >> >> >
>> >> >> > Mike McCandless
>> >> >> >
>> >> >> > http://blog.mikemccandless.com
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> Hello devs,
>> >> >> >>
>> >> >> >> I wanted to create an issue but the helpful message in red letters
>> >> >> >> reminded me to ask first.
>> >> >> >>
>> >> >> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>> >> >> >> was introduced in LUCENE-7626
>> >> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >> >> >>
>> >> >> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>> >> >> >> made the change which made me automatically doubt myself. I must be
>> >> >> >> wrong, hell, I was living in sin the past 5 years!
>> >> >> >>
>> >> >> >> Sadly, we have been indexing and searching large volumes of data
>> >> >> >> without any corruption in index whatsover, but also without this new
>> >> >> >> change:
>> >> >> >>
>> >> >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >> >> >>
>> >> >> >> With that change, our multi-token synonyms house of cards is falling.
>> >> >> >>
>> >> >> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>> >> >> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >> >> >>
>> >> >> >> Recommended way to index multi-token synonyms appears to be this:
>> >> >> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >> >> >>
>> >> >> >> BUT, but! We don't want to place multi-token synonym into the same
>> >> >> >> position as the other words. We want to preserve their positions! We
>> >> >> >> want to preserve informaiton about offsets!
>> >> >> >>
>> >> >> >> Here is an example:
>> >> >> >>
>> >> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >> >> >>
>> >> >> >> This is how it gets indexed
>> >> >> >>
>> >> >> >> [(0, []),
>> >> >> >> (1, ['acr::hubble']),
>> >> >> >> (2, ['constant']),
>> >> >> >> (3, ['summary']),
>> >> >> >> (4, []),
>> >> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>> >> >> >> (6, ['acr::space', 'space']),
>> >> >> >> (7, ['acr::telescope', 'telescope']),
>> >> >> >> (8, ['program']),
>> >> >> >>
>> >> >> >> Notice the position 5 - multi-token synonym `syn::hubble space
>> >> >> >> telescope` token is on the first token which started the group
>> >> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >> >> >> index the 'hubble' word there.
>> >> >> >>
>> >> >> >>  if you were to search for a phrase "HST program" it will be found
>> >> >> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >> >> >> Space Telescope program")
>> >> >> >>
>> >> >> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>> >> >> >>
>> >> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >> >> >> other problem that Mike described -- "hst space" phrase search will
>> >> >> >> NOT find this paper (and that is a correct behaviour)
>> >> >> >>
>> >> >> >> But all of this is possible only because lucene was indexing tokens
>> >> >> >> with offsets that can be lower than the last emitted token; for
>> >> >> >> example 'hubble space telescope' wil have offset 21-45; and the next
>> >> >> >> emitted token "space" will have offset 28-33
>> >> >> >>
>> >> >> >> And it just works (lucene 6.x)
>> >> >> >>
>> >> >> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >> >> >>
>> >> >> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >> >> >>
>> >> >> >> Zero offsets have been working wonderfully for us so far. And I
>> >> >> >> actually cannot imagine how it can work without them - i.e. without
>> >> >> >> the ability to emit a token stream with offsets that are lower than
>> >> >> >> the last seen token.
>> >> >> >>
>> >> >> >> I haven't tried SynonymFlatten filter, but because of this line in the
>> >> >> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> >> >> >> to do what we need (as seen in the example above)
>> >> >> >>
>> >> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >> >> >>
>> >> >> >> What would you say? Is it a bug, is it not a bug but just some special
>> >> >> >> usecase? If it is a special usecase, what do we need to do? Plug in
>> >> >> >> our own indexing chain?
>> >> >> >>
>> >> >> >> Thanks!
>> >> >> >>
>> >> >> >>   -roman
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Roman,

No need for anyone to be falling on swords here!  This is really
complicated stuff, no worries.  And I think we have a compelling plan to
move forwards so that we can index multi-token synonyms AND have 100%
correct positional queries at search time, thanks to Michael Gibney's cool
approach on https://issues.apache.org/jira/browse/LUCENE-4312.

So it looks like WordDelimiterGraphFilter is producing buggy (out of order
offsets) tokens here?

Or are you running SynonymGraphFilter after WordDelimiterFilter?

Looking at that failing example, it should have output'd that spacetime
token immediately after the space token, not after the time token.

Maybe use TokenStreamToDot to visualize what the heck token graph you are
getting ...

Mike McCandless

http://blog.mikemccandless.com


On Tue, Aug 18, 2020 at 9:41 PM Roman Chyla <ro...@gmail.com> wrote:

> Hi Mike,
>
> I'm sorry, the problem all the time is inside related to a
> word-delimiter filter factory. This is embarrassing but I have to
> admit publicly and self-flagellate.
>
> A word-delimiter filter is used to split tokens, these then are used
> to find multi-token synonyms (hence the connection). In my desire to
> simplify, I have omitted that detail while writing my first email.
>
> I went to generate the stack trace:
>
> ```
> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>         "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE
> TELESCOPE program"));```
>
> stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
> stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
> stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
> stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4
> offsetEnd=10
> stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM
> offsetStart=38 offsetEnd=60
> stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
> stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68
>
> that worked, only the next one failed:
>
> ```assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>         "title", "MIT and anti de sitter space-time"));```
>
>
> stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
> stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
> stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
> stage:indexer term=syn::massachusetts institute of technology pos=0
> type=SYNONYM offsetStart=0 offsetEnd=3
> stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
> stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
> stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
> stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM
> offsetStart=8 offsetEnd=28
> stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM
> offsetStart=8 offsetEnd=28
> stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
> stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
> stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
> stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
> stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33
>
> ```325677 ERROR
> (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40])
> [    ] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: Exception writing document id
> 605 to the index; possible analysis error: startOffset must be
> non-negative, and endOffset must be >= startOffset, and offsets must
> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
> field 'title'
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
> at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
> at
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
> at
> org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
> at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
> at
> org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
> at
> org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
> at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
> at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
> at
> org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
> at
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
> at
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at
> org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: startOffset must be
> non-negative, and endOffset must be >= startOffset, and offsets must
> not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
> field 'title'
> at
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
> at
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> at
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
> at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
> at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
> at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
> at
> org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
> at
> org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
> ... 61 more
> ```
>
> Embarrassingly Yours,
>
>   Roman
>
>
>
> On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless
> <lu...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Can you share the full exception / stack trace that IndexWriter throws
> on that one *'d token in your first example?  I thought IndexWriter checks
> 1) startOffset >= last token's startOffset, and 2) endOffset >= startOffset
> for the current token.
> >
> > But you seem to be hitting an exception due to endOffset check across
> tokens, which I didn't remember/realize IW was enforcing.
> >
> > Could you share a small standalone test case showing the first example?
> Maybe attach it to the issue (
> http://issues.apache.org/jira/browse/LUCENE-8776)?
> >
> > Thanks,
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >>
> >> Hi Mike,
> >>
> >> Thanks for the question! And sorry for the delay, I haven't managed to
> >> get to it yesterday. I have generated better output, marked with (*)
> >> where it currently fails the first time and also included one extra
> >> case to illustrate the PositionLength attribute.
> >>
> >> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
> >>         "title", "THE HUBBLE constant: a summary of the hubble space
> >> telescope program"));
> >>
> >>
> >> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
> >> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4
> offsetEnd=10
> >> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
> >> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
> >> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
> >> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
> >> offsetStart=38 offsetEnd=60
> >> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
> >> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
> >> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
> >> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
> >> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
> >>
> >> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered
> >> (the multi-token synonym emitted as a last token) it would fail as
> >> well, because of the check for lastToken.beginOffset <
> >> currentToken.beginOffset. Basically, any reordering would result in a
> >> failure (unless offsets are trimmed).
> >>
> >>
> >>
> >> The following example has additional twist because of `space-time`;
> >> the tokenizer first splits the word and generate two new tokens --
> >> those alternative tokens are then used to find synonyms (space ==
> >> universe)
> >>
> >> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
> >>         "title", "MIT and anti de sitter space-time"));
> >>
> >>
> >> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0
> offsetEnd=13
> >> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> >> term=syn::massachusetts institute of technology posInc=0 posLen=1
> >> type=SYNONYM offsetStart=0 offsetEnd=3
> >> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
> >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> >> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
> >> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> >> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
> >> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
> >> offsetStart=8 offsetEnd=28
> >> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
> >> offsetStart=8 offsetEnd=28
> >> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
> >> offsetStart=8 offsetEnd=28
> >> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
> >> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
> >> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23
> offsetEnd=28
> >> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
> >>
> >> So far, all of these cases could be handled with the new position
> >> length attribute. But let us look at a case where that would fail too.
> >>
> >> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
> >>         "title", "Massachusetts Institute of Technology and
> >> antidesitter space-time"));
> >>
> >>
> >> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0
> offsetEnd=12
> >> term=syn::massachusetts institute of technology posInc=0 posLen=4
> >> type=SYNONYM offsetStart=0 offsetEnd=36
> >> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
> >> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
> >> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
> >> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
> >> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41
> offsetEnd=53
> >> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> >> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
> >> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
> >> offsetStart=41 offsetEnd=59
> >> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
> >> offsetStart=41 offsetEnd=59
> >> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
> >> offsetStart=41 offsetEnd=59
> >> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
> >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54
> offsetEnd=59
> >> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
> >> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
> >>
> >> Notice the posLen=4 of MIT; it would cover tokens `massachusetts
> >> institute technology antidesitter` while offsets are still correct.
> >>
> >> This would, I think, affect not only highlighting, but also search
> >> (which is, at least for us, more important). But I can imagine that in
> >> more NLP-related domains, ability to identify the source of a
> >> transformation could be more than a highlighting problem.
> >>
> >> Admittedly, most users would not care to notice, but it might be
> >> important to some. Fundamentally, I think, the problem translates to
> >> inability to reconstruct the DAG graph (under certain circumstances)
> >> because of the lost pieces of information.
> >>
> >> ~roman
> >>
> >> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >> >
> >> > Hi Roman,
> >> >
> >> > Sorry for the late reply!
> >> >
> >> > I think there remains substantial confusion about multi-token
> synonyms and IW's enforcement of offsets.  It really is worth thoroughly
> iterating/understanding your examples so we can get to the bottom of this.
> It looks to me it is possible to emit tokens whose offsets do not go
> backwards and that properly model your example synonyms, so I do not yet
> see what the problem is.  Maybe I am being blind/tired ...
> >> >
> >> > What do you mean by pos=2, pos=0, etc.?  I think that is really the
> position increment?  Can you re-do the examples with posInc instead?
> (Alternatively, you could keep "pos" but make it the absolute position, not
> the increment?).
> >> >
> >> > Could you also add posLength to each token?  This helps (me?)
> visualize the resulting graph, even though IW does not enforce it today.
> >> >
> >> > Looking at your first example, "THE HUBBLE constant: a summary of the
> hubble space telescope program", it looks to me like those tokens would all
> be accepted by IW's checks as they are?  startOffset never goes backwards,
> and for every token, endOffset >= startOffset.  Where in that first example
> does IW throw an exception?  Maybe insert a "** IW fails here" under the
> problematic token?  Or, maybe write a simple test case using e.g.
> CannedTokenStream?
> >> >
> >> > Your second example should also be fine, and not at all weird, but
> could you enumerate it into the specific tokens with posInc, posLength,
> start/end offset, "** IW fails here", etc., so we have a concrete example
> to discuss?
> >> >
> >> > Lucene's TokenStreams are really serializing a directed acyclic graph
> (DAG), in a specific order, one transition at a time.
> Ironically/strangely, it is similar to the graph that git history
> maintains, and how "git log" then serializes that graph into an ordered
> series of transitions.  The simple int position in Lucene's TokenStream
> corresponds to git's githashes, to uniquely identify each "node", though, I
> do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a
> timestamp?
> >> >
> >> > Mike McCandless
> >> >
> >> > http://blog.mikemccandless.com
> >> >
> >> >
> >> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com>
> wrote:
> >> >>
> >> >> Hi Mike,
> >> >>
> >> >> Yes, they are not zero offsets - I was instinctively avoiding
> >> >> "negative offsets"; but they are indeed backward offsets.
> >> >>
> >> >> Here is the token stream as produced by the analyzer chain indexing
> >> >> "THE HUBBLE constant: a summary of the hubble space telescope
> program"
> >> >>
> >> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> >> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >> >>
> >> >> Sometimes, we'll even have a situation when synonyms overlap: for
> >> >> example "anti de sitter space time"
> >> >>
> >> >> "anti de sitter space time" -> "antidesitter space" (one token
> >> >> spanning offsets 0-26; it gets emitted with the first token "anti"
> >> >> right now)
> >> >> "space time" -> "spacetime" (synonym 16-26)
> >> >> "space" -> "universe" (25-26)
> >> >>
> >> >> Yes, weird, but useful if people want to search for `universe NEAR
> >> >> anti` -- but another usecase which would be prohibited by the "new"
> >> >> rule.
> >> >>
> >> >> DefaultIndexingChain checks new token offset against the last emitted
> >> >> token, so I don't see a way to emit the multi-token synonym with
> >> >> offsetts spanning multiple tokens if even one of these tokens was
> >> >> already emitted. And the complement is equally true: if multi-token
> is
> >> >> emitted as last of the group - it trips over `startOffset <
> >> >> invertState.lastStartOffset`
> >> >>
> >> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >> >>
> >> >>
> >> >>   -roman
> >> >>
> >> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >
> >> >> > Hi Roman,
> >> >> >
> >> >> > Hmm, this is all very tricky!
> >> >> >
> >> >> > First off, why do you call this "zero offsets"?  Isn't it
> "backwards offsets" that your analysis chain is trying to produce?
> >> >> >
> >> >> > Second, in your first example, if you output the tokens in the
> right order, they would not violate the "offsets do not go backwards" check
> in IndexWriter?  I thought IndexWriter is just checking that the
> startOffset for a token is not lower than the previous token's
> startOffset?  (And that the token's endOffset is not lower than its
> startOffset).
> >> >> >
> >> >> > So I am confused why your first example is tripping up on IW's
> offset checks.  Could you maybe redo the example, listing single token per
> line with the start/end offsets they are producing?
> >> >> >
> >> >> > Mike McCandless
> >> >> >
> >> >> > http://blog.mikemccandless.com
> >> >> >
> >> >> >
> >> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >> >> >>
> >> >> >> Hello devs,
> >> >> >>
> >> >> >> I wanted to create an issue but the helpful message in red letters
> >> >> >> reminded me to ask first.
> >> >> >>
> >> >> >> While porting from lucene 6.x to 7x I'm struggling with a change
> that
> >> >> >> was introduced in LUCENE-7626
> >> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >> >> >>
> >> >> >> It is believed that zero offset tokens are bad bad - Mike
> McCandles
> >> >> >> made the change which made me automatically doubt myself. I must
> be
> >> >> >> wrong, hell, I was living in sin the past 5 years!
> >> >> >>
> >> >> >> Sadly, we have been indexing and searching large volumes of data
> >> >> >> without any corruption in index whatsover, but also without this
> new
> >> >> >> change:
> >> >> >>
> >> >> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >> >> >>
> >> >> >> With that change, our multi-token synonyms house of cards is
> falling.
> >> >> >>
> >> >> >> Mike has this wonderful blogpost explaining troubles with
> multi-token synonyms:
> >> >> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >> >> >>
> >> >> >> Recommended way to index multi-token synonyms appears to be this:
> >> >> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >> >> >>
> >> >> >> BUT, but! We don't want to place multi-token synonym into the same
> >> >> >> position as the other words. We want to preserve their positions!
> We
> >> >> >> want to preserve informaiton about offsets!
> >> >> >>
> >> >> >> Here is an example:
> >> >> >>
> >> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE
> program
> >> >> >>
> >> >> >> This is how it gets indexed
> >> >> >>
> >> >> >> [(0, []),
> >> >> >> (1, ['acr::hubble']),
> >> >> >> (2, ['constant']),
> >> >> >> (3, ['summary']),
> >> >> >> (4, []),
> >> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble'']),
> >> >> >> (6, ['acr::space', 'space']),
> >> >> >> (7, ['acr::telescope', 'telescope']),
> >> >> >> (8, ['program']),
> >> >> >>
> >> >> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >> >> >> telescope` token is on the first token which started the group
> >> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we
> also
> >> >> >> index the 'hubble' word there.
> >> >> >>
> >> >> >>  if you were to search for a phrase "HST program" it will be found
> >> >> >> because our search parser will search for ("HST ? ? program" |
> "Hubble
> >> >> >> Space Telescope program")
> >> >> >>
> >> >> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >> >> >>
> >> >> >> And because of those funny 'syn::' prefixes, we don't suffer from
> the
> >> >> >> other problem that Mike described -- "hst space" phrase search
> will
> >> >> >> NOT find this paper (and that is a correct behaviour)
> >> >> >>
> >> >> >> But all of this is possible only because lucene was indexing
> tokens
> >> >> >> with offsets that can be lower than the last emitted token; for
> >> >> >> example 'hubble space telescope' wil have offset 21-45; and the
> next
> >> >> >> emitted token "space" will have offset 28-33
> >> >> >>
> >> >> >> And it just works (lucene 6.x)
> >> >> >>
> >> >> >> Here is another proof with the appropriate verbiage ("crazy"):
> >> >> >>
> >> >> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >> >> >>
> >> >> >> Zero offsets have been working wonderfully for us so far. And I
> >> >> >> actually cannot imagine how it can work without them - i.e.
> without
> >> >> >> the ability to emit a token stream with offsets that are lower
> than
> >> >> >> the last seen token.
> >> >> >>
> >> >> >> I haven't tried SynonymFlatten filter, but because of this line
> in the
> >> >> >> DefaultIndexingChain - I'm convinced the flatten symbol is not
> going
> >> >> >> to do what we need (as seen in the example above)
> >> >> >>
> >> >> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >> >> >>
> >> >> >> What would you say? Is it a bug, is it not a bug but just some
> special
> >> >> >> usecase? If it is a special usecase, what do we need to do? Plug
> in
> >> >> >> our own indexing chain?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >>   -roman
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >> >>
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

Hi Mike,

I'm sorry, the problem all the time is inside related to a
word-delimiter filter factory. This is embarrassing but I have to
admit publicly and self-flagellate.

A word-delimiter filter is used to split tokens, these then are used
to find multi-token synonyms (hence the connection). In my desire to
simplify, I have omitted that detail while writing my first email.

I went to generate the stack trace:

```
assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
        "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE
TELESCOPE program"));```

stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM
offsetStart=38 offsetEnd=60
stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68

that worked, only the next one failed:

```assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
        "title", "MIT and anti de sitter space-time"));```


stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::massachusetts institute of technology pos=0
type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM
offsetStart=8 offsetEnd=28
stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM
offsetStart=8 offsetEnd=28
stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33

```325677 ERROR
(TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40])
[    ] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: Exception writing document id
605 to the index; possible analysis error: startOffset must be
non-negative, and endOffset must be >= startOffset, and offsets must
not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
field 'title'
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
at org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: startOffset must be
non-negative, and endOffset must be >= startOffset, and offsets must
not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for
field 'title'
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
... 61 more
```

Embarrassingly Yours,

  Roman



On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Can you share the full exception / stack trace that IndexWriter throws on that one *'d token in your first example?  I thought IndexWriter checks 1) startOffset >= last token's startOffset, and 2) endOffset >= startOffset for the current token.
>
> But you seem to be hitting an exception due to endOffset check across tokens, which I didn't remember/realize IW was enforcing.
>
> Could you share a small standalone test case showing the first example?  Maybe attach it to the issue (http://issues.apache.org/jira/browse/LUCENE-8776)?
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <ro...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Thanks for the question! And sorry for the delay, I haven't managed to
>> get to it yesterday. I have generated better output, marked with (*)
>> where it currently fails the first time and also included one extra
>> case to illustrate the PositionLength attribute.
>>
>> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>>         "title", "THE HUBBLE constant: a summary of the hubble space
>> telescope program"));
>>
>>
>> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
>> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
>> offsetStart=38 offsetEnd=60
>> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
>> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
>> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>>
>> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered
>> (the multi-token synonym emitted as a last token) it would fail as
>> well, because of the check for lastToken.beginOffset <
>> currentToken.beginOffset. Basically, any reordering would result in a
>> failure (unless offsets are trimmed).
>>
>>
>>
>> The following example has additional twist because of `space-time`;
>> the tokenizer first splits the word and generate two new tokens --
>> those alternative tokens are then used to find synonyms (space ==
>> universe)
>>
>> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>>         "title", "MIT and anti de sitter space-time"));
>>
>>
>> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
>> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=syn::massachusetts institute of technology posInc=0 posLen=1
>> type=SYNONYM offsetStart=0 offsetEnd=3
>> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
>> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
>> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
>> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
>> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
>> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
>> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
>> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>>
>> So far, all of these cases could be handled with the new position
>> length attribute. But let us look at a case where that would fail too.
>>
>> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>>         "title", "Massachusetts Institute of Technology and
>> antidesitter space-time"));
>>
>>
>> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
>> term=syn::massachusetts institute of technology posInc=0 posLen=4
>> type=SYNONYM offsetStart=0 offsetEnd=36
>> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
>> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
>> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
>> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
>> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
>> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
>> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
>> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
>> offsetStart=41 offsetEnd=59
>> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
>> offsetStart=41 offsetEnd=59
>> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
>> offsetStart=41 offsetEnd=59
>> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
>> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
>> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
>> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>>
>> Notice the posLen=4 of MIT; it would cover tokens `massachusetts
>> institute technology antidesitter` while offsets are still correct.
>>
>> This would, I think, affect not only highlighting, but also search
>> (which is, at least for us, more important). But I can imagine that in
>> more NLP-related domains, ability to identify the source of a
>> transformation could be more than a highlighting problem.
>>
>> Admittedly, most users would not care to notice, but it might be
>> important to some. Fundamentally, I think, the problem translates to
>> inability to reconstruct the DAG graph (under certain circumstances)
>> because of the lost pieces of information.
>>
>> ~roman
>>
>> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> >
>> > Hi Roman,
>> >
>> > Sorry for the late reply!
>> >
>> > I think there remains substantial confusion about multi-token synonyms and IW's enforcement of offsets.  It really is worth thoroughly iterating/understanding your examples so we can get to the bottom of this.  It looks to me it is possible to emit tokens whose offsets do not go backwards and that properly model your example synonyms, so I do not yet see what the problem is.  Maybe I am being blind/tired ...
>> >
>> > What do you mean by pos=2, pos=0, etc.?  I think that is really the position increment?  Can you re-do the examples with posInc instead?  (Alternatively, you could keep "pos" but make it the absolute position, not the increment?).
>> >
>> > Could you also add posLength to each token?  This helps (me?) visualize the resulting graph, even though IW does not enforce it today.
>> >
>> > Looking at your first example, "THE HUBBLE constant: a summary of the hubble space telescope program", it looks to me like those tokens would all be accepted by IW's checks as they are?  startOffset never goes backwards, and for every token, endOffset >= startOffset.  Where in that first example does IW throw an exception?  Maybe insert a "** IW fails here" under the problematic token?  Or, maybe write a simple test case using e.g. CannedTokenStream?
>> >
>> > Your second example should also be fine, and not at all weird, but could you enumerate it into the specific tokens with posInc, posLength, start/end offset, "** IW fails here", etc., so we have a concrete example to discuss?
>> >
>> > Lucene's TokenStreams are really serializing a directed acyclic graph (DAG), in a specific order, one transition at a time.  Ironically/strangely, it is similar to the graph that git history maintains, and how "git log" then serializes that graph into an ordered series of transitions.  The simple int position in Lucene's TokenStream corresponds to git's githashes, to uniquely identify each "node", though, I do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a timestamp?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>> >>
>> >> Hi Mike,
>> >>
>> >> Yes, they are not zero offsets - I was instinctively avoiding
>> >> "negative offsets"; but they are indeed backward offsets.
>> >>
>> >> Here is the token stream as produced by the analyzer chain indexing
>> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >>
>> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>> >>
>> >> Sometimes, we'll even have a situation when synonyms overlap: for
>> >> example "anti de sitter space time"
>> >>
>> >> "anti de sitter space time" -> "antidesitter space" (one token
>> >> spanning offsets 0-26; it gets emitted with the first token "anti"
>> >> right now)
>> >> "space time" -> "spacetime" (synonym 16-26)
>> >> "space" -> "universe" (25-26)
>> >>
>> >> Yes, weird, but useful if people want to search for `universe NEAR
>> >> anti` -- but another usecase which would be prohibited by the "new"
>> >> rule.
>> >>
>> >> DefaultIndexingChain checks new token offset against the last emitted
>> >> token, so I don't see a way to emit the multi-token synonym with
>> >> offsetts spanning multiple tokens if even one of these tokens was
>> >> already emitted. And the complement is equally true: if multi-token is
>> >> emitted as last of the group - it trips over `startOffset <
>> >> invertState.lastStartOffset`
>> >>
>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>
>> >>
>> >>   -roman
>> >>
>> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >> >
>> >> > Hi Roman,
>> >> >
>> >> > Hmm, this is all very tricky!
>> >> >
>> >> > First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>> >> >
>> >> > Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>> >> >
>> >> > So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>> >> >
>> >> > Mike McCandless
>> >> >
>> >> > http://blog.mikemccandless.com
>> >> >
>> >> >
>> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>> >> >>
>> >> >> Hello devs,
>> >> >>
>> >> >> I wanted to create an issue but the helpful message in red letters
>> >> >> reminded me to ask first.
>> >> >>
>> >> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>> >> >> was introduced in LUCENE-7626
>> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >> >>
>> >> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>> >> >> made the change which made me automatically doubt myself. I must be
>> >> >> wrong, hell, I was living in sin the past 5 years!
>> >> >>
>> >> >> Sadly, we have been indexing and searching large volumes of data
>> >> >> without any corruption in index whatsover, but also without this new
>> >> >> change:
>> >> >>
>> >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >> >>
>> >> >> With that change, our multi-token synonyms house of cards is falling.
>> >> >>
>> >> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>> >> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >> >>
>> >> >> Recommended way to index multi-token synonyms appears to be this:
>> >> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >> >>
>> >> >> BUT, but! We don't want to place multi-token synonym into the same
>> >> >> position as the other words. We want to preserve their positions! We
>> >> >> want to preserve informaiton about offsets!
>> >> >>
>> >> >> Here is an example:
>> >> >>
>> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >> >>
>> >> >> This is how it gets indexed
>> >> >>
>> >> >> [(0, []),
>> >> >> (1, ['acr::hubble']),
>> >> >> (2, ['constant']),
>> >> >> (3, ['summary']),
>> >> >> (4, []),
>> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>> >> >> (6, ['acr::space', 'space']),
>> >> >> (7, ['acr::telescope', 'telescope']),
>> >> >> (8, ['program']),
>> >> >>
>> >> >> Notice the position 5 - multi-token synonym `syn::hubble space
>> >> >> telescope` token is on the first token which started the group
>> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >> >> index the 'hubble' word there.
>> >> >>
>> >> >>  if you were to search for a phrase "HST program" it will be found
>> >> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >> >> Space Telescope program")
>> >> >>
>> >> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>> >> >>
>> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >> >> other problem that Mike described -- "hst space" phrase search will
>> >> >> NOT find this paper (and that is a correct behaviour)
>> >> >>
>> >> >> But all of this is possible only because lucene was indexing tokens
>> >> >> with offsets that can be lower than the last emitted token; for
>> >> >> example 'hubble space telescope' wil have offset 21-45; and the next
>> >> >> emitted token "space" will have offset 28-33
>> >> >>
>> >> >> And it just works (lucene 6.x)
>> >> >>
>> >> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >> >>
>> >> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >> >>
>> >> >> Zero offsets have been working wonderfully for us so far. And I
>> >> >> actually cannot imagine how it can work without them - i.e. without
>> >> >> the ability to emit a token stream with offsets that are lower than
>> >> >> the last seen token.
>> >> >>
>> >> >> I haven't tried SynonymFlatten filter, but because of this line in the
>> >> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> >> >> to do what we need (as seen in the example above)
>> >> >>
>> >> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >> >>
>> >> >> What would you say? Is it a bug, is it not a bug but just some special
>> >> >> usecase? If it is a special usecase, what do we need to do? Plug in
>> >> >> our own indexing chain?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>   -roman
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Roman,

Can you share the full exception / stack trace that IndexWriter throws on
that one *'d token in your first example?  I thought IndexWriter checks 1)
startOffset >= last token's startOffset, and 2) endOffset >= startOffset
for the current token.

But you seem to be hitting an exception due to endOffset check across
tokens, which I didn't remember/realize IW was enforcing.

Could you share a small standalone test case showing the first example?
Maybe attach it to the issue (
http://issues.apache.org/jira/browse/LUCENE-8776)?

Thanks,

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla <ro...@gmail.com> wrote:

> Hi Mike,
>
> Thanks for the question! And sorry for the delay, I haven't managed to
> get to it yesterday. I have generated better output, marked with (*)
> where it currently fails the first time and also included one extra
> case to illustrate the PositionLength attribute.
>
> assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
>         "title", "THE HUBBLE constant: a summary of the hubble space
> telescope program"));
>
>
> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
> offsetStart=38 offsetEnd=60
> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>
> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered
> (the multi-token synonym emitted as a last token) it would fail as
> well, because of the check for lastToken.beginOffset <
> currentToken.beginOffset. Basically, any reordering would result in a
> failure (unless offsets are trimmed).
>
>
>
> The following example has additional twist because of `space-time`;
> the tokenizer first splits the word and generate two new tokens --
> those alternative tokens are then used to find synonyms (space ==
> universe)
>
> assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
>         "title", "MIT and anti de sitter space-time"));
>
>
> term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=syn::massachusetts institute of technology posInc=0 posLen=1
> type=SYNONYM offsetStart=0 offsetEnd=3
> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
> offsetStart=8 offsetEnd=28
> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23
> offsetEnd=28
> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
> term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33
>
> So far, all of these cases could be handled with the new position
> length attribute. But let us look at a case where that would fail too.
>
> assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
>         "title", "Massachusetts Institute of Technology and
> antidesitter space-time"));
>
>
> term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
> term=syn::massachusetts institute of technology posInc=0 posLen=4
> type=SYNONYM offsetStart=0 offsetEnd=36
> term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
> term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
> term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
> term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
> term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
> term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
> term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
> term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
> offsetStart=41 offsetEnd=59
> term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54
> offsetEnd=59
> term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
> term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
>
> Notice the posLen=4 of MIT; it would cover tokens `massachusetts
> institute technology antidesitter` while offsets are still correct.
>
> This would, I think, affect not only highlighting, but also search
> (which is, at least for us, more important). But I can imagine that in
> more NLP-related domains, ability to identify the source of a
> transformation could be more than a highlighting problem.
>
> Admittedly, most users would not care to notice, but it might be
> important to some. Fundamentally, I think, the problem translates to
> inability to reconstruct the DAG graph (under certain circumstances)
> because of the lost pieces of information.
>
> ~roman
>
> On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
> <lu...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Sorry for the late reply!
> >
> > I think there remains substantial confusion about multi-token synonyms
> and IW's enforcement of offsets.  It really is worth thoroughly
> iterating/understanding your examples so we can get to the bottom of this.
> It looks to me it is possible to emit tokens whose offsets do not go
> backwards and that properly model your example synonyms, so I do not yet
> see what the problem is.  Maybe I am being blind/tired ...
> >
> > What do you mean by pos=2, pos=0, etc.?  I think that is really the
> position increment?  Can you re-do the examples with posInc instead?
> (Alternatively, you could keep "pos" but make it the absolute position, not
> the increment?).
> >
> > Could you also add posLength to each token?  This helps (me?) visualize
> the resulting graph, even though IW does not enforce it today.
> >
> > Looking at your first example, "THE HUBBLE constant: a summary of the
> hubble space telescope program", it looks to me like those tokens would all
> be accepted by IW's checks as they are?  startOffset never goes backwards,
> and for every token, endOffset >= startOffset.  Where in that first example
> does IW throw an exception?  Maybe insert a "** IW fails here" under the
> problematic token?  Or, maybe write a simple test case using e.g.
> CannedTokenStream?
> >
> > Your second example should also be fine, and not at all weird, but could
> you enumerate it into the specific tokens with posInc, posLength, start/end
> offset, "** IW fails here", etc., so we have a concrete example to discuss?
> >
> > Lucene's TokenStreams are really serializing a directed acyclic graph
> (DAG), in a specific order, one transition at a time.
> Ironically/strangely, it is similar to the graph that git history
> maintains, and how "git log" then serializes that graph into an ordered
> series of transitions.  The simple int position in Lucene's TokenStream
> corresponds to git's githashes, to uniquely identify each "node", though, I
> do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a
> timestamp?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com>
> wrote:
> >>
> >> Hi Mike,
> >>
> >> Yes, they are not zero offsets - I was instinctively avoiding
> >> "negative offsets"; but they are indeed backward offsets.
> >>
> >> Here is the token stream as produced by the analyzer chain indexing
> >> "THE HUBBLE constant: a summary of the hubble space telescope program"
> >>
> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >>
> >> Sometimes, we'll even have a situation when synonyms overlap: for
> >> example "anti de sitter space time"
> >>
> >> "anti de sitter space time" -> "antidesitter space" (one token
> >> spanning offsets 0-26; it gets emitted with the first token "anti"
> >> right now)
> >> "space time" -> "spacetime" (synonym 16-26)
> >> "space" -> "universe" (25-26)
> >>
> >> Yes, weird, but useful if people want to search for `universe NEAR
> >> anti` -- but another usecase which would be prohibited by the "new"
> >> rule.
> >>
> >> DefaultIndexingChain checks new token offset against the last emitted
> >> token, so I don't see a way to emit the multi-token synonym with
> >> offsetts spanning multiple tokens if even one of these tokens was
> >> already emitted. And the complement is equally true: if multi-token is
> >> emitted as last of the group - it trips over `startOffset <
> >> invertState.lastStartOffset`
> >>
> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >>
> >>   -roman
> >>
> >> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >> >
> >> > Hi Roman,
> >> >
> >> > Hmm, this is all very tricky!
> >> >
> >> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> offsets" that your analysis chain is trying to produce?
> >> >
> >> > Second, in your first example, if you output the tokens in the right
> order, they would not violate the "offsets do not go backwards" check in
> IndexWriter?  I thought IndexWriter is just checking that the startOffset
> for a token is not lower than the previous token's startOffset?  (And that
> the token's endOffset is not lower than its startOffset).
> >> >
> >> > So I am confused why your first example is tripping up on IW's offset
> checks.  Could you maybe redo the example, listing single token per line
> with the start/end offsets they are producing?
> >> >
> >> > Mike McCandless
> >> >
> >> > http://blog.mikemccandless.com
> >> >
> >> >
> >> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >> >>
> >> >> Hello devs,
> >> >>
> >> >> I wanted to create an issue but the helpful message in red letters
> >> >> reminded me to ask first.
> >> >>
> >> >> While porting from lucene 6.x to 7x I'm struggling with a change that
> >> >> was introduced in LUCENE-7626
> >> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >> >>
> >> >> It is believed that zero offset tokens are bad bad - Mike McCandles
> >> >> made the change which made me automatically doubt myself. I must be
> >> >> wrong, hell, I was living in sin the past 5 years!
> >> >>
> >> >> Sadly, we have been indexing and searching large volumes of data
> >> >> without any corruption in index whatsover, but also without this new
> >> >> change:
> >> >>
> >> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >> >>
> >> >> With that change, our multi-token synonyms house of cards is falling.
> >> >>
> >> >> Mike has this wonderful blogpost explaining troubles with
> multi-token synonyms:
> >> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >> >>
> >> >> Recommended way to index multi-token synonyms appears to be this:
> >> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >> >>
> >> >> BUT, but! We don't want to place multi-token synonym into the same
> >> >> position as the other words. We want to preserve their positions! We
> >> >> want to preserve informaiton about offsets!
> >> >>
> >> >> Here is an example:
> >> >>
> >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE
> program
> >> >>
> >> >> This is how it gets indexed
> >> >>
> >> >> [(0, []),
> >> >> (1, ['acr::hubble']),
> >> >> (2, ['constant']),
> >> >> (3, ['summary']),
> >> >> (4, []),
> >> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble'']),
> >> >> (6, ['acr::space', 'space']),
> >> >> (7, ['acr::telescope', 'telescope']),
> >> >> (8, ['program']),
> >> >>
> >> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >> >> telescope` token is on the first token which started the group
> >> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >> >> index the 'hubble' word there.
> >> >>
> >> >>  if you were to search for a phrase "HST program" it will be found
> >> >> because our search parser will search for ("HST ? ? program" |
> "Hubble
> >> >> Space Telescope program")
> >> >>
> >> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >> >>
> >> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> >> other problem that Mike described -- "hst space" phrase search will
> >> >> NOT find this paper (and that is a correct behaviour)
> >> >>
> >> >> But all of this is possible only because lucene was indexing tokens
> >> >> with offsets that can be lower than the last emitted token; for
> >> >> example 'hubble space telescope' wil have offset 21-45; and the next
> >> >> emitted token "space" will have offset 28-33
> >> >>
> >> >> And it just works (lucene 6.x)
> >> >>
> >> >> Here is another proof with the appropriate verbiage ("crazy"):
> >> >>
> >> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >> >>
> >> >> Zero offsets have been working wonderfully for us so far. And I
> >> >> actually cannot imagine how it can work without them - i.e. without
> >> >> the ability to emit a token stream with offsets that are lower than
> >> >> the last seen token.
> >> >>
> >> >> I haven't tried SynonymFlatten filter, but because of this line in
> the
> >> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >> >> to do what we need (as seen in the example above)
> >> >>
> >> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >> >>
> >> >> What would you say? Is it a bug, is it not a bug but just some
> special
> >> >> usecase? If it is a special usecase, what do we need to do? Plug in
> >> >> our own indexing chain?
> >> >>
> >> >> Thanks!
> >> >>
> >> >>   -roman
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

Hi Mike,

Thanks for the question! And sorry for the delay, I haven't managed to
get to it yesterday. I have generated better output, marked with (*)
where it currently fails the first time and also included one extra
case to illustrate the PositionLength attribute.

assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
        "title", "THE HUBBLE constant: a summary of the hubble space
telescope program"));


term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
offsetStart=38 offsetEnd=60
term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
* term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68

* - fails because of offsetEnd < lastToken.offsetEnd; If reordered
(the multi-token synonym emitted as a last token) it would fail as
well, because of the check for lastToken.beginOffset <
currentToken.beginOffset. Basically, any reordering would result in a
failure (unless offsets are trimmed).



The following example has additional twist because of `space-time`;
the tokenizer first splits the word and generate two new tokens --
those alternative tokens are then used to find synonyms (space ==
universe)

assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
        "title", "MIT and anti de sitter space-time"));


term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
term=syn::massachusetts institute of technology posInc=0 posLen=1
type=SYNONYM offsetStart=0 offsetEnd=3
term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
offsetStart=8 offsetEnd=28
term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
offsetStart=8 offsetEnd=28
term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
offsetStart=8 offsetEnd=28
* term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33

So far, all of these cases could be handled with the new position
length attribute. But let us look at a case where that would fail too.

assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
        "title", "Massachusetts Institute of Technology and
antidesitter space-time"));


term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
term=syn::massachusetts institute of technology posInc=0 posLen=4
type=SYNONYM offsetStart=0 offsetEnd=36
term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM
offsetStart=41 offsetEnd=59
term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM
offsetStart=41 offsetEnd=59
term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM
offsetStart=41 offsetEnd=59
term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64

Notice the posLen=4 of MIT; it would cover tokens `massachusetts
institute technology antidesitter` while offsets are still correct.

This would, I think, affect not only highlighting, but also search
(which is, at least for us, more important). But I can imagine that in
more NLP-related domains, ability to identify the source of a
transformation could be more than a highlighting problem.

Admittedly, most users would not care to notice, but it might be
important to some. Fundamentally, I think, the problem translates to
inability to reconstruct the DAG graph (under certain circumstances)
because of the lost pieces of information.

~roman

On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Sorry for the late reply!
>
> I think there remains substantial confusion about multi-token synonyms and IW's enforcement of offsets.  It really is worth thoroughly iterating/understanding your examples so we can get to the bottom of this.  It looks to me it is possible to emit tokens whose offsets do not go backwards and that properly model your example synonyms, so I do not yet see what the problem is.  Maybe I am being blind/tired ...
>
> What do you mean by pos=2, pos=0, etc.?  I think that is really the position increment?  Can you re-do the examples with posInc instead?  (Alternatively, you could keep "pos" but make it the absolute position, not the increment?).
>
> Could you also add posLength to each token?  This helps (me?) visualize the resulting graph, even though IW does not enforce it today.
>
> Looking at your first example, "THE HUBBLE constant: a summary of the hubble space telescope program", it looks to me like those tokens would all be accepted by IW's checks as they are?  startOffset never goes backwards, and for every token, endOffset >= startOffset.  Where in that first example does IW throw an exception?  Maybe insert a "** IW fails here" under the problematic token?  Or, maybe write a simple test case using e.g. CannedTokenStream?
>
> Your second example should also be fine, and not at all weird, but could you enumerate it into the specific tokens with posInc, posLength, start/end offset, "** IW fails here", etc., so we have a concrete example to discuss?
>
> Lucene's TokenStreams are really serializing a directed acyclic graph (DAG), in a specific order, one transition at a time.  Ironically/strangely, it is similar to the graph that git history maintains, and how "git log" then serializes that graph into an ordered series of transitions.  The simple int position in Lucene's TokenStream corresponds to git's githashes, to uniquely identify each "node", though, I do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a timestamp?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Yes, they are not zero offsets - I was instinctively avoiding
>> "negative offsets"; but they are indeed backward offsets.
>>
>> Here is the token stream as produced by the analyzer chain indexing
>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>
>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>
>> Sometimes, we'll even have a situation when synonyms overlap: for
>> example "anti de sitter space time"
>>
>> "anti de sitter space time" -> "antidesitter space" (one token
>> spanning offsets 0-26; it gets emitted with the first token "anti"
>> right now)
>> "space time" -> "spacetime" (synonym 16-26)
>> "space" -> "universe" (25-26)
>>
>> Yes, weird, but useful if people want to search for `universe NEAR
>> anti` -- but another usecase which would be prohibited by the "new"
>> rule.
>>
>> DefaultIndexingChain checks new token offset against the last emitted
>> token, so I don't see a way to emit the multi-token synonym with
>> offsetts spanning multiple tokens if even one of these tokens was
>> already emitted. And the complement is equally true: if multi-token is
>> emitted as last of the group - it trips over `startOffset <
>> invertState.lastStartOffset`
>>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>
>>
>>   -roman
>>
>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> >
>> > Hi Roman,
>> >
>> > Hmm, this is all very tricky!
>> >
>> > First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>> >
>> > Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>> >
>> > So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>> >>
>> >> Hello devs,
>> >>
>> >> I wanted to create an issue but the helpful message in red letters
>> >> reminded me to ask first.
>> >>
>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>> >> was introduced in LUCENE-7626
>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >>
>> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>> >> made the change which made me automatically doubt myself. I must be
>> >> wrong, hell, I was living in sin the past 5 years!
>> >>
>> >> Sadly, we have been indexing and searching large volumes of data
>> >> without any corruption in index whatsover, but also without this new
>> >> change:
>> >>
>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >>
>> >> With that change, our multi-token synonyms house of cards is falling.
>> >>
>> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >>
>> >> Recommended way to index multi-token synonyms appears to be this:
>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >>
>> >> BUT, but! We don't want to place multi-token synonym into the same
>> >> position as the other words. We want to preserve their positions! We
>> >> want to preserve informaiton about offsets!
>> >>
>> >> Here is an example:
>> >>
>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >>
>> >> This is how it gets indexed
>> >>
>> >> [(0, []),
>> >> (1, ['acr::hubble']),
>> >> (2, ['constant']),
>> >> (3, ['summary']),
>> >> (4, []),
>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>> >> (6, ['acr::space', 'space']),
>> >> (7, ['acr::telescope', 'telescope']),
>> >> (8, ['program']),
>> >>
>> >> Notice the position 5 - multi-token synonym `syn::hubble space
>> >> telescope` token is on the first token which started the group
>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >> index the 'hubble' word there.
>> >>
>> >>  if you were to search for a phrase "HST program" it will be found
>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >> Space Telescope program")
>> >>
>> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>> >>
>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >> other problem that Mike described -- "hst space" phrase search will
>> >> NOT find this paper (and that is a correct behaviour)
>> >>
>> >> But all of this is possible only because lucene was indexing tokens
>> >> with offsets that can be lower than the last emitted token; for
>> >> example 'hubble space telescope' wil have offset 21-45; and the next
>> >> emitted token "space" will have offset 28-33
>> >>
>> >> And it just works (lucene 6.x)
>> >>
>> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >>
>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >>
>> >> Zero offsets have been working wonderfully for us so far. And I
>> >> actually cannot imagine how it can work without them - i.e. without
>> >> the ability to emit a token stream with offsets that are lower than
>> >> the last seen token.
>> >>
>> >> I haven't tried SynonymFlatten filter, but because of this line in the
>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> >> to do what we need (as seen in the example above)
>> >>
>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>
>> >> What would you say? Is it a bug, is it not a bug but just some special
>> >> usecase? If it is a special usecase, what do we need to do? Plug in
>> >> our own indexing chain?
>> >>
>> >> Thanks!
>> >>
>> >>   -roman
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Roman,

Sorry for the late reply!

I think there remains substantial confusion about multi-token synonyms and
IW's enforcement of offsets.  It really is worth thoroughly
iterating/understanding your examples so we can get to the bottom of this.
It looks to me it is possible to emit tokens whose offsets do not go
backwards and that properly model your example synonyms, so I do not yet
see what the problem is.  Maybe I am being blind/tired ...

What do you mean by pos=2, pos=0, etc.?  I think that is really the
position increment?  Can you re-do the examples with posInc instead?
(Alternatively, you could keep "pos" but make it the absolute position, not
the increment?).

Could you also add posLength to each token?  This helps (me?) visualize the
resulting graph, even though IW does not enforce it today.

Looking at your first example, "THE HUBBLE constant: a summary of the
hubble space telescope program", it looks to me like those tokens would all
be accepted by IW's checks as they are?  startOffset never goes backwards,
and for every token, endOffset >= startOffset.  Where in that first example
does IW throw an exception?  Maybe insert a "** IW fails here" under the
problematic token?  Or, maybe write a simple test case using e.g.
CannedTokenStream?

Your second example should also be fine, and not at all weird, but could
you enumerate it into the specific tokens with posInc, posLength, start/end
offset, "** IW fails here", etc., so we have a concrete example to discuss?

Lucene's TokenStreams are really serializing a directed acyclic graph
(DAG), in a specific order, one transition at a time.
Ironically/strangely, it is similar to the graph that git history
maintains, and how "git log" then serializes that graph into an ordered
series of transitions.  The simple int position in Lucene's TokenStream
corresponds to git's githashes, to uniquely identify each "node", though, I
do not think there is an analog in git to Lucene's offsets.  Hmm, maybe a
timestamp?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:

> Hi Mike,
>
> Yes, they are not zero offsets - I was instinctively avoiding
> "negative offsets"; but they are indeed backward offsets.
>
> Here is the token stream as produced by the analyzer chain indexing
> "THE HUBBLE constant: a summary of the hubble space telescope program"
>
> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>
> Sometimes, we'll even have a situation when synonyms overlap: for
> example "anti de sitter space time"
>
> "anti de sitter space time" -> "antidesitter space" (one token
> spanning offsets 0-26; it gets emitted with the first token "anti"
> right now)
> "space time" -> "spacetime" (synonym 16-26)
> "space" -> "universe" (25-26)
>
> Yes, weird, but useful if people want to search for `universe NEAR
> anti` -- but another usecase which would be prohibited by the "new"
> rule.
>
> DefaultIndexingChain checks new token offset against the last emitted
> token, so I don't see a way to emit the multi-token synonym with
> offsetts spanning multiple tokens if even one of these tokens was
> already emitted. And the complement is equally true: if multi-token is
> emitted as last of the group - it trips over `startOffset <
> invertState.lastStartOffset`
>
>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>
>
>   -roman
>
> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> <lu...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Hmm, this is all very tricky!
> >
> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> offsets" that your analysis chain is trying to produce?
> >
> > Second, in your first example, if you output the tokens in the right
> order, they would not violate the "offsets do not go backwards" check in
> IndexWriter?  I thought IndexWriter is just checking that the startOffset
> for a token is not lower than the previous token's startOffset?  (And that
> the token's endOffset is not lower than its startOffset).
> >
> > So I am confused why your first example is tripping up on IW's offset
> checks.  Could you maybe redo the example, listing single token per line
> with the start/end offsets they are producing?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >>
> >> Hello devs,
> >>
> >> I wanted to create an issue but the helpful message in red letters
> >> reminded me to ask first.
> >>
> >> While porting from lucene 6.x to 7x I'm struggling with a change that
> >> was introduced in LUCENE-7626
> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >>
> >> It is believed that zero offset tokens are bad bad - Mike McCandles
> >> made the change which made me automatically doubt myself. I must be
> >> wrong, hell, I was living in sin the past 5 years!
> >>
> >> Sadly, we have been indexing and searching large volumes of data
> >> without any corruption in index whatsover, but also without this new
> >> change:
> >>
> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >>
> >> With that change, our multi-token synonyms house of cards is falling.
> >>
> >> Mike has this wonderful blogpost explaining troubles with multi-token
> synonyms:
> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>
> >> Recommended way to index multi-token synonyms appears to be this:
> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >>
> >> BUT, but! We don't want to place multi-token synonym into the same
> >> position as the other words. We want to preserve their positions! We
> >> want to preserve informaiton about offsets!
> >>
> >> Here is an example:
> >>
> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
> >>
> >> This is how it gets indexed
> >>
> >> [(0, []),
> >> (1, ['acr::hubble']),
> >> (2, ['constant']),
> >> (3, ['summary']),
> >> (4, []),
> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble'']),
> >> (6, ['acr::space', 'space']),
> >> (7, ['acr::telescope', 'telescope']),
> >> (8, ['program']),
> >>
> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >> telescope` token is on the first token which started the group
> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >> index the 'hubble' word there.
> >>
> >>  if you were to search for a phrase "HST program" it will be found
> >> because our search parser will search for ("HST ? ? program" | "Hubble
> >> Space Telescope program")
> >>
> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >>
> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> other problem that Mike described -- "hst space" phrase search will
> >> NOT find this paper (and that is a correct behaviour)
> >>
> >> But all of this is possible only because lucene was indexing tokens
> >> with offsets that can be lower than the last emitted token; for
> >> example 'hubble space telescope' wil have offset 21-45; and the next
> >> emitted token "space" will have offset 28-33
> >>
> >> And it just works (lucene 6.x)
> >>
> >> Here is another proof with the appropriate verbiage ("crazy"):
> >>
> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >>
> >> Zero offsets have been working wonderfully for us so far. And I
> >> actually cannot imagine how it can work without them - i.e. without
> >> the ability to emit a token stream with offsets that are lower than
> >> the last seen token.
> >>
> >> I haven't tried SynonymFlatten filter, but because of this line in the
> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >> to do what we need (as seen in the example above)
> >>
> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >> What would you say? Is it a bug, is it not a bug but just some special
> >> usecase? If it is a special usecase, what do we need to do? Plug in
> >> our own indexing chain?
> >>
> >> Thanks!
> >>
> >>   -roman
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

oh,thanks! that saves everybody some time. I have commented in there,
pleading to be allowed to do something - if that proposal sounds even
little bit reasonable, please consider amplifying the signal

On Mon, Aug 10, 2020 at 4:22 PM David Smiley <ds...@apache.org> wrote:
>
> There already is one: https://issues.apache.org/jira/browse/LUCENE-8776
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla <ro...@gmail.com> wrote:
>>
>> I'll have to somehow find a solution for this situation, giving up
>> offsets seems like too big a price to pay, I see that overriding
>> DefaultIndexingChain is not exactly easy -- the only thing I can think
>> of is to just trick the classloader into giving it a different version
>> of the chain (praying this can be done without compromising security,
>> I have not followed JDK evolutions for some time...) - aside from
>> forking lucene and editing that; which I decidedly don't want to do
>> (monkey-patching it, ok, i can live with that... :-))
>>
>> It *seems* to me that the original reason for negative offset checks
>> stemmed from the fact that vint could have been written (and possibly
>> vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738
>>
>> but the underlying issue and some of the patches seem to have been
>> addressing those problems; but a much shorter version of the patch was
>> committed -- despite the perf results not being indicative (i.e. it
>> could have been good with the longer patch) -- but to really
>> understand it, one would have to spend more than 10mins reading the
>> comments
>>
>> Further to the point, I think negative offsets can be produced only on
>> the very first token, unless there is a bug in a filter (there was/is
>> a separate check for that in 6x and perhaps it is still there in 7x).
>> That would be much less restrictive than the current condition which
>> disallows all backward offsets. We never ran into an index corruption
>> in lucene 4-6x, so I really wonder if the "forbid all backwards
>> offsets" approach might be too restrictive.
>>
>> Looks like I should create an issue...
>>
>> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gu...@gmail.com> wrote:
>> >
>> > I've had a nearly identical experience to what Dave describes, I also chafe under this restriction.
>> >
>> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley <ds...@apache.org> wrote:
>> >>
>> >> I sympathize with your pain, Roman.
>> >>
>> >> It appears we can't really do index-time multi-word synonyms because of the offset ordering rule.  But it's not just synonyms, it's other forms of multi-token expansion.  Where I work, I've seen an interesting approach to mixed language text analysis in which a sophisticated Tokenizer effectively re-tokenizes an input multiple ways by producing a token stream that is a concatenation of different interpretations of the input.  On a Lucene upgrade, we had to "coarsen" the offsets to the point of having highlights that point to a whole sentence instead of the words in that sentence :-(.  I need to do something to fix this; I'm trying hard to resist modifying our Lucene fork for this constraint.  Maybe instead of concatenating, it might be interleaved / overlapped but the interpretations aren't necessarily aligned to make this possible without risking breaking position-sensitive queries.
>> >>
>> >> So... I'm not a fan of this constraint on offsets.
>> >>
>> >> ~ David Smiley
>> >> Apache Lucene/Solr Search Developer
>> >> http://www.linkedin.com/in/davidwsmiley
>> >>
>> >>
>> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>> >>>
>> >>> Hi Mike,
>> >>>
>> >>> Yes, they are not zero offsets - I was instinctively avoiding
>> >>> "negative offsets"; but they are indeed backward offsets.
>> >>>
>> >>> Here is the token stream as produced by the analyzer chain indexing
>> >>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >>>
>> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> >>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> >>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> >>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> >>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> >>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> >>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> >>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>> >>>
>> >>> Sometimes, we'll even have a situation when synonyms overlap: for
>> >>> example "anti de sitter space time"
>> >>>
>> >>> "anti de sitter space time" -> "antidesitter space" (one token
>> >>> spanning offsets 0-26; it gets emitted with the first token "anti"
>> >>> right now)
>> >>> "space time" -> "spacetime" (synonym 16-26)
>> >>> "space" -> "universe" (25-26)
>> >>>
>> >>> Yes, weird, but useful if people want to search for `universe NEAR
>> >>> anti` -- but another usecase which would be prohibited by the "new"
>> >>> rule.
>> >>>
>> >>> DefaultIndexingChain checks new token offset against the last emitted
>> >>> token, so I don't see a way to emit the multi-token synonym with
>> >>> offsetts spanning multiple tokens if even one of these tokens was
>> >>> already emitted. And the complement is equally true: if multi-token is
>> >>> emitted as last of the group - it trips over `startOffset <
>> >>> invertState.lastStartOffset`
>> >>>
>> >>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>>
>> >>>
>> >>>   -roman
>> >>>
>> >>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> >>> <lu...@mikemccandless.com> wrote:
>> >>> >
>> >>> > Hi Roman,
>> >>> >
>> >>> > Hmm, this is all very tricky!
>> >>> >
>> >>> > First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>> >>> >
>> >>> > Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>> >>> >
>> >>> > So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>> >>> >
>> >>> > Mike McCandless
>> >>> >
>> >>> > http://blog.mikemccandless.com
>> >>> >
>> >>> >
>> >>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>> >>> >>
>> >>> >> Hello devs,
>> >>> >>
>> >>> >> I wanted to create an issue but the helpful message in red letters
>> >>> >> reminded me to ask first.
>> >>> >>
>> >>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>> >>> >> was introduced in LUCENE-7626
>> >>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >>> >>
>> >>> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>> >>> >> made the change which made me automatically doubt myself. I must be
>> >>> >> wrong, hell, I was living in sin the past 5 years!
>> >>> >>
>> >>> >> Sadly, we have been indexing and searching large volumes of data
>> >>> >> without any corruption in index whatsover, but also without this new
>> >>> >> change:
>> >>> >>
>> >>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >>> >>
>> >>> >> With that change, our multi-token synonyms house of cards is falling.
>> >>> >>
>> >>> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>> >>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >>> >>
>> >>> >> Recommended way to index multi-token synonyms appears to be this:
>> >>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >>> >>
>> >>> >> BUT, but! We don't want to place multi-token synonym into the same
>> >>> >> position as the other words. We want to preserve their positions! We
>> >>> >> want to preserve informaiton about offsets!
>> >>> >>
>> >>> >> Here is an example:
>> >>> >>
>> >>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >>> >>
>> >>> >> This is how it gets indexed
>> >>> >>
>> >>> >> [(0, []),
>> >>> >> (1, ['acr::hubble']),
>> >>> >> (2, ['constant']),
>> >>> >> (3, ['summary']),
>> >>> >> (4, []),
>> >>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>> >>> >> (6, ['acr::space', 'space']),
>> >>> >> (7, ['acr::telescope', 'telescope']),
>> >>> >> (8, ['program']),
>> >>> >>
>> >>> >> Notice the position 5 - multi-token synonym `syn::hubble space
>> >>> >> telescope` token is on the first token which started the group
>> >>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >>> >> index the 'hubble' word there.
>> >>> >>
>> >>> >>  if you were to search for a phrase "HST program" it will be found
>> >>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >>> >> Space Telescope program")
>> >>> >>
>> >>> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>> >>> >>
>> >>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >>> >> other problem that Mike described -- "hst space" phrase search will
>> >>> >> NOT find this paper (and that is a correct behaviour)
>> >>> >>
>> >>> >> But all of this is possible only because lucene was indexing tokens
>> >>> >> with offsets that can be lower than the last emitted token; for
>> >>> >> example 'hubble space telescope' wil have offset 21-45; and the next
>> >>> >> emitted token "space" will have offset 28-33
>> >>> >>
>> >>> >> And it just works (lucene 6.x)
>> >>> >>
>> >>> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >>> >>
>> >>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >>> >>
>> >>> >> Zero offsets have been working wonderfully for us so far. And I
>> >>> >> actually cannot imagine how it can work without them - i.e. without
>> >>> >> the ability to emit a token stream with offsets that are lower than
>> >>> >> the last seen token.
>> >>> >>
>> >>> >> I haven't tried SynonymFlatten filter, but because of this line in the
>> >>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> >>> >> to do what we need (as seen in the example above)
>> >>> >>
>> >>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>> >>
>> >>> >> What would you say? Is it a bug, is it not a bug but just some special
>> >>> >> usecase? If it is a special usecase, what do we need to do? Plug in
>> >>> >> our own indexing chain?
>> >>> >>
>> >>> >> Thanks!
>> >>> >>
>> >>> >>   -roman
>> >>> >>
>> >>> >> ---------------------------------------------------------------------
>> >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>> >>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >
>> >
>> > --
>> > http://www.needhamsoftware.com (work)
>> > http://www.the111shift.com (play)
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by David Smiley <ds...@apache.org>.

There already is one: https://issues.apache.org/jira/browse/LUCENE-8776

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla <ro...@gmail.com> wrote:

> I'll have to somehow find a solution for this situation, giving up
> offsets seems like too big a price to pay, I see that overriding
> DefaultIndexingChain is not exactly easy -- the only thing I can think
> of is to just trick the classloader into giving it a different version
> of the chain (praying this can be done without compromising security,
> I have not followed JDK evolutions for some time...) - aside from
> forking lucene and editing that; which I decidedly don't want to do
> (monkey-patching it, ok, i can live with that... :-))
>
> It *seems* to me that the original reason for negative offset checks
> stemmed from the fact that vint could have been written (and possibly
> vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738
>
> but the underlying issue and some of the patches seem to have been
> addressing those problems; but a much shorter version of the patch was
> committed -- despite the perf results not being indicative (i.e. it
> could have been good with the longer patch) -- but to really
> understand it, one would have to spend more than 10mins reading the
> comments
>
> Further to the point, I think negative offsets can be produced only on
> the very first token, unless there is a bug in a filter (there was/is
> a separate check for that in 6x and perhaps it is still there in 7x).
> That would be much less restrictive than the current condition which
> disallows all backward offsets. We never ran into an index corruption
> in lucene 4-6x, so I really wonder if the "forbid all backwards
> offsets" approach might be too restrictive.
>
> Looks like I should create an issue...
>
> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gu...@gmail.com> wrote:
> >
> > I've had a nearly identical experience to what Dave describes, I also
> chafe under this restriction.
> >
> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley <ds...@apache.org> wrote:
> >>
> >> I sympathize with your pain, Roman.
> >>
> >> It appears we can't really do index-time multi-word synonyms because of
> the offset ordering rule.  But it's not just synonyms, it's other forms of
> multi-token expansion.  Where I work, I've seen an interesting approach to
> mixed language text analysis in which a sophisticated Tokenizer effectively
> re-tokenizes an input multiple ways by producing a token stream that is a
> concatenation of different interpretations of the input.  On a Lucene
> upgrade, we had to "coarsen" the offsets to the point of having highlights
> that point to a whole sentence instead of the words in that sentence :-(.
> I need to do something to fix this; I'm trying hard to resist modifying our
> Lucene fork for this constraint.  Maybe instead of concatenating, it might
> be interleaved / overlapped but the interpretations aren't necessarily
> aligned to make this possible without risking breaking position-sensitive
> queries.
> >>
> >> So... I'm not a fan of this constraint on offsets.
> >>
> >> ~ David Smiley
> >> Apache Lucene/Solr Search Developer
> >> http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com>
> wrote:
> >>>
> >>> Hi Mike,
> >>>
> >>> Yes, they are not zero offsets - I was instinctively avoiding
> >>> "negative offsets"; but they are indeed backward offsets.
> >>>
> >>> Here is the token stream as produced by the analyzer chain indexing
> >>> "THE HUBBLE constant: a summary of the hubble space telescope program"
> >>>
> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> >>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >>>
> >>> Sometimes, we'll even have a situation when synonyms overlap: for
> >>> example "anti de sitter space time"
> >>>
> >>> "anti de sitter space time" -> "antidesitter space" (one token
> >>> spanning offsets 0-26; it gets emitted with the first token "anti"
> >>> right now)
> >>> "space time" -> "spacetime" (synonym 16-26)
> >>> "space" -> "universe" (25-26)
> >>>
> >>> Yes, weird, but useful if people want to search for `universe NEAR
> >>> anti` -- but another usecase which would be prohibited by the "new"
> >>> rule.
> >>>
> >>> DefaultIndexingChain checks new token offset against the last emitted
> >>> token, so I don't see a way to emit the multi-token synonym with
> >>> offsetts spanning multiple tokens if even one of these tokens was
> >>> already emitted. And the complement is equally true: if multi-token is
> >>> emitted as last of the group - it trips over `startOffset <
> >>> invertState.lastStartOffset`
> >>>
> >>>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>>
> >>>
> >>>   -roman
> >>>
> >>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >>> <lu...@mikemccandless.com> wrote:
> >>> >
> >>> > Hi Roman,
> >>> >
> >>> > Hmm, this is all very tricky!
> >>> >
> >>> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> offsets" that your analysis chain is trying to produce?
> >>> >
> >>> > Second, in your first example, if you output the tokens in the right
> order, they would not violate the "offsets do not go backwards" check in
> IndexWriter?  I thought IndexWriter is just checking that the startOffset
> for a token is not lower than the previous token's startOffset?  (And that
> the token's endOffset is not lower than its startOffset).
> >>> >
> >>> > So I am confused why your first example is tripping up on IW's
> offset checks.  Could you maybe redo the example, listing single token per
> line with the start/end offsets they are producing?
> >>> >
> >>> > Mike McCandless
> >>> >
> >>> > http://blog.mikemccandless.com
> >>> >
> >>> >
> >>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >>> >>
> >>> >> Hello devs,
> >>> >>
> >>> >> I wanted to create an issue but the helpful message in red letters
> >>> >> reminded me to ask first.
> >>> >>
> >>> >> While porting from lucene 6.x to 7x I'm struggling with a change
> that
> >>> >> was introduced in LUCENE-7626
> >>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >>> >>
> >>> >> It is believed that zero offset tokens are bad bad - Mike McCandles
> >>> >> made the change which made me automatically doubt myself. I must be
> >>> >> wrong, hell, I was living in sin the past 5 years!
> >>> >>
> >>> >> Sadly, we have been indexing and searching large volumes of data
> >>> >> without any corruption in index whatsover, but also without this new
> >>> >> change:
> >>> >>
> >>> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >>> >>
> >>> >> With that change, our multi-token synonyms house of cards is
> falling.
> >>> >>
> >>> >> Mike has this wonderful blogpost explaining troubles with
> multi-token synonyms:
> >>> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>> >>
> >>> >> Recommended way to index multi-token synonyms appears to be this:
> >>> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >>> >>
> >>> >> BUT, but! We don't want to place multi-token synonym into the same
> >>> >> position as the other words. We want to preserve their positions! We
> >>> >> want to preserve informaiton about offsets!
> >>> >>
> >>> >> Here is an example:
> >>> >>
> >>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE
> program
> >>> >>
> >>> >> This is how it gets indexed
> >>> >>
> >>> >> [(0, []),
> >>> >> (1, ['acr::hubble']),
> >>> >> (2, ['constant']),
> >>> >> (3, ['summary']),
> >>> >> (4, []),
> >>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble'']),
> >>> >> (6, ['acr::space', 'space']),
> >>> >> (7, ['acr::telescope', 'telescope']),
> >>> >> (8, ['program']),
> >>> >>
> >>> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >>> >> telescope` token is on the first token which started the group
> >>> >> (emitted by Lucene's synonym filter). hst is another synonym; we
> also
> >>> >> index the 'hubble' word there.
> >>> >>
> >>> >>  if you were to search for a phrase "HST program" it will be found
> >>> >> because our search parser will search for ("HST ? ? program" |
> "Hubble
> >>> >> Space Telescope program")
> >>> >>
> >>> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >>> >>
> >>> >> And because of those funny 'syn::' prefixes, we don't suffer from
> the
> >>> >> other problem that Mike described -- "hst space" phrase search will
> >>> >> NOT find this paper (and that is a correct behaviour)
> >>> >>
> >>> >> But all of this is possible only because lucene was indexing tokens
> >>> >> with offsets that can be lower than the last emitted token; for
> >>> >> example 'hubble space telescope' wil have offset 21-45; and the next
> >>> >> emitted token "space" will have offset 28-33
> >>> >>
> >>> >> And it just works (lucene 6.x)
> >>> >>
> >>> >> Here is another proof with the appropriate verbiage ("crazy"):
> >>> >>
> >>> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >>> >>
> >>> >> Zero offsets have been working wonderfully for us so far. And I
> >>> >> actually cannot imagine how it can work without them - i.e. without
> >>> >> the ability to emit a token stream with offsets that are lower than
> >>> >> the last seen token.
> >>> >>
> >>> >> I haven't tried SynonymFlatten filter, but because of this line in
> the
> >>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >>> >> to do what we need (as seen in the example above)
> >>> >>
> >>> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>> >>
> >>> >> What would you say? Is it a bug, is it not a bug but just some
> special
> >>> >> usecase? If it is a special usecase, what do we need to do? Plug in
> >>> >> our own indexing chain?
> >>> >>
> >>> >> Thanks!
> >>> >>
> >>> >>   -roman
> >>> >>
> >>> >>
> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

I'll have to somehow find a solution for this situation, giving up
offsets seems like too big a price to pay, I see that overriding
DefaultIndexingChain is not exactly easy -- the only thing I can think
of is to just trick the classloader into giving it a different version
of the chain (praying this can be done without compromising security,
I have not followed JDK evolutions for some time...) - aside from
forking lucene and editing that; which I decidedly don't want to do
(monkey-patching it, ok, i can live with that... :-))

It *seems* to me that the original reason for negative offset checks
stemmed from the fact that vint could have been written (and possibly
vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738

but the underlying issue and some of the patches seem to have been
addressing those problems; but a much shorter version of the patch was
committed -- despite the perf results not being indicative (i.e. it
could have been good with the longer patch) -- but to really
understand it, one would have to spend more than 10mins reading the
comments

Further to the point, I think negative offsets can be produced only on
the very first token, unless there is a bug in a filter (there was/is
a separate check for that in 6x and perhaps it is still there in 7x).
That would be much less restrictive than the current condition which
disallows all backward offsets. We never ran into an index corruption
in lucene 4-6x, so I really wonder if the "forbid all backwards
offsets" approach might be too restrictive.

Looks like I should create an issue...

On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gu...@gmail.com> wrote:
>
> I've had a nearly identical experience to what Dave describes, I also chafe under this restriction.
>
> On Thu, Aug 6, 2020 at 11:07 AM David Smiley <ds...@apache.org> wrote:
>>
>> I sympathize with your pain, Roman.
>>
>> It appears we can't really do index-time multi-word synonyms because of the offset ordering rule.  But it's not just synonyms, it's other forms of multi-token expansion.  Where I work, I've seen an interesting approach to mixed language text analysis in which a sophisticated Tokenizer effectively re-tokenizes an input multiple ways by producing a token stream that is a concatenation of different interpretations of the input.  On a Lucene upgrade, we had to "coarsen" the offsets to the point of having highlights that point to a whole sentence instead of the words in that sentence :-(.  I need to do something to fix this; I'm trying hard to resist modifying our Lucene fork for this constraint.  Maybe instead of concatenating, it might be interleaved / overlapped but the interpretations aren't necessarily aligned to make this possible without risking breaking position-sensitive queries.
>>
>> So... I'm not a fan of this constraint on offsets.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, they are not zero offsets - I was instinctively avoiding
>>> "negative offsets"; but they are indeed backward offsets.
>>>
>>> Here is the token stream as produced by the analyzer chain indexing
>>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>>
>>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>>
>>> Sometimes, we'll even have a situation when synonyms overlap: for
>>> example "anti de sitter space time"
>>>
>>> "anti de sitter space time" -> "antidesitter space" (one token
>>> spanning offsets 0-26; it gets emitted with the first token "anti"
>>> right now)
>>> "space time" -> "spacetime" (synonym 16-26)
>>> "space" -> "universe" (25-26)
>>>
>>> Yes, weird, but useful if people want to search for `universe NEAR
>>> anti` -- but another usecase which would be prohibited by the "new"
>>> rule.
>>>
>>> DefaultIndexingChain checks new token offset against the last emitted
>>> token, so I don't see a way to emit the multi-token synonym with
>>> offsetts spanning multiple tokens if even one of these tokens was
>>> already emitted. And the complement is equally true: if multi-token is
>>> emitted as last of the group - it trips over `startOffset <
>>> invertState.lastStartOffset`
>>>
>>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>>
>>>
>>>   -roman
>>>
>>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>>> <lu...@mikemccandless.com> wrote:
>>> >
>>> > Hi Roman,
>>> >
>>> > Hmm, this is all very tricky!
>>> >
>>> > First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>>> >
>>> > Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>>> >
>>> > So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>>> >
>>> > Mike McCandless
>>> >
>>> > http://blog.mikemccandless.com
>>> >
>>> >
>>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>>> >>
>>> >> Hello devs,
>>> >>
>>> >> I wanted to create an issue but the helpful message in red letters
>>> >> reminded me to ask first.
>>> >>
>>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>>> >> was introduced in LUCENE-7626
>>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>> >>
>>> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>>> >> made the change which made me automatically doubt myself. I must be
>>> >> wrong, hell, I was living in sin the past 5 years!
>>> >>
>>> >> Sadly, we have been indexing and searching large volumes of data
>>> >> without any corruption in index whatsover, but also without this new
>>> >> change:
>>> >>
>>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>> >>
>>> >> With that change, our multi-token synonyms house of cards is falling.
>>> >>
>>> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>> >>
>>> >> Recommended way to index multi-token synonyms appears to be this:
>>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>> >>
>>> >> BUT, but! We don't want to place multi-token synonym into the same
>>> >> position as the other words. We want to preserve their positions! We
>>> >> want to preserve informaiton about offsets!
>>> >>
>>> >> Here is an example:
>>> >>
>>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>> >>
>>> >> This is how it gets indexed
>>> >>
>>> >> [(0, []),
>>> >> (1, ['acr::hubble']),
>>> >> (2, ['constant']),
>>> >> (3, ['summary']),
>>> >> (4, []),
>>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>>> >> (6, ['acr::space', 'space']),
>>> >> (7, ['acr::telescope', 'telescope']),
>>> >> (8, ['program']),
>>> >>
>>> >> Notice the position 5 - multi-token synonym `syn::hubble space
>>> >> telescope` token is on the first token which started the group
>>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>>> >> index the 'hubble' word there.
>>> >>
>>> >>  if you were to search for a phrase "HST program" it will be found
>>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>>> >> Space Telescope program")
>>> >>
>>> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>>> >>
>>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>>> >> other problem that Mike described -- "hst space" phrase search will
>>> >> NOT find this paper (and that is a correct behaviour)
>>> >>
>>> >> But all of this is possible only because lucene was indexing tokens
>>> >> with offsets that can be lower than the last emitted token; for
>>> >> example 'hubble space telescope' wil have offset 21-45; and the next
>>> >> emitted token "space" will have offset 28-33
>>> >>
>>> >> And it just works (lucene 6.x)
>>> >>
>>> >> Here is another proof with the appropriate verbiage ("crazy"):
>>> >>
>>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>>> >>
>>> >> Zero offsets have been working wonderfully for us so far. And I
>>> >> actually cannot imagine how it can work without them - i.e. without
>>> >> the ability to emit a token stream with offsets that are lower than
>>> >> the last seen token.
>>> >>
>>> >> I haven't tried SynonymFlatten filter, but because of this line in the
>>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>>> >> to do what we need (as seen in the example above)
>>> >>
>>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>> >>
>>> >> What would you say? Is it a bug, is it not a bug but just some special
>>> >> usecase? If it is a special usecase, what do we need to do? Plug in
>>> >> our own indexing chain?
>>> >>
>>> >> Thanks!
>>> >>
>>> >>   -roman
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Gus Heck <gu...@gmail.com>.

I've had a nearly identical experience to what Dave describes, I also chafe
under this restriction.

On Thu, Aug 6, 2020 at 11:07 AM David Smiley <ds...@apache.org> wrote:

> I sympathize with your pain, Roman.
>
> It appears we can't really do index-time multi-word synonyms because of
> the offset ordering rule.  But it's not just synonyms, it's other forms of
> multi-token expansion.  Where I work, I've seen an interesting approach to
> mixed language text analysis in which a sophisticated Tokenizer effectively
> re-tokenizes an input multiple ways by producing a token stream that is a
> concatenation of different interpretations of the input.  On a Lucene
> upgrade, we had to "coarsen" the offsets to the point of having highlights
> that point to a whole sentence instead of the words in that sentence :-(.
> I need to do something to fix this; I'm trying hard to resist modifying our
> Lucene fork for this constraint.  Maybe instead of concatenating, it might
> be interleaved / overlapped but the interpretations aren't necessarily
> aligned to make this possible without risking breaking position-sensitive
> queries.
>
> So... I'm not a fan of this constraint on offsets.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:
>
>> Hi Mike,
>>
>> Yes, they are not zero offsets - I was instinctively avoiding
>> "negative offsets"; but they are indeed backward offsets.
>>
>> Here is the token stream as produced by the analyzer chain indexing
>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>
>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
>> offsetEnd=60
>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>
>> Sometimes, we'll even have a situation when synonyms overlap: for
>> example "anti de sitter space time"
>>
>> "anti de sitter space time" -> "antidesitter space" (one token
>> spanning offsets 0-26; it gets emitted with the first token "anti"
>> right now)
>> "space time" -> "spacetime" (synonym 16-26)
>> "space" -> "universe" (25-26)
>>
>> Yes, weird, but useful if people want to search for `universe NEAR
>> anti` -- but another usecase which would be prohibited by the "new"
>> rule.
>>
>> DefaultIndexingChain checks new token offset against the last emitted
>> token, so I don't see a way to emit the multi-token synonym with
>> offsetts spanning multiple tokens if even one of these tokens was
>> already emitted. And the complement is equally true: if multi-token is
>> emitted as last of the group - it trips over `startOffset <
>> invertState.lastStartOffset`
>>
>>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>
>>
>>   -roman
>>
>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> >
>> > Hi Roman,
>> >
>> > Hmm, this is all very tricky!
>> >
>> > First off, why do you call this "zero offsets"?  Isn't it "backwards
>> offsets" that your analysis chain is trying to produce?
>> >
>> > Second, in your first example, if you output the tokens in the right
>> order, they would not violate the "offsets do not go backwards" check in
>> IndexWriter?  I thought IndexWriter is just checking that the startOffset
>> for a token is not lower than the previous token's startOffset?  (And that
>> the token's endOffset is not lower than its startOffset).
>> >
>> > So I am confused why your first example is tripping up on IW's offset
>> checks.  Could you maybe redo the example, listing single token per line
>> with the start/end offsets they are producing?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
>> wrote:
>> >>
>> >> Hello devs,
>> >>
>> >> I wanted to create an issue but the helpful message in red letters
>> >> reminded me to ask first.
>> >>
>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
>> >> was introduced in LUCENE-7626
>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
>> >>
>> >> It is believed that zero offset tokens are bad bad - Mike McCandles
>> >> made the change which made me automatically doubt myself. I must be
>> >> wrong, hell, I was living in sin the past 5 years!
>> >>
>> >> Sadly, we have been indexing and searching large volumes of data
>> >> without any corruption in index whatsover, but also without this new
>> >> change:
>> >>
>> >>
>> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>> >>
>> >> With that change, our multi-token synonyms house of cards is falling.
>> >>
>> >> Mike has this wonderful blogpost explaining troubles with multi-token
>> synonyms:
>> >>
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>> >>
>> >> Recommended way to index multi-token synonyms appears to be this:
>> >>
>> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>> >>
>> >> BUT, but! We don't want to place multi-token synonym into the same
>> >> position as the other words. We want to preserve their positions! We
>> >> want to preserve informaiton about offsets!
>> >>
>> >> Here is an example:
>> >>
>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>> >>
>> >> This is how it gets indexed
>> >>
>> >> [(0, []),
>> >> (1, ['acr::hubble']),
>> >> (2, ['constant']),
>> >> (3, ['summary']),
>> >> (4, []),
>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
>> 'hubble'']),
>> >> (6, ['acr::space', 'space']),
>> >> (7, ['acr::telescope', 'telescope']),
>> >> (8, ['program']),
>> >>
>> >> Notice the position 5 - multi-token synonym `syn::hubble space
>> >> telescope` token is on the first token which started the group
>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> >> index the 'hubble' word there.
>> >>
>> >>  if you were to search for a phrase "HST program" it will be found
>> >> because our search parser will search for ("HST ? ? program" | "Hubble
>> >> Space Telescope program")
>> >>
>> >> It simply found that by looking at synonyms: HST -> Hubble Space
>> Telescope
>> >>
>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
>> >> other problem that Mike described -- "hst space" phrase search will
>> >> NOT find this paper (and that is a correct behaviour)
>> >>
>> >> But all of this is possible only because lucene was indexing tokens
>> >> with offsets that can be lower than the last emitted token; for
>> >> example 'hubble space telescope' wil have offset 21-45; and the next
>> >> emitted token "space" will have offset 28-33
>> >>
>> >> And it just works (lucene 6.x)
>> >>
>> >> Here is another proof with the appropriate verbiage ("crazy"):
>> >>
>> >>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>> >>
>> >> Zero offsets have been working wonderfully for us so far. And I
>> >> actually cannot imagine how it can work without them - i.e. without
>> >> the ability to emit a token stream with offsets that are lower than
>> >> the last seen token.
>> >>
>> >> I haven't tried SynonymFlatten filter, but because of this line in the
>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> >> to do what we need (as seen in the example above)
>> >>
>> >>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>> >>
>> >> What would you say? Is it a bug, is it not a bug but just some special
>> >> usecase? If it is a special usecase, what do we need to do? Plug in
>> >> our own indexing chain?
>> >>
>> >> Thanks!
>> >>
>> >>   -roman
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by David Smiley <ds...@apache.org>.

I sympathize with your pain, Roman.

It appears we can't really do index-time multi-word synonyms because of the
offset ordering rule.  But it's not just synonyms, it's other forms of
multi-token expansion.  Where I work, I've seen an interesting approach to
mixed language text analysis in which a sophisticated Tokenizer effectively
re-tokenizes an input multiple ways by producing a token stream that is a
concatenation of different interpretations of the input.  On a Lucene
upgrade, we had to "coarsen" the offsets to the point of having highlights
that point to a whole sentence instead of the words in that sentence :-(.
I need to do something to fix this; I'm trying hard to resist modifying our
Lucene fork for this constraint.  Maybe instead of concatenating, it might
be interleaved / overlapped but the interpretations aren't necessarily
aligned to make this possible without risking breaking position-sensitive
queries.

So... I'm not a fan of this constraint on offsets.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <ro...@gmail.com> wrote:

> Hi Mike,
>
> Yes, they are not zero offsets - I was instinctively avoiding
> "negative offsets"; but they are indeed backward offsets.
>
> Here is the token stream as produced by the analyzer chain indexing
> "THE HUBBLE constant: a summary of the hubble space telescope program"
>
> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38
> offsetEnd=60
> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>
> Sometimes, we'll even have a situation when synonyms overlap: for
> example "anti de sitter space time"
>
> "anti de sitter space time" -> "antidesitter space" (one token
> spanning offsets 0-26; it gets emitted with the first token "anti"
> right now)
> "space time" -> "spacetime" (synonym 16-26)
> "space" -> "universe" (25-26)
>
> Yes, weird, but useful if people want to search for `universe NEAR
> anti` -- but another usecase which would be prohibited by the "new"
> rule.
>
> DefaultIndexingChain checks new token offset against the last emitted
> token, so I don't see a way to emit the multi-token synonym with
> offsetts spanning multiple tokens if even one of these tokens was
> already emitted. And the complement is equally true: if multi-token is
> emitted as last of the group - it trips over `startOffset <
> invertState.lastStartOffset`
>
>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>
>
>   -roman
>
> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> <lu...@mikemccandless.com> wrote:
> >
> > Hi Roman,
> >
> > Hmm, this is all very tricky!
> >
> > First off, why do you call this "zero offsets"?  Isn't it "backwards
> offsets" that your analysis chain is trying to produce?
> >
> > Second, in your first example, if you output the tokens in the right
> order, they would not violate the "offsets do not go backwards" check in
> IndexWriter?  I thought IndexWriter is just checking that the startOffset
> for a token is not lower than the previous token's startOffset?  (And that
> the token's endOffset is not lower than its startOffset).
> >
> > So I am confused why your first example is tripping up on IW's offset
> checks.  Could you maybe redo the example, listing single token per line
> with the start/end offsets they are producing?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com>
> wrote:
> >>
> >> Hello devs,
> >>
> >> I wanted to create an issue but the helpful message in red letters
> >> reminded me to ask first.
> >>
> >> While porting from lucene 6.x to 7x I'm struggling with a change that
> >> was introduced in LUCENE-7626
> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >>
> >> It is believed that zero offset tokens are bad bad - Mike McCandles
> >> made the change which made me automatically doubt myself. I must be
> >> wrong, hell, I was living in sin the past 5 years!
> >>
> >> Sadly, we have been indexing and searching large volumes of data
> >> without any corruption in index whatsover, but also without this new
> >> change:
> >>
> >>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >>
> >> With that change, our multi-token synonyms house of cards is falling.
> >>
> >> Mike has this wonderful blogpost explaining troubles with multi-token
> synonyms:
> >>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>
> >> Recommended way to index multi-token synonyms appears to be this:
> >>
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >>
> >> BUT, but! We don't want to place multi-token synonym into the same
> >> position as the other words. We want to preserve their positions! We
> >> want to preserve informaiton about offsets!
> >>
> >> Here is an example:
> >>
> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
> >>
> >> This is how it gets indexed
> >>
> >> [(0, []),
> >> (1, ['acr::hubble']),
> >> (2, ['constant']),
> >> (3, ['summary']),
> >> (4, []),
> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope',
> 'hubble'']),
> >> (6, ['acr::space', 'space']),
> >> (7, ['acr::telescope', 'telescope']),
> >> (8, ['program']),
> >>
> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >> telescope` token is on the first token which started the group
> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >> index the 'hubble' word there.
> >>
> >>  if you were to search for a phrase "HST program" it will be found
> >> because our search parser will search for ("HST ? ? program" | "Hubble
> >> Space Telescope program")
> >>
> >> It simply found that by looking at synonyms: HST -> Hubble Space
> Telescope
> >>
> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >> other problem that Mike described -- "hst space" phrase search will
> >> NOT find this paper (and that is a correct behaviour)
> >>
> >> But all of this is possible only because lucene was indexing tokens
> >> with offsets that can be lower than the last emitted token; for
> >> example 'hubble space telescope' wil have offset 21-45; and the next
> >> emitted token "space" will have offset 28-33
> >>
> >> And it just works (lucene 6.x)
> >>
> >> Here is another proof with the appropriate verbiage ("crazy"):
> >>
> >>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >>
> >> Zero offsets have been working wonderfully for us so far. And I
> >> actually cannot imagine how it can work without them - i.e. without
> >> the ability to emit a token stream with offsets that are lower than
> >> the last seen token.
> >>
> >> I haven't tried SynonymFlatten filter, but because of this line in the
> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >> to do what we need (as seen in the example above)
> >>
> >>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>
> >> What would you say? Is it a bug, is it not a bug but just some special
> >> usecase? If it is a special usecase, what do we need to do? Plug in
> >> our own indexing chain?
> >>
> >> Thanks!
> >>
> >>   -roman
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Roman Chyla <ro...@gmail.com>.

Hi Mike,

Yes, they are not zero offsets - I was instinctively avoiding
"negative offsets"; but they are indeed backward offsets.

Here is the token stream as produced by the analyzer chain indexing
"THE HUBBLE constant: a summary of the hubble space telescope program"

term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
term=constant pos=1 type=word offsetStart=11 offsetEnd=20
term=summary pos=1 type=word offsetStart=23 offsetEnd=30
term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
term=space pos=1 type=word offsetStart=45 offsetEnd=50
term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
term=program pos=1 type=word offsetStart=61 offsetEnd=68

Sometimes, we'll even have a situation when synonyms overlap: for
example "anti de sitter space time"

"anti de sitter space time" -> "antidesitter space" (one token
spanning offsets 0-26; it gets emitted with the first token "anti"
right now)
"space time" -> "spacetime" (synonym 16-26)
"space" -> "universe" (25-26)

Yes, weird, but useful if people want to search for `universe NEAR
anti` -- but another usecase which would be prohibited by the "new"
rule.

DefaultIndexingChain checks new token offset against the last emitted
token, so I don't see a way to emit the multi-token synonym with
offsetts spanning multiple tokens if even one of these tokens was
already emitted. And the complement is equally true: if multi-token is
emitted as last of the group - it trips over `startOffset <
invertState.lastStartOffset`

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915


  -roman

On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> Hi Roman,
>
> Hmm, this is all very tricky!
>
> First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" that your analysis chain is trying to produce?
>
> Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go backwards" check in IndexWriter?  I thought IndexWriter is just checking that the startOffset for a token is not lower than the previous token's startOffset?  (And that the token's endOffset is not lower than its startOffset).
>
> So I am confused why your first example is tripping up on IW's offset checks.  Could you maybe redo the example, listing single token per line with the start/end offsets they are producing?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:
>>
>> Hello devs,
>>
>> I wanted to create an issue but the helpful message in red letters
>> reminded me to ask first.
>>
>> While porting from lucene 6.x to 7x I'm struggling with a change that
>> was introduced in LUCENE-7626
>> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>
>> It is believed that zero offset tokens are bad bad - Mike McCandles
>> made the change which made me automatically doubt myself. I must be
>> wrong, hell, I was living in sin the past 5 years!
>>
>> Sadly, we have been indexing and searching large volumes of data
>> without any corruption in index whatsover, but also without this new
>> change:
>>
>> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>
>> With that change, our multi-token synonyms house of cards is falling.
>>
>> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> Recommended way to index multi-token synonyms appears to be this:
>> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>
>> BUT, but! We don't want to place multi-token synonym into the same
>> position as the other words. We want to preserve their positions! We
>> want to preserve informaiton about offsets!
>>
>> Here is an example:
>>
>> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>
>> This is how it gets indexed
>>
>> [(0, []),
>> (1, ['acr::hubble']),
>> (2, ['constant']),
>> (3, ['summary']),
>> (4, []),
>> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
>> (6, ['acr::space', 'space']),
>> (7, ['acr::telescope', 'telescope']),
>> (8, ['program']),
>>
>> Notice the position 5 - multi-token synonym `syn::hubble space
>> telescope` token is on the first token which started the group
>> (emitted by Lucene's synonym filter). hst is another synonym; we also
>> index the 'hubble' word there.
>>
>>  if you were to search for a phrase "HST program" it will be found
>> because our search parser will search for ("HST ? ? program" | "Hubble
>> Space Telescope program")
>>
>> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>>
>> And because of those funny 'syn::' prefixes, we don't suffer from the
>> other problem that Mike described -- "hst space" phrase search will
>> NOT find this paper (and that is a correct behaviour)
>>
>> But all of this is possible only because lucene was indexing tokens
>> with offsets that can be lower than the last emitted token; for
>> example 'hubble space telescope' wil have offset 21-45; and the next
>> emitted token "space" will have offset 28-33
>>
>> And it just works (lucene 6.x)
>>
>> Here is another proof with the appropriate verbiage ("crazy"):
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>>
>> Zero offsets have been working wonderfully for us so far. And I
>> actually cannot imagine how it can work without them - i.e. without
>> the ability to emit a token stream with offsets that are lower than
>> the last seen token.
>>
>> I haven't tried SynonymFlatten filter, but because of this line in the
>> DefaultIndexingChain - I'm convinced the flatten symbol is not going
>> to do what we need (as seen in the example above)
>>
>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>>
>> What would you say? Is it a bug, is it not a bug but just some special
>> usecase? If it is a special usecase, what do we need to do? Plug in
>> our own indexing chain?
>>
>> Thanks!
>>
>>   -roman
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Roman,

Hmm, this is all very tricky!

First off, why do you call this "zero offsets"?  Isn't it "backwards
offsets" that your analysis chain is trying to produce?

Second, in your first example, if you output the tokens in the right order,
they would not violate the "offsets do not go backwards" check in
IndexWriter?  I thought IndexWriter is just checking that the startOffset
for a token is not lower than the previous token's startOffset?  (And that
the token's endOffset is not lower than its startOffset).

So I am confused why your first example is tripping up on IW's offset
checks.  Could you maybe redo the example, listing single token per line
with the start/end offsets they are producing?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <ro...@gmail.com> wrote:

> Hello devs,
>
> I wanted to create an issue but the helpful message in red letters
> reminded me to ask first.
>
> While porting from lucene 6.x to 7x I'm struggling with a change that
> was introduced in LUCENE-7626
> (https://issues.apache.org/jira/browse/LUCENE-7626)
>
> It is believed that zero offset tokens are bad bad - Mike McCandles
> made the change which made me automatically doubt myself. I must be
> wrong, hell, I was living in sin the past 5 years!
>
> Sadly, we have been indexing and searching large volumes of data
> without any corruption in index whatsover, but also without this new
> change:
>
>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>
> With that change, our multi-token synonyms house of cards is falling.
>
> Mike has this wonderful blogpost explaining troubles with multi-token
> synonyms:
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> Recommended way to index multi-token synonyms appears to be this:
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>
> BUT, but! We don't want to place multi-token synonym into the same
> position as the other words. We want to preserve their positions! We
> want to preserve informaiton about offsets!
>
> Here is an example:
>
> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>
> This is how it gets indexed
>
> [(0, []),
> (1, ['acr::hubble']),
> (2, ['constant']),
> (3, ['summary']),
> (4, []),
> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']),
> (6, ['acr::space', 'space']),
> (7, ['acr::telescope', 'telescope']),
> (8, ['program']),
>
> Notice the position 5 - multi-token synonym `syn::hubble space
> telescope` token is on the first token which started the group
> (emitted by Lucene's synonym filter). hst is another synonym; we also
> index the 'hubble' word there.
>
>  if you were to search for a phrase "HST program" it will be found
> because our search parser will search for ("HST ? ? program" | "Hubble
> Space Telescope program")
>
> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>
> And because of those funny 'syn::' prefixes, we don't suffer from the
> other problem that Mike described -- "hst space" phrase search will
> NOT find this paper (and that is a correct behaviour)
>
> But all of this is possible only because lucene was indexing tokens
> with offsets that can be lower than the last emitted token; for
> example 'hubble space telescope' wil have offset 21-45; and the next
> emitted token "space" will have offset 28-33
>
> And it just works (lucene 6.x)
>
> Here is another proof with the appropriate verbiage ("crazy"):
>
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>
> Zero offsets have been working wonderfully for us so far. And I
> actually cannot imagine how it can work without them - i.e. without
> the ability to emit a token stream with offsets that are lower than
> the last seen token.
>
> I haven't tried SynonymFlatten filter, but because of this line in the
> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> to do what we need (as seen in the example above)
>
>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>
> What would you say? Is it a bug, is it not a bug but just some special
> usecase? If it is a special usecase, what do we need to do? Plug in
> our own indexing chain?
>
> Thanks!
>
>   -roman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>