Posted to dev@netbeans.apache.org by Eirik Bakke <eb...@ultorg.com> on 2018/12/01 01:05:58 UTC

RE: Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

Just in case it's useful: A while back I wrote a very carefully implemented adapter from NetBeans' org.netbeans.spi.lexer.LexerInput class to ANTLR's org.antlr.v4.runtime.CharStream class:

https://gist.github.com/eirikbakke/51cf4c9375880acd4741/c83dd7e64b91674c6c2bf9d8473c7249a6d66ceb

The equivalent class in the repo you point to seems to be this one:

https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/master/1.2.1/ANTLR4PLGNB82/src/org/nemesis/antlr/v4/netbeans/v8/grammar/code/coloring/ANTLRv4CharStream.java

I remember jumping through various hoops to deal with EOF correctly... you could always try to replace the existing CharStream implementation with mine and see what changes.

-- Eirik

-----Original Message-----
From: Tim Boudreau <ni...@gmail.com> 
Sent: Friday, November 30, 2018 6:29 PM
To: dev@netbeans.incubator.apache.org; dev@netbeans.apache.org
Subject: Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

As I mentioned briefly a while back, I decided to do some patching of the Antlr module on GitHub.  I'm hoping the outcome of this is both

 1. Better Antlr support - in particular
   1a. add a bunch of missing features, like code formatting and semantic highlighting
   1b. the ability to associate a file extension with a grammar and actually edit files with syntax and error highlighting (a preview window lets you tie colorings to specific token types and rules) - https://timboudreau.com/files/screen/11-30-2018_18-17-53.png
   1c. Much improved cycle time between making an edit to your grammar and seeing if you broke something - as in, instantaneous
 2. Some modules that make integrating languages based on Antlr grammars really easy - there is a lot of identical adapter boilerplate everyone needs for that, and some impedance mismatches everyone has to discover the hard way, such as:
   2a. If your grammar doesn't consume EOF, it may not tokenize the entire file, and that will make your NetBeans lexer explode horribly in the middle of painting the main window, making the IDE unusable
   2b. Antlr EOF tokens may actually contain some text, which, if you don't consume it, also detonates a bomb in the middle of painting
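
One way to picture the fix for 2a and 2b is a pass that pads the token list so it accounts for every character of the file. The sketch below is only an illustration under assumptions: `TokenCoverage`, `Span` and the `UNLEXED` type name are hypothetical and are not part of Antlr or NetBeans. It assumes the spans arrive sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: pads a token list so it covers every character of the input. */
final class TokenCoverage {

    /** A half-open span [start, end) with a type name; a stand-in for a real token. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /**
     * Returns a copy of tokens (assumed sorted and non-overlapping) in which
     * zero-length spans are dropped and every uncovered range, including any
     * trailing text, is filled with a synthetic "UNLEXED" span.
     */
    static List<Span> cover(List<Span> tokens, int textLength) {
        List<Span> out = new ArrayList<>();
        int pos = 0;
        for (Span t : tokens) {
            if (t.end <= t.start) {
                continue; // e.g. an EOF token that carries no text
            }
            if (t.start > pos) {
                out.add(new Span(pos, t.start, "UNLEXED")); // gap before this token
            }
            out.add(t); // an EOF token that does carry text is kept like any other
            pos = t.end;
        }
        if (pos < textLength) {
            out.add(new Span(pos, textLength, "UNLEXED")); // text the grammar never reached
        }
        return out;
    }
}
```

With spans `ID[0,3)`, `NUM[4,6)` and an empty `EOF[6,6)` over 10 characters of text, `cover` yields four spans that together run from 0 to 10 with no holes.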

In particular, doing the dynamic syntax highlighting means programmatically registering new languages that appear, disappear, and have their set of tokens change on the fly, persisting the mime-type:grammar mappings across restarts, *and* doing something reasonable in the case that the grammar was deleted or is in an unusable state.  Not to mention generating Java source files with Antlr into an in-memory filesystem, running javac against all that and invoking the result in an isolated classloader, extracting a complete lex and parse, and feeding that into all of that language machinery (believe it or not, on my laptop, all of that can run in 82 milliseconds for a fairly complex grammar - you really can run antlr generation, compile, load and invoke a parser on every keystroke - though it was a lot of work getting there).
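
The compile-in-memory, load-in-an-isolated-classloader step can be sketched with nothing but the JDK's `javax.tools` API. This is a rough sketch, not the code from the plugin: Antlr generation is left out, and `InMemoryCompile`, `Gen` and `count` are hypothetical names standing in for the generated lexer and parser. Only a `String` goes in and an `int` comes out, so no types from inside the loader cross the boundary.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

final class InMemoryCompile {

    /** Java source held in a String instead of a file. */
    static final class Source extends SimpleJavaFileObject {
        private final String code;
        Source(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    /** Class file bytes captured in memory instead of written to disk. */
    static final class Output extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Output(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension), Kind.CLASS);
        }
        @Override
        public OutputStream openOutputStream() { return bytes; }
    }

    static Class<?> compileAndLoad(String className, String code) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system compiler; this needs a JDK");
        }
        final Map<String, Output> outputs = new HashMap<>();
        StandardJavaFileManager std = compiler.getStandardFileManager(null, null, null);
        JavaFileManager mem = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                Output out = new Output(name); // redirect every class file into memory
                outputs.put(name, out);
                return out;
            }
        };
        Boolean ok = compiler.getTask(null, mem, null, null, null,
                Collections.singletonList(new Source(className, code))).call();
        mem.close();
        if (!ok) {
            throw new IllegalStateException("compilation failed");
        }
        // Null parent: only bootstrap classes are visible, nothing from the application.
        ClassLoader isolated = new ClassLoader((ClassLoader) null) {
            @Override
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                Output out = outputs.get(name);
                if (out == null) throw new ClassNotFoundException(name);
                byte[] b = out.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        return isolated.loadClass(className);
    }

    /** Compiles a tiny stand-in class and calls it reflectively with a plain String. */
    static int demo() {
        String src = "public class Gen { public static int count(String s) {"
                + " return s.split(\" \").length; } }";
        try {
            Class<?> gen = compileAndLoad("Gen", src);
            return (Integer) gen.getMethod("count", String.class).invoke(null, "a b c");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the loader and the byte map are only reachable through the returned `Class`, dropping that reference leaves everything eligible for garbage collection.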

NetBeans has some very nice infrastructure for declarative registration of languages, most of which is not terribly useful for this. Fortunately, registering a MimeDataProvider and a no-arg-constructor MIMEResolver solves most of that.

However, getting the Lexer integration solid - i.e. a lexer that has to work even with a completely hosed grammar driving it - seems to be my Waterloo.  Part of the problem is that the initialization order is necessarily backwards:  Ordinarily in NetBeans, you register a language, the LanguageHierarchy knows the token types it has, the editor infrastructure pokes at that at its leisure, and when it's ready, asks for a lexer to chew on some text.  But in this case, the language hierarchy doesn't know anything about the language until it has generated, compiled and invoked it - i.e. *during lexer initialization* is the first chance the language hierarchy has to actually get the set of token ids for the language (to a degree I can hack around this with per-mime-type ThreadLocal<String>, for cases where I can wrap the entry point that triggers lexer invocation).

So, some issues I'm wondering if anyone has guidance on (is Mila Metelka still around?  He would know this stuff inside and out):

 - LexerInput - for this case, I need to, on the first invocation of the lexer's nextToken(), or in its constructor, extract the entire text to be parsed, feed it through a generated Antlr lexer and build a list of tokens - nextToken() will simply return them:
   - It is non-obvious from the code and Javadoc, whether you should call readText() without first making calls to read() to sequentially read tokens - it appears to work, most of the time (in which case, what is calling read?) and is what examples generally do.
   - Sometimes you get a LexerInput which has already had some, but not all, characters read from it.  It is not at all clear what causes that (or whether rewinding is appropriate).
   - If you got text back from readText() during lexer initialization, parsed it and generated a pre-cooked list of tokens to return from your lexer, you still need to call read() to scan forward to the token you're returning (even though if readText() returned the complete text, the LexerInput presumably is already at EOF)
   - LexerInput behaves differently when invoked from inside the closure of LanguageHierarchy.createLexer() versus from within a call to nextToken() - sometimes it will return 0 length text (and be backed by a TextLexerInputOperation that has null backing text, a readEndOffset of 0, yet will return 65535, i.e. Character.MAX_VALUE, from calls to read()) when created against a document that definitely does have contents - I suspect some kind of race condition
 - WrapTokenIdCache - in the case that the set of tokens for the language has changed (which can happen while the lexer is being constructed), it is easy to get an AIOOBE because it is caching a set of token ids that is no longer correct.  I do have the LanguageHierarchy fire PROP_LANGUAGE when this is updated and throw away the existing LanguageHierarchy instance, but that does not abort use of the lexer whose constructor is currently running which was created by the old one (I can probably work around this by using the stale token IDs, though it's likely to screw with highlighting for the first reparse after every edit to the grammar).
    - In particular, this problem makes it impossible to return a "dummy parse" of an open file during IDE startup with fake token ids for EOF and "everything else", to avoid generating and compiling a ton of stuff before the main window opens - whatever WrapTokenIdCache maps to, it seems to persist past the lifetime of the LanguageHierarchy it was created for (maybe mapped to mime type string?)
 - When you get passed an empty LexerInput and an exception is thrown as the result of nextToken(), it appears that createLexer() is subsequently invoked over and over, each time with a LexerInput which contains one more character of the file
 - What is a lexer *actually* supposed to return for tokens when it is passed zero length text?  Returning an empty token is an error.  So is returning a zero length token. ???
 - Occasionally, completely nonsensical errors in parsing that I can't find any explanation for, that don't directly involve my code - see stack traces below
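
The "read everything up front, then replay pre-cooked tokens" pattern from the LexerInput bullets can be sketched as below. It is a simplified model under assumptions: `FakeInput` is a hypothetical stand-in that borrows the method names read(), readLength() and backup() but is NOT the NetBeans LexerInput, and its EOF handling is deliberately simpler. For the zero-length case, my understanding of the Lexer.nextToken() Javadoc is that returning null signals "no more input", but treat that as an assumption to verify.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: replays precomputed token lengths against a read()-style input. */
final class ReplaySketch {

    /** Minimal stand-in for a character input; NOT the NetBeans LexerInput class. */
    static final class FakeInput {
        static final int EOF = -1;
        private final String text;
        private int tokenStart; // offset where the current token begins
        private int pos;        // offset of the next character to read

        FakeInput(String text) {
            this.text = text;
        }

        int read() {
            return pos < text.length() ? text.charAt(pos++) : EOF;
        }

        int readLength() {
            return pos - tokenStart;
        }

        void backup(int count) {
            pos -= count;
        }

        /** Marks everything read so far as one finished token and returns its text. */
        String finishToken() {
            String s = text.substring(tokenStart, pos);
            tokenStart = pos;
            return s;
        }
    }

    /** Reads to EOF to capture the remaining text, then rewinds to where it started. */
    static String slurp(FakeInput in) {
        StringBuilder sb = new StringBuilder();
        int before = in.readLength();
        int c;
        while ((c = in.read()) != FakeInput.EOF) {
            sb.append((char) c);
        }
        in.backup(in.readLength() - before);
        return sb.toString();
    }

    /** Advances the input by exactly length characters and returns that token's text. */
    static String replay(FakeInput in, int length) {
        for (int i = 0; i < length; i++) {
            if (in.read() == FakeInput.EOF) {
                throw new IllegalStateException("token list is longer than the input");
            }
        }
        return in.finishToken();
    }

    /** Slurps the text (where a generated lexer would run), then replays the lengths. */
    static List<String> run(String text, int[] tokenLengths) {
        FakeInput in = new FakeInput(text);
        List<String> out = new ArrayList<>();
        String all = slurp(in);
        if (all.isEmpty()) {
            return out; // zero-length text: no tokens at all
        }
        for (int len : tokenLengths) {
            out.add(replay(in, len));
        }
        return out;
    }
}
```

The point of the rewind in `slurp` is that the input ends up exactly where it was, so each later token still has to be "earned" by calling read() the right number of times.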

I have a horrible, hacky lexer implementation that accidentally works most of the time.  Factoring the same code into something human-readable, weirdly, turns it into something that explodes on use - and the only difference seems to be timing and quantity of logging statements and possibly a smidgen of call ordering.

Any suggestions or hints on better ways to do any of this are welcome.

The raunchy-but-works-most-of-the-time lexer implementation is here:
https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/features-tdb/1.2.1/ANTLR4PLGNB82/src/org/nemesis/antlr/v4/netbeans/v8/grammar/file/preview/AdhocLexer.java

(the original author, I think, must have converted a bunch of svn branch folders to git by just committing them - the layout is kind of a mess, and everything would be easier if I mavenized it)

-Tim


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception occurred during token hierarchy updating. Token hierarchy will be rebuilt from scratch.
java.lang.IndexOutOfBoundsException: startOffset=1073741823, endOffset=80, inputSourceText.length()=80
    at org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
    at org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
    at org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:254)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
    at org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
    at org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
    at org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)
    at org.netbeans.lib.lexer.inc.DocumentInput.insertUpdate(DocumentInput.java:117)


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception occurred during token hierarchy updating. Token hierarchy will be rebuilt from scratch.
java.lang.IndexOutOfBoundsException: startOffset=-2147483569, endOffset=79, inputSourceText.length()=79
    at org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
    at org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
    at org.netbeans.lib.lexer.inc.IncTokenList.replaceTokens(IncTokenList.java:354)
    at org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:258)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
    at org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
    at org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
    at org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)
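
The odd numbers in these traces, and the 65535 seen from read(), are at least consistent with sentinel values and int overflow rather than random corruption. The arithmetic below reproduces them; which NetBeans field or code path actually produces them is not something this shows, and the class name `SentinelArithmetic` is hypothetical.

```java
/** Illustrative only: reproduces the suspicious numbers with plain int/char arithmetic. */
final class SentinelArithmetic {

    /** An int -1 (a common EOF marker) narrowed to char becomes 65535. */
    static int eofAsChar(int eof) {
        return (char) eof;
    }

    /** int addition wraps silently on overflow. */
    static int wrappingAdd(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(eofAsChar(-1));                      // 65535, Character.MAX_VALUE
        System.out.println(Integer.MAX_VALUE / 2);              // 1073741823, first trace
        System.out.println(wrappingAdd(Integer.MAX_VALUE, 80)); // -2147483569, second trace
    }
}
```

So the first startOffset equals Integer.MAX_VALUE / 2, the second equals Integer.MAX_VALUE + 80 after wrapping, and 65535 is what -1 looks like once it has been cast to char.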

-Tim


--
http://timboudreau.com

Re: Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

Posted by Tim Boudreau <ni...@gmail.com>.
On Fri, Nov 30, 2018 at 8:13 PM Eirik Bakke <eb...@ultorg.com> wrote:

> Just in case it's useful: A while back I wrote a very carefully
> implemented adapter from NetBeans' org.netbeans.spi.lexer.LexerInput class
> to ANTLR's org.antlr.v4.runtime.CharStream class:
>
>
> https://gist.github.com/eirikbakke/51cf4c9375880acd4741/c83dd7e64b91674c6c2bf9d8473c7249a6d66ceb
>

Thanks! Yeah, everybody using Antlr gets to write one of these and spend
way too much time debugging it :-)

In my peculiar case, I actually want to grab the entire text as a string
and pass it in that way, because the class loader the parsing happens in
may load the version of Antlr the project depends on. So no Antlr types
traverse that boundary in or out.

Currently that's not happening, but it's the right thing, and what I did in
the first implementation which did generation and compilation to disk
rather than memory (and was 12x slower). I suppose I could wrap it in a
dynamic proxy inside the classloader, but it would present an opportunity
to leak classes out of it, and it's important that everything involved can
be garbage collected.

-Tim



-- 
http://timboudreau.com