Posted to dev@netbeans.apache.org by ve...@gmail.com on 2018/12/27 04:35:51 UTC

LexerInput.read() returns null characters after Unicode characters.

I have the following string which I am trying to read through LexerInput.read(). 

quote
టోకెన్
quote

The six characters, individually: ట ో క ె న ్

5 (quote) + 6 (టోకెన్) + 5 (quote) + 2 newline characters, for a total of
18 characters. LexerInput.read() returns all 18 characters as expected, but
then it keeps going and returns a null character for each subsequent call to
LexerInput.read().
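
A quick sanity check of the count in plain Java (just an illustration,
outside the lexer -- every character here is a single UTF-16 char, so
length() and code point count agree):

    public class CountCheck {
        public static void main(String[] args) {
            String text = "quote\nటోకెన్\nquote";
            // 5 + 1 + 6 + 1 + 5 = 18 characters in all
            System.out.println(text.length());                         // 18
            System.out.println(text.codePointCount(0, text.length())); // 18
        }
    }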

I tried to stop at the first null character, back up one character, and
return null. But this is the error I see.

I guess it makes sense, since LexerInputOperation doesn't allow resetting
the offsets:

> returned null token but lexerInput.readLength()=1
> lexer-state: DEFAULT_STATE
> tokenStartOffset=18, readOffset=19, lookaheadOffset=19
> Chars: "


I realise this may not actually have anything to do with NetBeans, but since
there is a decent chance people have seen this behaviour before, any insight
would be helpful.

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@netbeans.incubator.apache.org
For additional commands, e-mail: dev-help@netbeans.incubator.apache.org

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists




Re: LexerInput.read() returns null characters after Unicode characters.

Posted by Tim Boudreau <ni...@gmail.com>.
On Thu, Dec 27, 2018 at 5:18 PM venkatram.akkineni@gmail.com
<venkatram.akkineni@gmail.com> wrote:

> First of all, thank you for the detailed answer.
>
> > LexerInput returns a primitive int. It cannot return null.
>
> Yes sir, so I get an integer value of zero. EOF, as you know, would be -1.
> I believe it is returning a null character, '\0'.


Most likely if you are seeing a 0, it is because there is a 0.

Any chance you're reading UTF-32 as UTF-16, or something like that? That
would get you unexpected zeros that are actually the other half of a
partially read character.
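
A minimal demonstration of that effect (a sketch -- the encodings here are
assumptions, not something from your setup):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class HalfRead {
        public static void main(String[] args) {
            // "quote" in UTF-32BE is four bytes per character: 00 00 00 71, ...
            byte[] utf32 = "quote".getBytes(Charset.forName("UTF-32BE"));
            // Decoding those bytes as UTF-16BE splits each character in
            // half, yielding a '\u0000' before every real character.
            String misread = new String(utf32, StandardCharsets.UTF_16BE);
            for (char c : misread.toCharArray()) {
                System.out.printf("U+%04X ", (int) c); // U+0000 U+0071 ...
            }
            System.out.println();
        }
    }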


> > If you are using ANTLR, does your grammar read the entire file? You need a
> > rule that includes EOF explicitly, or it is easy to have a grammar which
> > looks like it works most of the time, but for some files will hand you an
> > eof token without giving you tokens for the entire file - it does what you
> > tell it to, so if you didn't tell it that the content to parse ends only
> > when the end of the file is encountered, then once it has satisfied the
> > rules you gave it, it is "done" as far as it is concerned.
>
> This is a hand-written lexer. I humbly submit that ANTLR is beyond my
> comprehension. Actually, I don't think it is even ANTLR; I think it is
> the prospect of having to deal with generated code. The same reason has
> kept me away from CoffeeScript, Angular and a few others. I caught this
> issue while writing unit tests for the lexer. Seeing that coverage is at
> 80% at present, I should say I haven't encountered any unpredictable EOFs
> so far. Since I do the integer comparison manually using ==, it is hard
> to miss EOF characters.


In practice, the generated ANTLR code is pretty easy to deal with, but I've
hand-written lexers too. I wouldn't compare it with CoffeeScript and
similar, where your entire program is made pretty opaque - you get some
straightforward AST classes and visitor interfaces.


> > So, when in that state, read the remaining characters (if any) into a
> > StringBuilder, log them to stdout, see what they are and modify your
> > grammar or whatever does the lexing to ensure they really get processed.
>
> I will definitely try this. My suspicion is there are some invisible
> characters I am not seeing. Maybe printing them to the console will help.


Check what character encoding the bytes are being read with. If you're not
specifying it, you get whatever the system default is, which is always
wrong sometimes. If the lexer input is really handing you zeros, that's
probably the culprit.
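
Something along these lines makes the encoding explicit and dumps each
character so stray zeros show up (a sketch; UTF-8 is an assumption -- use
whatever the file is really written in):

    import java.io.IOException;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DumpChars {
        public static void main(String[] args) throws IOException {
            // Pass the charset explicitly instead of relying on the default
            try (Reader r = Files.newBufferedReader(Paths.get(args[0]),
                                                    StandardCharsets.UTF_8)) {
                int c;
                while ((c = r.read()) != -1) {
                    System.out.printf("U+%04X%n", c);
                }
            }
        }
    }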

-Tim





--
http://timboudreau.com

Re: LexerInput.read() returns null characters after Unicode characters.

Posted by ve...@gmail.com.
First of all, thank you for the detailed answer.

> LexerInput returns a primitive int. It cannot return null.

Yes sir, so I get an integer value of zero. EOF, as you know, would be -1. I
believe it is returning a null character, '\0'.

> The editor's lexer plumbing insists that tokens be returned for every
> character in a file. If there is a mismatch, it assumes that is a bug in
> the lexer. The sum of the lengths of all tokens returned while lexing must
> match the number of characters actually in the input. It looks like your
> lexer is trying to bail out without consuming all the characters in the
> file.

Yep, that is what led me to believe there may be more characters in the file
than I am actually seeing.

> *Your lexer* is returning a null token - signalling EOF - before the actual
> end of the file/document/input.

I did this just to see if it would work, and that is when I discovered that
the Lexer won't allow premature termination of lexing. One has to read until
EOF.
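
A rough sketch of what that seems to require inside nextToken(), assuming the
usual input and tokenFactory fields from LexerRestartInfo; MyTokenId.ERROR is
just an illustrative id from my own language, not part of the Lexer API:

    int c = input.read();
    if (c == LexerInput.EOF) {
        return null; // legal here: nothing read, so readLength() == 0
    }
    // ...try to match a real token; if nothing matches, sweep up the rest
    // of the input as an error token rather than returning null early
    do {
        c = input.read();
    } while (c != LexerInput.EOF);
    return tokenFactory.createToken(MyTokenId.ERROR);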

> If you are using ANTLR, does your grammar read the entire file? You need a
> rule that includes EOF explicitly, or it is easy to have a grammar which
> looks like it works most of the time, but for some files will hand you an
> eof token without giving you tokens for the entire file - it does what you
> tell it to, so if you didn't tell it that the content to parse ends only
> when the end of the file is encountered, then once it has satisfied the
> rules you gave it, it is "done" as far as it is concerned.

This is a hand-written lexer. I humbly submit that ANTLR is beyond my
comprehension. Actually, I don't think it is even ANTLR; I think it is the
prospect of having to deal with generated code. The same reason has kept me
away from CoffeeScript, Angular and a few others. I caught this issue while
writing unit tests for the lexer. Seeing that coverage is at 80% at present,
I should say I haven't encountered any unpredictable EOFs so far. Since I do
the integer comparison manually using ==, it is hard to miss EOF characters.

> So, when in that state, read the remaining characters (if any) into a
> StringBuilder, log them to stdout, see what they are and modify your
> grammar or whatever does the lexing to ensure they really get processed.

I will definitely try this. My suspicion is there are some invisible
characters I am not seeing. Maybe printing them to the console will help.





Re: LexerInput.read() returns null characters after Unicode characters.

Posted by Tim Boudreau <ni...@gmail.com>.
On Wed, Dec 26, 2018 at 11:35 PM venkatram.akkineni@gmail.com
<venkatram.akkineni@gmail.com> wrote:

> I have the following string which I am trying to read through
> LexerInput.read().
>
> quote
> టోకెన్
> quote
>
> ట ో క ె న ్
>
> 5 (quote) + 6 (టోకెన్) + 5 (quote) + 2 newline characters, for a total
> of 18 characters. LexerInput.read() returns all 18 characters as expected,
> but then it keeps going and returns a null character for each subsequent
> call to LexerInput.read().
>

LexerInput returns a primitive int.  It cannot return null.


> I tried to stop at the first null character, back up one character, and
> return null. But this is the error I see.
>

The editor's lexer plumbing insists that tokens be returned for every
character in a file.  If there is a mismatch, it assumes that is a bug in
the lexer.  The sum of the lengths of all tokens returned while lexing must
match the number of characters actually in the input.  It looks like your
lexer is trying to bail out without consuming all the characters in the
file.

If you are backing up one character, that guarantees the editor
infrastructure does not think it is at the end of the file when you return
null, and you will get an exception.
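
Roughly, the only legal shape for end of input is something like this
(a sketch; MyTokenId.ERROR stands in for whatever id fits your language):

    int c = input.read();
    if (c == LexerInput.EOF) {
        if (input.readLength() > 0) {
            // characters are pending: they must go into a token first,
            // and null can only be returned on the *next* call
            return tokenFactory.createToken(MyTokenId.ERROR);
        }
        return null; // genuinely at EOF with nothing left unconsumed
    }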


> I guess it makes sense, since LexerInputOperation doesn't allow resetting
> the offsets:
>
> > returned null token but lexerInput.readLength()=1
> > lexer-state: DEFAULT_STATE
> > tokenStartOffset=18, readOffset=19, lookaheadOffset=19
> > Chars: "
>

*Your lexer* is returning a null token - signalling EOF - before the actual
end of the file/document/input.

If you are using ANTLR, does your grammar read the entire file?  You need a
rule that includes EOF explicitly, or it is easy to have a grammar which
looks like it works most of the time, but for some files will hand you an
eof token without giving you tokens for the entire file - it does what you
tell it to, so if you didn't tell it that the content to parse ends only
when the end of the file is encountered, then once it has satisfied the
rules you gave it, it is "done" as far as it is concerned.  Frequently this
manifests as getting an EOF token from ANTLR which is not zero-length and
represents trailing whitespace - in which case, you need to return a token
for that from the call to nextToken(), and set a boolean (or however you
want to do it) to return null on the *next* call.  Or, make sure your
grammar will always, always return non-EOF tokens for every byte of input
it is given.
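
A sketch of that boolean trick in an Antlr-to-NetBeans adapter -- names like
antlrLexer, idFor() and MyTokenId.WHITESPACE are illustrative, not real API:

    private boolean eofReturned;

    public Token<MyTokenId> nextToken() {
        if (eofReturned) {
            return null;
        }
        org.antlr.v4.runtime.Token t = antlrLexer.nextToken();
        if (t.getType() == org.antlr.v4.runtime.Token.EOF) {
            eofReturned = true;
            if (input.readLength() > 0) {
                // Antlr attached trailing whitespace to its EOF token;
                // it still needs a NetBeans token of its own
                return tokenFactory.createToken(MyTokenId.WHITESPACE);
            }
            return null;
        }
        return tokenFactory.createToken(idFor(t));
    }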

In a pinch, to debug this stuff (whether or not you're using ANTLR), add
some printlns in your lexer's nextToken() and see what you're really
getting.  In particular, you must have some code that *thinks* it knows it
is at EOF (prematurely) if your lexer is returning null.  If your lexer is
going to return null, then your LexerInput's read() method should return
-1.  If it doesn't, you are leaving some characters unprocessed and will
get the exception and message you posted.  So, when in that state, read the
remaining characters (if any) into a StringBuilder, log them to stdout, see
what they are and modify your grammar or whatever does the lexing to ensure
they really get processed.
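
For instance (a throwaway diagnostic, not production code):

    // Inside nextToken(), when you believe you are at EOF: drain whatever
    // is left and print the code points so invisible characters show up
    StringBuilder leftovers = new StringBuilder();
    int c;
    while ((c = input.read()) != LexerInput.EOF) {
        leftovers.append(String.format("U+%04X ", c));
    }
    if (leftovers.length() > 0) {
        System.out.println("Unprocessed: " + leftovers);
    }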

Then write some tests to drive your lexer with various horribly mangled
input (including 0-length) to ensure it never gets re-broken.

-Tim