Posted to dev@annotator.apache.org by GitBox <gi...@apache.org> on 2020/08/16 22:40:33 UTC

[GitHub] [incubator-annotator] tilgovi commented on issue #75: Support TextPositionSelector (in the dom package)

tilgovi commented on issue #75:
URL: https://github.com/apache/incubator-annotator/issues/75#issuecomment-674586963

> The text MUST be normalized before recording the annotation. Thus HTML/XML tags SHOULD be removed and character entities SHOULD be replaced with the character that they encode.

As long as we stick to `textContent` or `innerText`, we are covered here. Neither includes tags, and character entities are already replaced by the characters they encode.

> Possibly more problematic, can one even access the source html accurately enough through the DOM? Might a source parser have modified whitespace, thus leading to miscounts? I am not even talking about executed scripts that may modify the DOM too, I suppose we have to disregard that scenario.

We should be fine, at least for text nodes. Because the CSS `white-space` property can make whitespace significant for rendering, parsers need to preserve text nodes as is.

I think the spec is still somewhat vague and open to interpretation. I am partial to using `innerText` because it's the closest to the actual presentation. Regardless of what we choose, there is some work we can do that supports any decision and helps us handle characters made up of multiple code units.

Iterating over a string in JavaScript yields strings representing its code points (each iteration may yield a string with more than one code unit). As a result, one can also do `[...string]` and get an array of the code points. If we write a generic text position selector in terms of iteration over code points, then we can compose it with anything that generates such an iterator from any other source, like generating `innerText` from a DOM Node or Range. The simplest thing is to call it with `[...string]`.
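
To make the difference between code units and code points concrete, here is a minimal sketch. `sliceCodePoints` is a hypothetical helper, not an existing function in this project, and it assumes that `start` and `end` count code points, which is only one possible reading of the spec.

```javascript
// Hypothetical helper, for illustration only.
// Returns the text covering code points [start, end) from any iterable
// that yields one code point (as a string) per iteration.
function sliceCodePoints(codePoints, start, end) {
  let result = '';
  let index = 0;
  for (const codePoint of codePoints) {
    if (index >= end) break;
    if (index >= start) result += codePoint;
    index += 1;
  }
  return result;
}

const text = 'a😀b';

// `length` counts UTF-16 code units; the emoji takes two of them.
console.log(text.length); // 4

// Spreading iterates by code point, so the emoji counts once.
console.log([...text].length); // 3

// A string is itself iterable by code point, so it can be passed directly.
console.log(sliceCodePoints(text, 1, 2)); // '😀'
console.log(sliceCodePoints([...text], 1, 3)); // '😀b'
```

Because the helper only relies on iteration, the same function would work with any other source that can produce code points, not just a string held in memory.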

However, I think we should consider going a step further and writing a text selector that consumes an iterator that yields _chunks_ rather than receiving a full text with the initial call. This interface would be useful for streaming scenarios where the whole text may not be available or may be extremely large. The chunks themselves could be arrays or strings, and if we decide that they are strings we may wish to iterate over their code points.
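
A rough sketch of what the chunked variant could look like follows. The names `codePointsOf`, `selectText`, and `exampleChunks` are made up for illustration, chunks are assumed to be strings, and the sketch assumes that no chunk boundary falls in the middle of a surrogate pair (a real implementation would have to handle that case).

```javascript
// Hypothetical sketch, not the project's API.
// Flattens an iterable of string chunks into a stream of code points.
function* codePointsOf(chunks) {
  for (const chunk of chunks) {
    // Iterating a string yields one code point per step.
    yield* chunk;
  }
}

// Returns the text covering code points [start, end), pulling chunks
// lazily and stopping as soon as the end position has been reached.
function selectText(chunks, start, end) {
  let result = '';
  let index = 0;
  for (const codePoint of codePointsOf(chunks)) {
    if (index >= end) break;
    if (index >= start) result += codePoint;
    index += 1;
  }
  return result;
}

// Stand-in for a streaming source; records how many chunks were requested.
let chunksProduced = 0;
function* exampleChunks() {
  for (const chunk of ['Hello, ', 'w😀rld', '!']) {
    chunksProduced += 1;
    yield chunk;
  }
}

const selected = selectText(exampleChunks(), 7, 10);
console.log(selected); // 'w😀r'
console.log(chunksProduced); // 2, the final chunk was never requested
```

The selection spans only the second chunk, so the third chunk is never pulled from the source. That laziness is the property that would matter for very large or incomplete texts.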

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org