Posted to dev@annotator.apache.org by GitBox <gi...@apache.org> on 2020/09/16 18:03:40 UTC
[GitHub] [incubator-annotator] Treora commented on issue #85: ‘Chunking’ abstraction

Treora commented on issue #85:
URL: https://github.com/apache/incubator-annotator/issues/85#issuecomment-693569313


   > Currently, our text quote anchoring function (in the dom package) is hard-coded to search for text quote using Range, NodeIterator, TreeWalker. When using the chunk approach, this functionality should be composed of two parts: one generic text quote anchoring function that takes a stream of Chunks of text; and one dom-to-chunk converter that uses TreeWalkers and such to present the DOM as a stream of text Chunks.
   
   I started playing with this idea in the branch [chunking](https://github.com/apache/incubator-annotator/compare/chunking).
   
   In this first attempt, a `Chunk` is anything that has a `toString()` method:
   
   ```
   export interface Chunk {
     toString(): string;
   }
   ```
   
   And we can point at a part of the text using a straightforward generalisation of `Range`:
   
   ```
   export interface ChunkRange<TChunk extends Chunk> {
     startChunk: TChunk;
     startIndex: number;
     endChunk: TChunk;
     endIndex: number;
   }
   ```
   
   I made an abstracted version of the text quote matcher that accepts as its scope an `AsyncIterable<TChunk>`:
   
   ```
   export function abstractTextQuoteSelectorMatcher(
     selector: TextQuoteSelector,
   ): <TChunk extends Chunk>(textChunks: AsyncIterable<TChunk>) => AsyncIterable<ChunkRange<TChunk>> {
     // …
   }
   ```
   
   So one can throw any type of `TChunk` in, and get back ranges that use the same type. For the concrete implementation, I actually used `Range`s as the `TChunk` type, with each `Range` wrapping a single text node. We can’t just throw in the text nodes themselves, both because their `toString()` method does not return the text content, and because the first and last nodes might lie only partially within the scope.
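   
   To illustrate the "any `TChunk` in, same `TChunk` out" shape, here is a minimal sketch. It is not the code in the branch: `sketchExactMatcher`, `StringChunk` and `fromArray` are hypothetical names made up for this example, and it only looks for an exact string (no prefix/suffix handling), carrying partial matches across chunk boundaries.
   
   ```typescript
   interface Chunk {
     toString(): string;
   }
   
   interface ChunkRange<TChunk extends Chunk> {
     startChunk: TChunk;
     startIndex: number;
     endChunk: TChunk;
     endIndex: number;
   }
   
   // Hypothetical bookkeeping: a match that started in an earlier chunk
   // and has matched `matched` characters of the quote so far.
   interface PartialMatch<TChunk extends Chunk> {
     startChunk: TChunk;
     startIndex: number;
     matched: number;
   }
   
   // Hypothetical simplification of the matcher: exact text only.
   function sketchExactMatcher(
     exact: string,
   ): <TChunk extends Chunk>(textChunks: AsyncIterable<TChunk>) => AsyncIterable<ChunkRange<TChunk>> {
     return async function* <TChunk extends Chunk>(
       textChunks: AsyncIterable<TChunk>,
     ): AsyncGenerator<ChunkRange<TChunk>> {
       if (exact.length === 0) return;
       let partials: PartialMatch<TChunk>[] = [];
       for await (const chunk of textChunks) {
         const text = chunk.toString();
         const stillPartial: PartialMatch<TChunk>[] = [];
         // 1. Try to continue matches that began in earlier chunks.
         for (const p of partials) {
           const remaining = exact.slice(p.matched);
           if (text.length >= remaining.length) {
             if (text.startsWith(remaining)) {
               yield {
                 startChunk: p.startChunk,
                 startIndex: p.startIndex,
                 endChunk: chunk,
                 endIndex: remaining.length,
               };
             }
           } else if (remaining.startsWith(text)) {
             stillPartial.push({ ...p, matched: p.matched + text.length });
           }
         }
         // 2. Look for matches that begin inside this chunk.
         for (let i = 0; i < text.length; i++) {
           const rest = text.slice(i);
           if (rest.length >= exact.length) {
             if (rest.startsWith(exact)) {
               yield { startChunk: chunk, startIndex: i, endChunk: chunk, endIndex: i + exact.length };
             }
           } else if (exact.startsWith(rest)) {
             // The chunk ends mid-quote; remember it for the next chunk.
             stillPartial.push({ startChunk: chunk, startIndex: i, matched: rest.length });
           }
         }
         partials = stillPartial;
       }
     };
   }
   
   // Hypothetical stand-in for a Range wrapping a text node.
   class StringChunk implements Chunk {
     text: string;
     constructor(text: string) {
       this.text = text;
     }
     toString(): string {
       return this.text;
     }
   }
   
   // Helper to present an array as an AsyncIterable.
   async function* fromArray<T>(items: T[]): AsyncGenerator<T> {
     for (const item of items) yield item;
   }
   ```
   
   Feeding it `StringChunk`s yields `ChunkRange<StringChunk>`s; feeding it `Range`s (whose `toString()` does return the contained text) would yield `ChunkRange<Range>`s with no change to the matcher.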
   
   For text quote matching this works fine (all tests pass). For describing a text quote, however, it would be helpful to have more freedom to navigate the text, instead of only walking through it in a single pass (especially to find prefixes). I suppose our scope should have an API that is more like a `TreeWalker` than like a `NodeIterator`: jump to any spot, and then walk in either direction.
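   
   As a rough sketch of what such a walkable scope might look like (purely an assumption for discussion; `ChunkWalker`, `ArrayChunkWalker`, `SimpleChunk` and `textBefore` are hypothetical names, not anything in the branch), loosely modelled on `TreeWalker`'s settable `currentNode`:
   
   ```typescript
   interface Chunk {
     toString(): string;
   }
   
   // Hypothetical interface: jump by assigning currentChunk, then step either way.
   interface ChunkWalker<TChunk extends Chunk> {
     currentChunk: TChunk;
     nextChunk(): TChunk | null;
     previousChunk(): TChunk | null;
   }
   
   // Simplest possible backing store: an in-memory array of chunks.
   class ArrayChunkWalker<TChunk extends Chunk> implements ChunkWalker<TChunk> {
     private chunks: readonly TChunk[];
     private index: number;
   
     constructor(chunks: readonly TChunk[], start: TChunk) {
       this.chunks = chunks;
       this.index = 0;
       this.currentChunk = start; // goes through the setter below
     }
   
     get currentChunk(): TChunk {
       return this.chunks[this.index];
     }
   
     set currentChunk(chunk: TChunk) {
       const index = this.chunks.indexOf(chunk);
       if (index === -1) throw new RangeError('Chunk is not part of this scope');
       this.index = index;
     }
   
     // Move forward; returns null (and stays put) at the end of the scope.
     nextChunk(): TChunk | null {
       if (this.index + 1 >= this.chunks.length) return null;
       this.index += 1;
       return this.currentChunk;
     }
   
     // Move backward; returns null (and stays put) at the start of the scope.
     previousChunk(): TChunk | null {
       if (this.index === 0) return null;
       this.index -= 1;
       return this.currentChunk;
     }
   }
   
   // Collect up to `length` characters preceding position `startIndex` in the
   // walker's current chunk, walking backwards as needed (e.g. to find a prefix).
   // Note: this moves the walker.
   function textBefore<TChunk extends Chunk>(
     walker: ChunkWalker<TChunk>,
     startIndex: number,
     length: number,
   ): string {
     let text = walker.currentChunk.toString().slice(0, startIndex);
     while (text.length < length && walker.previousChunk() !== null) {
       text = walker.currentChunk.toString() + text;
     }
     return text.slice(Math.max(0, text.length - length));
   }
   
   // Hypothetical stand-in for a Range wrapping a text node.
   class SimpleChunk implements Chunk {
     text: string;
     constructor(text: string) {
       this.text = text;
     }
     toString(): string {
       return this.text;
     }
   }
   ```
   
   With something like this, describing a quote could jump to the chunk where the match starts and read backwards for a prefix, which a single forward pass over an `AsyncIterable` cannot do.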
   
   @tilgovi: Any thoughts about the approach to try, before I go further down this rabbit hole?

