You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@annotator.apache.org by Randall Leeds <ra...@apache.org> on 2019/06/14 13:29:10 UTC

Re: Experiments with quote anchoring performance and flexibility

This is great stuff, Robert. Thanks for sharing.

On Tue, May 28, 2019, 06:10 Robert Knight <ro...@hypothes.is> wrote:

>
> At the moment this is still in the experimentation phase, but if it pans
> out well, I think they could be a good fit for the Apache Annotator
> project. The string matching implementation is fairly solid and offers
> significant performance improvements over diff-match-patch (see the README
> of the anchor-quote project for some early benchmarks). The quote anchoring
> library is very much in early development.


With what you know from this implementation and the research into
algorithms you've done for it, I'm wondering if you have any early
intuitions about API design questions I've been pondering.

So far, I've been ensuring that the selector API is incremental and
asynchronous. I've also made sure to support returning multiple matches.
For these reasons the selectors are asynchronous generators.

I have not explicitly addressed sharing state between selectors. Some
algorithms may benefit from doing a parallel search for multiple patterns
in a single pass. I noticed your API accepts multiple selectors, although
it seems to treat them independently. I was wondering what considerations
you had in mind when you made that API design decision.

The API on master here currently separates the parse operation that takes a
selector string/object, returning a scoped match function. This provides an
opportunity to close over shared state held by the parser/constructor.

The parse step does not initially scope the selector to any context. It's
not possible to know, until the match function is invoked, that multiple
selectors are applicable to the same scope. I suspect this concern can be
lifted by binding the parser to a root scope. Therefore, I don't think
there's any action here, but I figured I'd mention it.

Another thing which I have considered and, again, so far made no explicit
affordance for, is threading a deadline or abort controller through the
call stack so that match operations can be cancelable.

If you have any strong reactions to these thoughts, I'd be curious to hear
them.

I'd love to find some alignment that allows you to bring this work into the
tree and invite you to contribute to directly.

I'd also consider adopting typescript, if that is appealing to the group.
That's probably a very easy change to our babel setup.

Re: Experiments with quote anchoring performance and flexibility

Posted by Robert Knight <ro...@hypothes.is>.
Hi Randall,

Thanks for the feedback, and apologies for not replying sooner.

> With what you know from this implementation and the research into
> algorithms you've done for it, I'm wondering if you have any early
> intuitions about API design questions I've been pondering.

In terms of the API design, I have learned a few things based on my prototyping so far:

- Even with very efficient and well-optimised algorithms, the process of anchoring hundreds of quotes in a document with 100s of KB of text can take 100s of milliseconds (or even seconds if the document is a large book), so the API should indeed be asynchronous.

- There are opportunities for efficiency gains by anchoring a batch of selectors at once, rather than one at a time. One easy win is to avoid building up the mapping between DOM nodes and text offsets multiple times but instead built it only once and re-use it while searching for multiple quotes. Searching for multiple patterns at a time may also offer speed improvements, although it depends on the error rate allowed (proportion of mismatch between the quote and the text). The current code on GitHub builds up the DOM => text offset mapping once and then searches for each quote one at a time. I have locally done some experiments with searching for multiple patterns at once and was able to obtain a speedup of 30-40%, but this involved some careful tuning of various parameters for the particular dataset. It isn’t obvious yet whether this can be made robust enough to deliver a benefit when the text/selector set is not known in advance.

- I would like to avoid hanging the browser UI during the anchoring process, therefore any step which can take more than say ~100ms would ideally be either incremental or possible to do in a worker. As long as the API is async, they don’t have to care which of these methods is used. Building the DOM => text offset mapping has to be done on the main thread. Once that is done, the actual pattern-matching step can be done in a worker.

- Since the process can take > 500ms, cancellation might be well useful, although I haven’t implemented this yet.

- An important feature of the prototypes is that they allow for customisable normalisation of the text and quotes before matching, together with mapping the matches in the normalised text back to locations in the original text. Normalization can include case-folding, ignoring certain characters etc. The motivation for this is primarily to help with anchoring annotations following a forthcoming PDF.js upgrade in the Hypothesis client / Via / browser extension which will significantly affect the way text is extracted for some PDFs.

> I'd love to find some alignment that allows you to bring this work into the
> tree and invite you to contribute to directly.

I’d like to move the code to a shared space in future. So far I’ve been focused on prototyping to learn about the problem space and figuring out how to meet Hypothesis’s product needs. There is still quite a bit of that to do. I will try to familiarise myself with the work that has already been done in the Apache Annotator repo to understand how this work might fit into what is there so far.

> I'd also consider adopting typescript, if that is appealing to the group.
> That's probably a very easy change to our babel setup.

I do think the group should consider this. I’ve found that types offer significant benefits for readability and maintainability of libraries, and it benefits consumers in terms of providing machine-readable interfaces (for consumers using TypeScript or Flow, code completion in IDEs etc., or even just humans reading the source directly)

Kind Regards,
Robert.


> On 14 Jun 2019, at 14:29, Randall Leeds <ra...@apache.org> wrote:
> 
> This is great stuff, Robert. Thanks for sharing.
> 
> On Tue, May 28, 2019, 06:10 Robert Knight <ro...@hypothes.is> wrote:
> 
>> 
>> At the moment this is still in the experimentation phase, but if it pans
>> out well, I think they could be a good fit for the Apache Annotator
>> project. The string matching implementation is fairly solid and offers
>> significant performance improvements over diff-match-patch (see the README
>> of the anchor-quote project for some early benchmarks). The quote anchoring
>> library is very much in early development.
> 
> 
> With what you know from this implementation and the research into
> algorithms you've done for it, I'm wondering if you have any early
> intuitions about API design questions I've been pondering.
> 
> So far, I've been ensuring that the selector API is incremental and
> asynchronous. I've also made sure to support returning multiple matches.
> For these reasons the selectors are asynchronous generators.
> 
> I have not explicitly addressed sharing state between selectors. Some
> algorithms may benefit from doing a parallel search for multiple patterns
> in a single pass. I noticed your API accepts multiple selectors, although
> it seems to treat them independently. I was wondering what considerations
> you had in mind when you made that API design decision.
> 
> The API on master here currently separates the parse operation that takes a
> selector string/object, returning a scoped match function. This provides an
> opportunity to close over shared state held by the parser/constructor.
> 
> The parse step does not initially scope the selector to any context. It's
> not possible to know, until the match function is invoked, that multiple
> selectors are applicable to the same scope. I suspect this concern can be
> lifted by binding the parser to a root scope. Therefore, I don't think
> there's any action here, but I figured I'd mention it.
> 
> Another thing which I have considered and, again, so far made no explicit
> affordance for, is threading a deadline or abort controller through the
> call stack so that match operations can be cancelable.
> 
> If you have any strong reactions to these thoughts, I'd be curious to hear
> them.
> 
> I'd love to find some alignment that allows you to bring this work into the
> tree and invite you to contribute to directly.
> 
> I'd also consider adopting typescript, if that is appealing to the group.
> That's probably a very easy change to our babel setup.


Re: Experiments with quote anchoring performance and flexibility

Posted by Randall Leeds <ra...@apache.org>.
On Fri, Jun 14, 2019, 09:29 Randall Leeds <ra...@apache.org> wrote:

> The API on master here currently separates the parse operation that takes
> a selector string/object, returning a scoped match function. This provides
> an opportunity to close over shared state held by the parser/constructor.
>
> The parse step does not initially scope the selector to any context. It's
> not possible to know, until the match function is invoked, that multiple
> selectors are applicable to the same scope. I suspect this concern can be
> lifted by binding the parser to a root scope. Therefore, I don't think
> there's any action here, but I figured I'd mention it.
>

For reference, here's the current TextQuoteSelector:

https://github.com/apache/incubator-annotator/blob/da7c578d3030a2374a374707f3f2d86f6b745669/packages/text/src/index.js

Note that there is no approximate matching implemented.

It is implemented in terms on strings only and not text nodes. I've been
trying to decide how to bridge between context formats like DOM nodes,
ranges, and strings. I like your string normalization step. I think it
could also address unicode code point vs code unit.