Posted to dev@annotator.apache.org by GitBox <gi...@apache.org> on 2020/07/16 16:39:07 UTC

[GitHub] [incubator-annotator] Treora opened a new issue #83: Fuzzy text quote matching

Treora opened a new issue #83:
URL: https://github.com/apache/incubator-annotator/issues/83


   Many annotation tools want to match a quote even when the quoted text has been modified slightly, but we have yet to implement this.
   
   Approximate/fuzzy string matching could be an option on our existing implementation (perhaps a parameter that sets how fuzzy the match may be, where 0 means exact matching); alternatively, it could be exposed as a separate implementation.
   
   A second question is whether the matcher should return information about the quality of the match and, if so, what the API would look like. I suppose a match object could have an extra attribute expressing the ‘match quality’, though in the case of refined/range selectors we would need to figure out how to propagate this information.
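   
   To make these two questions concrete, here is a minimal TypeScript sketch of what such an API could look like. Everything in it is an assumption made for illustration: the names `QuoteMatch`, `findBestMatch` and `fuzziness`, the choice to express fuzziness as an allowed error rate relative to the quote length, and the brute-force search itself. None of this exists in the current implementation.

```typescript
// Hypothetical result shape: a text range plus a 'match quality' in [0, 1].
interface QuoteMatch {
  start: number;   // offset of the first matched character
  end: number;     // offset just past the last matched character
  quality: number; // 1 = exact match; lower = more edits were needed
}

// Plain Levenshtein distance (insertions, deletions, substitutions).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Brute-force search over all candidate substrings, for illustration only.
// `fuzziness` is the allowed number of errors as a fraction of the quote
// length; the default of 0 accepts exact matches only.
function findBestMatch(text: string, quote: string, fuzziness = 0): QuoteMatch | null {
  if (quote.length === 0) return null;
  const maxErrors = Math.floor(quote.length * fuzziness);
  let best: QuoteMatch | null = null;
  for (let start = 0; start < text.length; start++) {
    const minLen = Math.max(1, quote.length - maxErrors);
    const maxLen = Math.min(text.length - start, quote.length + maxErrors);
    for (let len = minLen; len <= maxLen; len++) {
      const errors = levenshtein(text.slice(start, start + len), quote);
      if (errors > maxErrors) continue;
      const quality = 1 - errors / quote.length;
      // Keep the highest-quality candidate; ties go to the earliest one.
      if (best === null || quality > best.quality) {
        best = { start, end: start + len, quality };
      }
    }
  }
  return best;
}
```

   A caller that wants today's behaviour would leave `fuzziness` at 0, while something like `findBestMatch(text, quote, 0.4)` tolerates a few edits and reports how good the match was. A real matcher would presumably yield every match rather than only the best one, and would need a much faster algorithm than this.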
   
   Prior art we could borrow from:
   - https://github.com/tilgovi/dom-anchor-text-quote (uses diff-match-patch)
   - https://github.com/robertknight/approx-string-match-js/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-annotator] robertknight edited a comment on issue #83: Fuzzy text quote matching

Posted by GitBox <gi...@apache.org>.
robertknight edited a comment on issue #83:
URL: https://github.com/apache/incubator-annotator/issues/83#issuecomment-751125582


   > CC @robertknight (happy to hear if you have more research notes, experimental results or other relevant tips from your experience developing this!)
   
   Aside from taking ideas from Hypothesis's technical implementation, which you've already posted pointers to, the other resource from Hypothesis I would suggest making use of is the datasets of annotations in the "Public" channel. Here are some I found useful for testing quote matching performance and accuracy:
   
   - The American Yawp project: http://www.americanyawp.com. In particular, the early chapters have a lot of public annotations
   - Annotations on Wikipedia: `https://hypothes.is/search?q=url%3Ahttps%3A%2F%2Fen.wikipedia.org%2F*`. In particular, check out articles which have many annotations made on older versions (say from 2018 or earlier) and have had many edits since then
   
   The new quote matching implementation in Hypothesis has a couple of areas where we've noticed matching quality can be improved:
   
   1. It can find spurious matches for short quotes (in particular, those of one or two words). In [the PR](https://github.com/hypothesis/client/pull/2779) I mention a couple of examples.
   2. When the match is not exact, alignment can be sub-optimal. Looking at public Hypothesis annotations on http://www.americanyawp.com/text/01-the-new-world/, for example, you can find cases where the Hypothesis client draws highlights that start or end in unlikely places (e.g. in the middle of a word).
   
   Related to point (1), one of the goals of the new implementation was to try to make it easier for other Hypothesis developers and staff to understand how exactly the "fuzzy" aspect of "fuzzy matching" works. The thinking is that if it is imperfect, then there is value in at least being predictable.
   
   In terms of performance, the new implementation is indeed a lot faster in the worst case, where there are many selectors that either do not match at all or match only with significant edits. The actual approximate string matching code is pretty well optimized at this point. The next lowest-hanging fruit is optimizing the extraction of text from the document and the mapping between text positions and DOM (node, offset) points. If we find that we need significant improvements over the current implementation in the future, we'd likely need to do some offline processing of the document text before searching for matches.
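   
   As a rough illustration of the position-mapping step, here is a small TypeScript sketch. It is not taken from the Hypothesis client: plain strings stand in for DOM text nodes, and the names `TextPiece`, `OffsetIndex`, `buildIndex` and `resolveOffset` are made up for this example. The idea is to extract the text once, remember where each piece starts, and then resolve any text offset back to a (piece, offset) pair with a binary search.

```typescript
// Stand-in for a DOM text node: only its text content matters here.
interface TextPiece {
  text: string;
}

// The concatenated text plus the start offset of every piece within it.
interface OffsetIndex {
  text: string;
  starts: number[];
}

// Analogue of a DOM (node, offset) point, using a piece number for the node.
interface PiecePoint {
  piece: number;
  offset: number;
}

// Walk the pieces once, building the full text and recording each start.
function buildIndex(pieces: TextPiece[]): OffsetIndex {
  const starts: number[] = [];
  let text = "";
  for (const piece of pieces) {
    starts.push(text.length);
    text += piece.text;
  }
  return { text, starts };
}

// Binary search for the last piece whose start offset is <= `offset`.
// Returns null when the offset lies outside the extracted text.
function resolveOffset(index: OffsetIndex, offset: number): PiecePoint | null {
  if (index.starts.length === 0 || offset < 0 || offset > index.text.length) {
    return null;
  }
  let lo = 0;
  let hi = index.starts.length - 1;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (index.starts[mid] <= offset) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { piece: lo, offset: offset - index.starts[lo] };
}
```

   Building the index costs one pass over the document, after which each match boundary is resolved in logarithmic time instead of by re-walking the document for every selector.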


[GitHub] [incubator-annotator] Treora commented on issue #83: Fuzzy text quote matching

Posted by GitBox <gi...@apache.org>.
Treora commented on issue #83:
URL: https://github.com/apache/incubator-annotator/issues/83#issuecomment-750943153


   > Prior art we could borrow from:
   > 
   > * https://github.com/tilgovi/dom-anchor-text-quote (uses diff-match-patch)
   > * https://github.com/robertknight/approx-string-match-js/
   
   Update: @judell kindly informed us that Hypothesis has now switched from the former to the latter. See https://github.com/hypothesis/client/pull/2814 and https://github.com/hypothesis/client/pull/2779
   
   I’d be eager to see how they compare (at least it’s supposed to be much faster now!), what could still be improved, etc.
   
   Some observations from looking at how exactly approx-string-match-js is being used in H:
   
   - A nifty choice is that its new implementation [gives a score using weights][1] such that it is fuzzier when matching the prefix and suffix than when matching the exact quote.
   - It also accepts a hint for the position in the text where the quote is expected, penalising matches that are further away from that position; this elegantly enables combining the information from a TextQuoteSelector and a TextPositionSelector.
   - The score for each matched string (before weighting) seems to be [calculated][2] as 1 minus the number of errors (i.e. its Levenshtein distance) normalised by the string’s length. This makes sense, though I wonder if it might cause e.g. a one-character prefix to have an unfairly heavy influence on the match score (perhaps not significant, but dropping this thought here for later).
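   
   To make the scoring described in these observations easier to reason about, here is a small TypeScript sketch of how I read it. It is my own reconstruction, not the Hypothesis code: the function names, the linear fall-off for the position hint and the weight values are illustrative assumptions, and the linked source is the authority on the real ones.

```typescript
// Score of one matched string: 1 minus its error count (Levenshtein distance)
// normalised by the string's length, clamped to [0, 1].
function similarity(errors: number, length: number): number {
  if (length === 0) return 1; // nothing to match, so nothing can be wrong
  return Math.max(0, 1 - errors / length);
}

// Score for the position hint: 1 at the expected offset, falling off linearly
// with distance. The linear fall-off is an assumption of this sketch.
function positionScore(matchStart: number, expectedStart: number, textLength: number): number {
  if (textLength === 0) return 1;
  return Math.max(0, 1 - Math.abs(matchStart - expectedStart) / textLength);
}

// Illustrative weights only: the quote dominates, so errors in the prefix and
// suffix cost less, and the position hint acts as a mild tie-breaker.
const WEIGHTS = { quote: 50, prefix: 20, suffix: 20, position: 2 };

interface ScoreParts {
  quote: number;
  prefix: number;
  suffix: number;
  position: number;
}

// Weighted average of the four component scores, in [0, 1].
function weightedScore(parts: ScoreParts): number {
  const total = WEIGHTS.quote + WEIGHTS.prefix + WEIGHTS.suffix + WEIGHTS.position;
  return (
    WEIGHTS.quote * parts.quote +
    WEIGHTS.prefix * parts.prefix +
    WEIGHTS.suffix * parts.suffix +
    WEIGHTS.position * parts.position
  ) / total;
}

// One wrong character in a one-character prefix zeroes that whole component...
const shortPrefix = weightedScore({
  quote: similarity(0, 20),
  prefix: similarity(1, 1),
  suffix: similarity(0, 10),
  position: positionScore(100, 100, 1000),
});

// ...whereas the same single error in a ten-character prefix barely registers.
const longPrefix = weightedScore({
  quote: similarity(0, 20),
  prefix: similarity(1, 10),
  suffix: similarity(0, 10),
  position: positionScore(100, 100, 1000),
});

console.log(shortPrefix.toFixed(3), longPrefix.toFixed(3)); // prints "0.783 0.978"
```

   With these made-up weights, the one-character prefix case loses the full prefix weight for a single edit, which is the effect I was wondering about above; whether it matters in practice would depend on the real weights and on how short prefixes tend to be.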
   
   CC @robertknight (happy to hear if you have more research notes, experimental results or other relevant tips from your experience developing this!)
   
   [1]: https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L109-L145
   [2]: https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L55-L64

