You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@allura.apache.org by Heith Seewald <hs...@slashdotmedia.com> on 2015/07/13 17:04:22 UTC
[allura:tickets] #7925 Speed up diff processing with binary files
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:04 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
- **status**: in-progress --> review
- **Reviewer**: Dave Brondsema --> Heith Seewald
- **Comment**:
Fixes on `db/7925` on allura and forgehg repos. Followup ticket [#7949] for a few items.
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Fri Jul 31, 2015 02:50 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **labels**: sf-2, sf-current, performance --> sf-current, performance, sf-4
- **status**: review --> closed
- **Comment**:
The changes you made looked really good and over all much cleaner.
Nice work!
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** closed
**Milestone:** unreleased
**Labels:** sf-current performance sf-4
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Wed Aug 05, 2015 04:09 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **status**: in-progress --> review
- **Comment**:
QA: **hs/7925**
Binary files should no longer make XHR requests for diff processing.
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:53 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**: --> sf-2, sf-current
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:04 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
And this ticket will also resolve [#7918] too. (Although I think the `[:]` loop issue needs causing a minor bug still)
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 30, 2015 10:31 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Heith Seewald <hs...@slashdotmedia.com>.
These are great notes.
I was on the fence about *--find-copies-harder*. I ended up using it because my testing showed slightly better results when detecting copies, but I did not consider (or test for) false positives.
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 30, 2015 10:38 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
- **status**: review --> in-progress
- **Comment**:
The results here are great. Including the repo refresh backend logic. But it is several changes and some quite big changes, and so naturally there's a good handful of tweaks needed to polish it up:
#### general
* Now the commit view doesn't show binary diffs, good. But the table listing all the files has binary files linked up still and the links don't go anywhere.
* Can you add a test for the `has_html_view` method's new functionality for fast binary detection?
* "refresh" logic is fast now too, yay!
* I guess this should be a separate ticket, but it'd be nice to sort by filename across all change types, instead of showing adds, then removes, etc. Maybe same ticket as displaying copies vs renames better.
* Down in the diff list, it says "File was copied or renamed." We should be able to say exactly which now.
* A rename shows up as `{'new': u'README.txt', 'old': u'README', 'diff': '', 'ratio': 1}` in the diff section and also says `Can't load diff`
* Is it ok that we set diff to `''` in many places?
#### hg & svn
* The `[:]` slice would be better on the `for` loop than the `if` line right?
#### hg
* cleanup: move imports to top of file
#### git
* Testing with walrustech repo, in the 2nd commit, only the `Flan` dir shows up as having changes. Nothing shown for `options.txt` or `bin/` or `mods/` but they did have changes. You can see this with ?limit=1000. And if you use the default limit, the pages at the end are all blank.
* I think we don't want to use `--find-copies-harder`
* Performance wise on a big repo my timing measurement is 0m0.035s without it and 0m0.135s with it. Noticable but not huge
* A bigger impact is the semantics of it. It can make an incorrect association of files being "copied" if the contents are common contents. A very good example of common contents is no content, an empty file. I've found a diff that says one `__init__.py` file was copied to another, but really it's just a new file. And another file that is new but has a lot of test boilerplate so git thinks its a 56% similar copy. Thus I think we should drop `--find-copies-harder`
* After doing a straight copy or rename in git and committing it, I get:
~~~~
File '/home/dbrondsema/dbrondsema-1019/forge/ForgeGit/forgegit/model/git_repo.py', line 682 in paged_diffs
for i in xrange(0, result['total'] + 1, 2)]
IndexError: list index out of range
~~~~
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 27, 2015 08:28 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**: sf-2, sf-current --> sf-2, sf-current, performance
- **status**: review --> in-progress
- **Reviewer**: Dave Brondsema
- **Comment**:
I think we need to skip binaries sooner, not just on the display side. Server-side time plus the background task to "refresh" the repo takes forever still. Like inside the `_diffs_copied` method.
Also we can do better text detection, using the existing `has_html_view` method. It checks several things to determine if its text. Might want to rename the method, or alias it, though.
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 16, 2015 07:10 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **status**: in-progress --> review
- **Comment**:
Good notes. Based on your feedback I refactored paged_diffs to now rely on the SCM system.
QA at:
hs/7925
&
hs/7925 on forgehg
Other notes:
Git has a few other interesting options for tweaking performance. For example -- we could use a diff processing threshold when searching for copies.
We also could further improve the visual indicators when displaying copies vs renames etc (but that may be better in another ticket).
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Tue Jul 21, 2015 09:29 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] #7925 Speed up diff processing with binary files
Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**: sf-current, performance, sf-4 --> performance, sf-4
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** closed
**Milestone:** unreleased
**Labels:** performance sf-4
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Aug 10, 2015 01:42 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
[allura:tickets] Re: #7925 Speed up diff processing with binary files
Posted by Heith Seewald <hs...@slashdotmedia.com>.
We could specify the max number for the -C option. We could also make this configurable via the ini.
**git diff-tree:**
**-l***<num>*
*The -M and -C options require O(n^2) processing time where n is the number of potential rename/copy targets. This option prevents rename/copy detection from running if the number of rename/copy targets exceeds the specified number.*
---
** [tickets:#7925] Speed up diff processing with binary files**
**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 27, 2015 08:28 PM UTC
**Owner:** Heith Seewald
In a git repo with a large amount of binary files, our diff processing can be very inefficient. We should test if a file is binary and exclude it from the diff processing section.
---
Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/
To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.