Posted to dev@allura.apache.org by Heith Seewald <hs...@slashdotmedia.com> on 2015/07/13 17:04:22 UTC

[allura:tickets] #7925 Speed up diff processing with binary files



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:04 PM UTC
**Owner:** Heith Seewald

In a git repo with a large number of binary files, our diff processing can be very inefficient. We should test whether a file is binary and, if so, exclude it from the diff processing section.


---

Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/

To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options.  Or, if this is a mailing list, you can unsubscribe from the mailing list.

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
- **status**: in-progress --> review
- **Reviewer**: Dave Brondsema --> Heith Seewald
- **Comment**:

Fixes on `db/7925` on allura and forgehg repos.  Followup ticket [#7949] for a few items.



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Fri Jul 31, 2015 02:50 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **labels**: sf-2, sf-current, performance --> sf-current, performance, sf-4
- **status**: review --> closed
- **Comment**:

The changes you made look really good and are overall much cleaner.

Nice work!



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** closed
**Milestone:** unreleased
**Labels:** sf-current performance sf-4 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Wed Aug 05, 2015 04:09 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **status**: in-progress --> review
- **Comment**:

QA: **hs/7925**

Binary files should no longer make XHR requests for diff processing.



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:53 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**:  --> sf-2, sf-current



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 13, 2015 03:04 PM UTC
**Owner:** Heith Seewald

---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
This ticket will also resolve [#7918]. (Although I think the `[:]` loop issue is still causing a minor bug.)


---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 30, 2015 10:31 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Heith Seewald <hs...@slashdotmedia.com>.
These are great notes.

I was on the fence about *--find-copies-harder*.  I ended up using it because my testing showed slightly better results when detecting copies, but I did not consider (or test for) false positives.


---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 30, 2015 10:38 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
- **status**: review --> in-progress
- **Comment**:

The results here are great, including the repo refresh backend logic.  But there are several changes, some of them quite big, so naturally a good handful of tweaks are needed to polish it up:

#### general
* Now the commit view doesn't show binary diffs, good.  But the table listing all the files has binary files linked up still and the links don't go anywhere.
* Can you add a test for the `has_html_view` method's new functionality for fast binary detection?
* "refresh" logic is fast now too, yay!
* I guess this should be a separate ticket, but it'd be nice to sort by filename across all change types, instead of showing adds, then removes, etc.  Maybe same ticket as displaying copies vs renames better.
* Down in the diff list, it says "File was copied or renamed."  We should be able to say exactly which now.
* A rename shows up as `{'new': u'README.txt', 'old': u'README', 'diff': '', 'ratio': 1}` in the diff section and also says `Can't load diff`
    * Is it ok that we set diff to `''` in many places?

#### hg & svn
* The `[:]` slice would be better on the `for` loop than on the `if` line, right?
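
To illustrate the `[:]` point above, here is a minimal sketch (not the actual hg/svn code; `drop_binary_paths` and `is_binary` are hypothetical names) of why the slice belongs on the `for` line when the loop body removes items from the list it is walking:

```python
def drop_binary_paths(paths, is_binary):
    """Remove binary paths from `paths` in place and return the list."""
    # Iterate over a shallow copy (paths[:]) so that removing items
    # from the original list does not make the loop skip elements.
    for path in paths[:]:
        if is_binary(path):
            paths.remove(path)
    return paths


files = ['a.png', 'b.png', 'README', 'c.png']
print(drop_binary_paths(files, lambda p: p.endswith('.png')))
```

Without the slice, removing `a.png` shifts `b.png` into the slot the loop has already visited, so `b.png` would survive the filter.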

#### hg
* cleanup: move imports to top of file

#### git
* Testing with walrustech repo, in the 2nd commit, only the `Flan` dir shows up as having changes.  Nothing shown for `options.txt` or `bin/` or `mods/` but they did have changes.  You can see this with ?limit=1000.  And if you use the default limit, the pages at the end are all blank.
* I think we don't want to use `--find-copies-harder`
    * Performance-wise, on a big repo my timing measurement is 0m0.035s without it and 0m0.135s with it.  Noticeable but not huge.
    * A bigger impact is the semantics of it.  It can make an incorrect association of files being "copied" if the contents are common contents.  A very good example of common contents is no content, an empty file.  I've found a diff that says one `__init__.py` file was copied to another, but really it's just a new file.  And another file that is new but has a lot of test boilerplate, so git thinks it's a 56% similar copy.  Thus I think we should drop `--find-copies-harder`.
* After doing a straight copy or rename in git and committing it, I get:

~~~~
File '/home/dbrondsema/dbrondsema-1019/forge/ForgeGit/forgegit/model/git_repo.py', line 682 in paged_diffs
  for i in xrange(0, result['total'] + 1, 2)]
IndexError: list index out of range
~~~~
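
On the copy-versus-rename points in the notes above: git reports the change type and a similarity score in its `--name-status` output (for example `R100` for an exact rename, `C056` for a 56% similar copy). The following is only a sketch of how that output could be parsed, not the code in `git_repo.py`; the function name, dict keys and sample paths are assumptions for illustration.

```python
def parse_name_status(output):
    """Parse tab-separated `--name-status` lines into a list of dicts."""
    changes = []
    for line in output.splitlines():
        fields = line.split('\t')
        if len(fields) < 2:
            # Skip blank lines and lines without paths (e.g. a commit id).
            continue
        status = fields[0]
        kind = status[:1]
        if kind in ('R', 'C') and len(fields) >= 3:
            # Renames and copies carry a similarity score and two paths.
            score = status[1:]
            changes.append({
                'type': 'rename' if kind == 'R' else 'copy',
                'old': fields[1],
                'new': fields[2],
                'similarity': int(score) if score.isdigit() else None,
            })
        else:
            changes.append({'type': kind, 'old': None,
                            'new': fields[1], 'similarity': None})
    return changes


sample = ('R100\tREADME\tREADME.txt\n'
          'C056\ttests/test_a.py\ttests/test_b.py\n'
          'A\tpkg/__init__.py\n')
for change in parse_name_status(sample):
    print(change)
```

With the type kept per entry, the diff list could say "renamed" or "copied" explicitly, and low-similarity copies could be filtered out if they turn out to be false positives.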





---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 27, 2015 08:28 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**: sf-2, sf-current --> sf-2, sf-current, performance
- **status**: review --> in-progress
- **Reviewer**: Dave Brondsema
- **Comment**:

I think we need to skip binaries sooner, not just on the display side, for example inside the `_diffs_copied` method.  Server-side time plus the background task to "refresh" the repo still takes forever.

We can also do better text detection by using the existing `has_html_view` method, which checks several things to determine whether a file is text.  We might want to rename the method, or alias it, though.
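
As a rough illustration of a fast binary check (this is not what `has_html_view` does; `looks_binary` and the 8000-byte sample size are assumptions), a common heuristic is to read only the start of the file and look for a NUL byte:

```python
import io


def looks_binary(fileobj, sample_size=8000):
    """Guess whether a binary-mode file-like object holds binary data.

    Only the first `sample_size` bytes are read, so the check stays
    cheap even for very large blobs.
    """
    sample = fileobj.read(sample_size)
    return b'\0' in sample


print(looks_binary(io.BytesIO(b'plain text\n')))
print(looks_binary(io.BytesIO(b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR')))
```

A check like this can give false negatives for binary formats that happen to contain no NUL bytes near the start, so it would complement, not replace, extension or MIME-type checks.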



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Thu Jul 16, 2015 07:10 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Heith Seewald <hs...@slashdotmedia.com>.
- **status**: in-progress --> review
- **Comment**:

Good notes.  Based on your feedback I refactored paged_diffs to now rely on the SCM system.

QA at: hs/7925 & hs/7925 on forgehg


Other notes:
Git has a few other interesting options for tweaking performance.  For example, we could use a diff processing threshold when searching for copies.

We also could further improve the visual indicators when displaying copies vs renames etc (but that may be better in another ticket).



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Tue Jul 21, 2015 09:29 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] #7925 Speed up diff processing with binary files

Posted by Dave Brondsema <da...@brondsema.net>.
- **labels**: sf-current, performance, sf-4 --> performance, sf-4



---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** closed
**Milestone:** unreleased
**Labels:** performance sf-4 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Aug 10, 2015 01:42 PM UTC
**Owner:** Heith Seewald


---

[allura:tickets] Re: #7925 Speed up diff processing with binary files

Posted by Heith Seewald <hs...@slashdotmedia.com>.
We could cap the number of rename/copy targets that the `-C` option considers by using the `-l` option.  We could also make this limit configurable via the ini.


**git diff-tree:**

**-l***<num>*

*The -M and -C options require O(n^2) processing time where n is the number of potential rename/copy targets. This option prevents rename/copy detection from running if the number of rename/copy targets exceeds the specified number.*
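
A minimal sketch of how such a limit might be wired in, assuming a hypothetical ini key `scm.git.rename_limit` and a hypothetical helper `diff_tree_args` (neither exists in Allura as far as this thread shows). It only builds the argument list; running it is left to whatever subprocess wrapper is already in use.

```python
def diff_tree_args(commit_id, rename_limit=None, find_copies=True):
    """Build an argument list for `git diff-tree` with rename/copy detection."""
    args = ['git', 'diff-tree', '-r', '--name-status', '-M']
    if find_copies:
        args.append('-C')
    if rename_limit is not None:
        # -l<num> makes git skip rename/copy detection when there are
        # more candidate targets than <num>.
        args.append('-l%d' % rename_limit)
    args.append(commit_id)
    return args


# Hypothetical ini setting, shown here as an already-parsed dict of strings.
settings = {'scm.git.rename_limit': '500'}
limit = int(settings['scm.git.rename_limit'])
print(diff_tree_args('abc123', rename_limit=limit))
```

Leaving `rename_limit` as `None` keeps git's own default, so the option would only change behavior for installs that set it.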


---

** [tickets:#7925]  Speed up diff processing with binary files**

**Status:** review
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 27, 2015 08:28 PM UTC
**Owner:** Heith Seewald

