Posted to dev@allura.apache.org by Heith Seewald <hs...@slashdotmedia.com> on 2015/08/12 17:25:11 UTC

[allura:tickets] #7962 Better binary file detection



---

** [tickets:#7962] Better binary file detection**

**Status:** open
**Milestone:** unreleased
**Created:** Wed Aug 12, 2015 03:25 PM UTC by Heith Seewald
**Last Updated:** Wed Aug 12, 2015 03:25 PM UTC
**Owner:** nobody


Improve our binary/text file detection.

[Here is an example](https://sourceforge.net/p/planetexpress/git/ci/ba49bf3d9b3185ea2b0dc5cb6f7a3f8a6781f0c4/) of a JPG with a ".d" extension that made it through the **has_html_view** function (`allura.model.repository.Blob#has_html_view`).


Performance should be a primary consideration because of the large number of calls on bigger commits.


---

Sent from forge-allura.apache.org because dev@allura.apache.org is subscribed to https://forge-allura.apache.org/p/allura/tickets/

To unsubscribe from further messages, a project admin can change settings at https://forge-allura.apache.org/p/allura/admin/tickets/options.  Or, if this is a mailing list, you can unsubscribe from the mailing list.

[allura:tickets] #7962 Better binary file detection

Posted by Dave Brondsema <da...@brondsema.net>.
We should test the https://pypi.python.org/pypi/binaryornot/ library and see if it's better than our current use of the `magic` library.




[allura:tickets] #7962 Better binary file detection

Posted by Igor Bondarenko <je...@gmail.com>.
- **status**: in-progress --> open
- **Comment**:

Ok, leaving as-is for now



---

** [tickets:#7962] Better binary file detection**

**Status:** open
**Milestone:** unreleased
**Labels:** 42cc 
**Created:** Wed Aug 12, 2015 03:25 PM UTC by Heith Seewald
**Last Updated:** Wed Oct 21, 2015 02:38 PM UTC
**Owner:** Igor Bondarenko



[allura:tickets] Re: #7962 Better binary file detection

Posted by Dave Brondsema <da...@brondsema.net>.
Good analysis. I guess the .d extension was indeed a misleading example.

When `has_html_view` returns True, the next step is often to read the contents so we can display the file, or to run a diff or something similar. In the cases where we're reading the file anyway, we could do a content-based check to confirm it is text. For simple file display that could work. Diff/commit logic might be trickier to sort out.
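As a rough illustration of that idea, here is a minimal sketch of a check that runs on bytes that have already been read for display, so it adds no extra file access. The names `is_probably_text`, `render_or_fallback`, and `SAMPLE_SIZE` are hypothetical and not part of Allura, and the heuristic (no NUL bytes, sample decodes as UTF-8) is only an assumption about what "confirm it is text" could mean.

```python
import codecs

# Hypothetical sample size; only the head of the data is inspected.
SAMPLE_SIZE = 8192


def is_probably_text(data: bytes, sample_size: int = SAMPLE_SIZE) -> bool:
    """Heuristic check on bytes that are already in memory."""
    sample = data[:sample_size]
    # NUL bytes are rare in text files and common in binary formats.
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that was cut in half by the sample boundary.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def render_or_fallback(filename_says_text: bool, data: bytes) -> str:
    """Decide how to present a blob once its content has been read."""
    if filename_says_text and is_probably_text(data):
        return "inline"
    return "download"
```

One limitation of this sketch: text in a legacy encoding such as Latin-1 can fail the UTF-8 step and be treated as binary, so a real check would probably need to be more lenient.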

I'm thinking it's pretty good as-is for now, and we were mostly just confused by the .d extension.


---

** [tickets:#7962] Better binary file detection**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** 42cc 
**Created:** Wed Aug 12, 2015 03:25 PM UTC by Heith Seewald
**Last Updated:** Wed Oct 21, 2015 02:38 PM UTC
**Owner:** Igor Bondarenko



[allura:tickets] #7962 Better binary file detection

Posted by Igor Bondarenko <je...@gmail.com>.
The current algorithm for `has_html_view` looks something like this:

1. Guess the mimetype based on the filename only (`mimetypes.guess_type`).
2. If it is `text/*`, assume we can display it.
3. If not, check the various lists of "viewable extensions"; if one of them contains the extension of the given filename, assume we can display it.
4. If not, guess the file type based on content (using `python-magic`); if it is text, assume we can display it.
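The steps above can be sketched roughly as follows. This is not Allura's actual code: `has_html_view_sketch`, `VIEWABLE_EXTENSIONS`, and `looks_like_text` are hypothetical names, the extension set is made up, and a simple NUL-byte test stands in for the `python-magic` call in step 4 so that the sketch needs only the standard library. The `.d` mapping is registered explicitly because whether `mimetypes` knows it depends on the system's mime.types files.

```python
import mimetypes
import os

# Private registry so the sketch behaves the same on every machine.
_mime = mimetypes.MimeTypes()
_mime.add_type("text/x-dsrc", ".d")  # assumption: mirrors the mapping seen in the thread

# Hypothetical stand-in for the "viewable extensions" lists.
VIEWABLE_EXTENSIONS = {".cfg", ".ini", ".lock"}


def looks_like_text(sample: bytes) -> bool:
    """Stand-in for the python-magic check: treat NUL bytes as binary."""
    return b"\x00" not in sample


def has_html_view_sketch(filename, read_head):
    """read_head is a callable returning the first bytes of the file.

    It is only invoked in step 4, so steps 1-3 never touch the content.
    """
    # Steps 1 and 2: filename-only mimetype guess.
    mime_type, _encoding = _mime.guess_type(filename)
    if mime_type and mime_type.startswith("text/"):
        return True
    # Step 3: known viewable extensions.
    ext = os.path.splitext(filename)[1].lower()
    if ext in VIEWABLE_EXTENSIONS:
        return True
    # Step 4: fall back to inspecting the content.
    return looks_like_text(read_head())
```

With this sketch, `has_html_view_sketch("2.jpg.d", reader)` returns True without ever calling `reader`, which is the false positive described in the ticket.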

The problem with the example above is that ".d" is a valid extension for D programming language source files. In step (1) it is detected as such (mimetype `text/x-dsrc`).

One way to fix this is to always check the content of the file when we think it is a text file that we can display, but that will slow things down significantly. On my local machine, the content-based check is about 184 times slower in the best-case scenario (120ms vs. 0.65ms). It would therefore be slower for every viewable file (and most files are viewable in a typical repo).
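For anyone who wants to reproduce this kind of comparison locally, a minimal timing sketch might look like the one below. It does not measure `python-magic`: `content_based` is a hypothetical stand-in that reads the head of the file and looks for NUL bytes, so the absolute numbers and the ratio will differ from the figures quoted above. The function names and the `repeats` default are assumptions.

```python
import mimetypes
import timeit


def name_based(path):
    """Filename-only guess; never opens the file."""
    return mimetypes.guess_type(path)[0]


def content_based(path, sample_size=8192):
    """Stand-in content check: open the file and look for NUL bytes."""
    with open(path, "rb") as fh:
        return b"\x00" not in fh.read(sample_size)


def compare(path, repeats=200):
    """Return average seconds per call for (name_based, content_based)."""
    t_name = timeit.timeit(lambda: name_based(path), number=repeats) / repeats
    t_content = timeit.timeit(lambda: content_based(path), number=repeats) / repeats
    return t_name, t_content
```

Calling `compare("some/file.d")` on a real checkout returns the two averages; repeated reads of the same file are served from the OS cache, so this is closer to a best case than to a cold repository.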

I have checked the `binaryornot` and `filemagic` libraries, but they run at the same speed as `python-magic`, which we're using now. Most of the time is probably spent accessing the filesystem rather than actually guessing the file's type.

I can't think of any way to exclude false positives without a performance penalty. Any thoughts?

To reduce the performance penalty, we could check file content only for files with more than one dot in the filename (like `2.jpg.d`), but that seems like a very poor heuristic to me.
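A minimal sketch of that heuristic, with a hypothetical helper name `needs_content_check`, shows why it is weak: it triggers on common, harmless names and misses a binary file that has a single misleading extension.

```python
import posixpath


def needs_content_check(path: str) -> bool:
    """True when the file name (not the directories) has more than one dot."""
    name = posixpath.basename(path)
    return name.count(".") > 1
```

For example, `needs_content_check("img/2.jpg.d")` is True as intended, but so is `needs_content_check("jquery.min.js")`, while a JPEG committed as `photo.d` would return False and still be treated as text.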


---

** [tickets:#7962] Better binary file detection**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** 42cc 
**Created:** Wed Aug 12, 2015 03:25 PM UTC by Heith Seewald
**Last Updated:** Tue Aug 18, 2015 08:56 AM UTC
**Owner:** Igor Bondarenko

