Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2007/02/01 18:49:05 UTC

[jira] Created: (JCR-728) Automatic MIME type detection

Automatic MIME type detection
-----------------------------

                 Key: JCR-728
                 URL: https://issues.apache.org/jira/browse/JCR-728
             Project: Jackrabbit
          Issue Type: Improvement
          Components: indexing
            Reporter: Jukka Zitting
            Priority: Minor


Currently only the jcr:mimeType property is used to determine the MIME type and thus the applicable text extractor to use for indexing a document. If the jcr:mimeType property is not available or is set to a generic value like "application/octet-stream", then the indexer could also use some heuristics based on the node name or magic numbers within the binary stream to determine the type of the document.
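As a rough illustration of the fallback described above, here is a minimal Java sketch (assuming Java 11 or later). It is not Jackrabbit code: the class name MimeTypeSniffer, its methods, and the tiny extension and magic-number tables are all hypothetical, and a real detector would load a much larger database. The declared type is trusted unless it is missing or generic; otherwise the first bytes of the stream are checked, and the node name extension is used as a last resort.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit.
final class MimeTypeSniffer {

    static final String GENERIC = "application/octet-stream";

    // Illustrative tables only; deliberately incomplete.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "txt", "text/plain",
            "html", "text/html",
            "png", "image/png");

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final int MAX_MAGIC_LENGTH = 8;

    // Returns the declared type unless it is missing or generic, then
    // falls back to magic numbers, then to the node name extension.
    static String detect(String declaredType, String nodeName, InputStream stream) {
        if (declaredType != null) {
            String declared = declaredType.trim();
            if (!declared.isEmpty() && !GENERIC.equalsIgnoreCase(declared)) {
                return declared;
            }
        }
        String byMagic = detectByMagic(stream);
        if (byMagic != null) {
            return byMagic;
        }
        String byName = detectByName(nodeName);
        return byName != null ? byName : GENERIC;
    }

    // Peeks at the first bytes and rewinds, so the stream can still be
    // handed to a text extractor. Streams without mark support are skipped.
    static String detectByMagic(InputStream stream) {
        if (stream == null || !stream.markSupported()) {
            return null;
        }
        try {
            stream.mark(MAX_MAGIC_LENGTH);
            byte[] head = stream.readNBytes(MAX_MAGIC_LENGTH);
            stream.reset();
            if (startsWith(head, PDF_MAGIC)) {
                return "application/pdf";
            }
            if (startsWith(head, PNG_MAGIC)) {
                return "image/png";
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the lower-cased extension after the last dot, if any.
    static String detectByName(String nodeName) {
        if (nodeName == null) {
            return null;
        }
        int dot = nodeName.lastIndexOf('.');
        if (dot < 0 || dot == nodeName.length() - 1) {
            return null;
        }
        return BY_EXTENSION.get(nodeName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        InputStream pdf = new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detect("application/octet-stream", "report.bin", pdf));
        System.out.println(detect(null, "notes.TXT", null));
    }
}
```

Checking magic numbers before the node name is a choice made for this sketch, on the assumption that file content is harder to mislabel than a name; the issue itself does not prescribe an order.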

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Paco Avila (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469564 ] 

Paco Avila commented on JCR-728:
--------------------------------

Why LGPL is troublesome? Source code using an LGPL library does not have to be LGPL or GPL. A port of libmagic to Java would be nice because there are lots of MIME definitions in its format.

And yes, I think it is more useful to add more functionality to jackrabbit-index-filters. By the way, some MS Office files throw errors when they are indexed. I know this is a POI issue, but is this project abandoned? There have been no updates since 04-08-2004 :(


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469982 ] 

Jukka Zitting commented on JCR-728:
-----------------------------------

Thanks for the POI update!

A Commons project for MIME type detection seems like a nice prospect. I'll try to come up with something along these lines in the near future.


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Martin van den Bemt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469973 ] 

Martin van den Bemt commented on JCR-728:
-----------------------------------------

POI is active and working towards a release. I don't know if that solves your problem though, so better file an issue or ask on poi-user whether the problem you have has been solved yet.


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469696 ] 

Jukka Zitting commented on JCR-728:
-----------------------------------

> Why LGPL is troublesome?

The LGPL works as intended for C code, but is troublesome for languages like Java. See http://wiki.apache.org/jakarta/Using_LGPL'd_code and the current draft of the third party license policy at http://www.apache.org/legal/3party.html for more details.

It could be possible for us to introduce a limited LGPL dependency if there's no reasonable alternative (see the conditions on the Jakarta wiki), but I don't think jmimemagic is essential enough to justify such trouble. This is also the reason why we can't release the Hibernate persistence manager we currently have in the contrib directory.

> I know this is a POI issue, but is this project abandoned?

I've seen some activity there, but I don't know the exact status of the project. The latest Jakarta board report mentioned some conflict over the status of POI, but I hope that's been cleared. It would be nice if we didn't have to start looking for an alternative.



[jira] Updated: (JCR-728) Automatic MIME type detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-728:
------------------------------

    Fix Version/s:     (was: 2.0-beta6)

> Automatic MIME type detection
> -----------------------------
>
>                 Key: JCR-728
>                 URL: https://issues.apache.org/jira/browse/JCR-728
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Currently only the jcr:mimeType property is used to determine the MIME type and thus the applicable text extractor to use for indexing a document. If the jcr:mimeType property is not available or is set to a generic value like "application/octet-stream", then the indexer could also use some heuristics based on the node name or magic numbers within the binary stream to determine the type of the document.


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Paco Avila (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469541 ] 

Paco Avila commented on JCR-728:
--------------------------------

On Linux there are lots of magic definitions in the file "/usr/share/file/magic".


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Martin van den Bemt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469975 ] 

Martin van den Bemt commented on JCR-728:
-----------------------------------------

Could automatic MIME type detection be a nice project for Jakarta Commons? I don't think Labs is the place to do that (if I understand their charter), since you cannot do any releases over there.


[jira] Resolved: (JCR-728) Automatic MIME type detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved JCR-728.
-------------------------------

       Resolution: Duplicate
    Fix Version/s: 2.0-beta6

This has been implemented as a part of the Tika integration in JCR-1878.

> Automatic MIME type detection
> -----------------------------
>
>                 Key: JCR-728
>                 URL: https://issues.apache.org/jira/browse/JCR-728
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>             Fix For: 2.0-beta6
>
>
> Currently only the jcr:mimeType property is used to determine the MIME type and thus the applicable text extractor to use for indexing a document. If the jcr:mimeType property is not available or is set to a generic value like "application/octet-stream", then the indexer could also use some heuristics based on the node name or magic numbers within the binary stream to determine the type of the document.


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Paco Avila (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469540 ] 

Paco Avila commented on JCR-728:
--------------------------------

I'm currently evaluating http://jmimemagic.sourceforge.net/, but it seems a bit limited in some cases. Still, it is the best option for now.


[jira] Commented: (JCR-728) Automatic MIME type detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469546 ] 

Jukka Zitting commented on JCR-728:
-----------------------------------

I've looked at jmimemagic too, but as you mentioned, it's a bit limited. It's also licensed under the LGPL, which makes it a bit troublesome for us.

There's a recent codebase at http://hedges.net/archives/2006/11/08/java-shared-mime-info/ that seems pretty good, but the code is under the GPL.

I recently talked with some people from Apache Nutch about a project to implement the shared mime info standard from freedesktop.org (http://www.freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec). Apparently someone already has some Apache-licensed code for that, but I haven't seen it yet.

I've been planning to propose an implementation project for the mime info standard in Apache Labs (http://labs.apache.org/), but if there's more interest within the Jackrabbit community we could also start working on it within the jackrabbit-text-extractors component.
