Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2007/02/01 18:49:05 UTC
[jira] Created: (JCR-728) Automatic MIME type detection
Automatic MIME type detection
-----------------------------
Key: JCR-728
URL: https://issues.apache.org/jira/browse/JCR-728
Project: Jackrabbit
Issue Type: Improvement
Components: indexing
Reporter: Jukka Zitting
Priority: Minor
Currently only the jcr:mimeType property is used to determine the MIME type and thus the applicable text extractor to use for indexing a document. If the jcr:mimeType property is not available or is set to a generic value like "application/octet-stream", then the indexer could also use some heuristics based on the node name or magic numbers within the binary stream to determine the type of the document.
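A minimal sketch of the fallback described above, not the actual Jackrabbit indexer code. The class name MimeTypeSniffer, the detect method, and the small tables of extensions and magic numbers are all hypothetical and chosen only for illustration. The declared jcr:mimeType is trusted unless it is missing or generic; otherwise the leading bytes of the stream are checked, and finally the node name extension.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names and tables are illustrative assumptions.
final class MimeTypeSniffer {

    static final String GENERIC = "application/octet-stream";

    // Well-known leading byte signatures ("magic numbers").
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    // Deliberately tiny extension table; a real one would be much larger.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "png", "image/png",
            "txt", "text/plain",
            "html", "text/html");

    private MimeTypeSniffer() {
    }

    /**
     * @param declared value of jcr:mimeType, may be null
     * @param nodeName name of the node holding the binary, may be null
     * @param prefix   first bytes of the binary stream, may be null
     */
    static String detect(String declared, String nodeName, byte[] prefix) {
        // 1. Trust a specific declared type.
        if (declared != null && !declared.isEmpty()
                && !GENERIC.equalsIgnoreCase(declared)) {
            return declared;
        }
        // 2. Look for a magic number at the start of the stream.
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, PNG_MAGIC)) {
            return "image/png";
        }
        // 3. Fall back to the node name extension.
        if (nodeName != null) {
            int dot = nodeName.lastIndexOf('.');
            if (dot >= 0 && dot < nodeName.length() - 1) {
                String ext = nodeName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String type = BY_EXTENSION.get(ext);
                if (type != null) {
                    return type;
                }
            }
        }
        // 4. Nothing matched; keep the generic type.
        return GENERIC;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Checking magic numbers before the name is only one possible ordering. Container formats (for example ZIP-based office documents) share a signature, so a fuller implementation would probably have to combine both hints rather than stop at the first match.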
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Paco Avila (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469564 ]
Paco Avila commented on JCR-728:
--------------------------------
Why is the LGPL troublesome? Source code using an LGPL library does not have to be LGPL or GPL. A port of libmagic to Java would be nice because there are lots of MIME definitions in its format.
And yes, I think it is more useful to add more functionality to jackrabbit-index-filters. By the way, some MS Office files throw errors when they are indexed. I know this is a POI issue, but is this project abandoned? There have been no updates since 04-08-2004 :(
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469982 ]
Jukka Zitting commented on JCR-728:
-----------------------------------
Thanks for the POI update!
A commons project for MIME type detection seems like a nice prospect. I'll try to come up with something along these lines in the near future.
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Martin van den Bemt (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469973 ]
Martin van den Bemt commented on JCR-728:
-----------------------------------------
POI is active and working towards a release. I don't know if that solves your problem though, so better file an issue or ask on poi-user whether the problem you have is solved yet.
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469696 ]
Jukka Zitting commented on JCR-728:
-----------------------------------
> Why LGPL is troublesome?
The LGPL works as intended for C code, but is troublesome for languages like Java. See http://wiki.apache.org/jakarta/Using_LGPL'd_code and the current draft of the third party license policy at http://www.apache.org/legal/3party.html for more details.
It could be possible for us to introduce a limited LGPL dependency if there's no reasonable alternative (see the conditions on the Jakarta wiki), but I don't think jmimemagic is essential enough to justify such trouble. This is also the reason why we can't release the Hibernate persistence manager we currently have in the contrib directory.
> I know this is a POI issue, but is this project abandoned?
I've seen some activity there, but I don't know the exact status of the project. The latest Jakarta board report mentioned some conflict over the status of POI, but I hope that's been cleared. It would be nice if we didn't have to start looking for an alternative.
[jira] Updated: (JCR-728) Automatic MIME type detection
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated JCR-728:
------------------------------
Fix Version/s: (was: 2.0-beta6)
> Automatic MIME type detection
> -----------------------------
>
> Key: JCR-728
> URL: https://issues.apache.org/jira/browse/JCR-728
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: indexing, jackrabbit-core
> Reporter: Jukka Zitting
> Priority: Minor
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Paco Avila (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469541 ]
Paco Avila commented on JCR-728:
--------------------------------
In Linux there are lots of magic definitions in the file "/usr/share/file/magic".
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Martin van den Bemt (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469975 ]
Martin van den Bemt commented on JCR-728:
-----------------------------------------
Could automatic MIME type detection be a nice project for Jakarta Commons? I don't think Labs is the place to do that (if I understand their charter), since you cannot do any releases over there.
[jira] Resolved: (JCR-728) Automatic MIME type detection
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved JCR-728.
-------------------------------
Resolution: Duplicate
Fix Version/s: 2.0-beta6
This has been implemented as a part of the Tika integration in JCR-1878.
> Automatic MIME type detection
> -----------------------------
>
> Key: JCR-728
> URL: https://issues.apache.org/jira/browse/JCR-728
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: indexing, jackrabbit-core
> Reporter: Jukka Zitting
> Priority: Minor
> Fix For: 2.0-beta6
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Paco Avila (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469540 ]
Paco Avila commented on JCR-728:
--------------------------------
I'm currently evaluating http://jmimemagic.sourceforge.net/, but it seems a bit limited in some cases. Still, it is the best option for now.
[jira] Commented: (JCR-728) Automatic MIME type detection
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469546 ]
Jukka Zitting commented on JCR-728:
-----------------------------------
I've looked at jmimemagic too, but as you mentioned, it's a bit limited. It's also licensed under the LGPL, which makes it a bit troublesome for us.
There's a recent codebase at http://hedges.net/archives/2006/11/08/java-shared-mime-info/ that seems pretty good, but the code is under the GPL.
I recently discussed with some people from Apache Nutch a project to implement the shared MIME info standard from freedesktop.org (http://www.freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec), and apparently someone already had some Apache-licensed code for that, but I haven't seen it yet.
I've been planning to propose an implementation project for the mime info standard in Apache Labs (http://labs.apache.org/), but if there's more interest within the Jackrabbit community we could also start working on it within the jackrabbit-text-extractors component.