Posted to dev@tika.apache.org by "Andre-John Mas (JIRA)" <ji...@apache.org> on 2011/01/28 22:44:45 UTC

[jira] Created: (TIKA-590) Create facility for deeper introspection of media files

Create facility for deeper introspection of media files
-------------------------------------------------------

                 Key: TIKA-590
                 URL: https://issues.apache.org/jira/browse/TIKA-590
             Project: Tika
          Issue Type: Wish
          Components: metadata
            Reporter: Andre-John Mas


This feature would allow applications to dig deeper into files to derive metadata that is not present as a tag in the file. For example, a file that has no duration information could, with a little more work, provide this missing information. The idea is to let the API user distinguish between data that is quick to retrieve and data that is slower to retrieve because of the extra processing needed to get it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-590) Create facility for deeper introspection of media files

Posted by "Andre-John Mas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988619#action_12988619 ] 

Andre-John Mas commented on TIKA-590:
-------------------------------------

Some cases I see:
 - hash
 - duration of song or movie
 - language tracks in movie 

I have looked into doing this with the MP3 file format, but in doing so I see it would require a second pass over the InputStream and in certain cases would need to make use of other libraries. For this reason I am wondering whether an extension architecture would be needed. Imagine using a native library such as libvlc on certain platforms.



[jira] Commented: (TIKA-590) Create facility for deeper introspection of media files

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988495#action_12988495 ] 

Nick Burch commented on TIKA-590:
---------------------------------

I'm not sure how that would fit into the current model. However, something similar that might work is setting a value in the parse context to indicate how much work you'd like the parsers to do.

A rough idea would be something like:
public enum ParserExtraWorkLevel { NONE, LIMITED, FULL }

parseContext.set(ParserExtraWorkLevel.class, ParserExtraWorkLevel.FULL);
parser.parse(stream, handler, metadata, parseContext);

Then inside the parser you could check for the extra work level, and do more if requested.
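
To make the parser-side check concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: ParserExtraWorkLevel is only the enum proposed in this comment, SketchContext is a small class-keyed map standing in for the parse context, and SketchAudioParser with its "duration" key and frame-length array is hypothetical. It only shows a parser skipping the expensive step unless FULL was requested.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum from the suggestion above; not an existing Tika type.
enum ParserExtraWorkLevel { NONE, LIMITED, FULL }

// Minimal stand-in for a class-keyed parse context, so the sketch runs on its own.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical audio parser: copies the cheap tags, and only computes a
// missing duration when the caller asked for FULL extra work.
final class SketchAudioParser {
    Map<String, String> parse(Map<String, String> tags, long[] frameMillis, SketchContext context) {
        Map<String, String> metadata = new HashMap<>(tags);
        ParserExtraWorkLevel level =
                context.get(ParserExtraWorkLevel.class, ParserExtraWorkLevel.NONE);

        if (!metadata.containsKey("duration") && level == ParserExtraWorkLevel.FULL) {
            // Stand-in for the expensive work: sum the length of every frame.
            long totalMillis = 0;
            for (long millis : frameMillis) {
                totalMillis += millis;
            }
            metadata.put("duration", Long.toString(totalMillis));
        }
        return metadata;
    }
}
```

Defaulting to NONE when nothing was set keeps existing callers on the fast path, which seems the safer choice for an opt-in feature.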

It's probably worth coming up with a concrete case first though, and when we have a patch that introduces some optional "expensive" work to a parser we can decide on the best way forward.



[jira] Commented: (TIKA-590) Create facility for deeper introspection of media files

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988741#comment-12988741 ] 

Nick Burch commented on TIKA-590:
---------------------------------

TikaInputStream can help with the case of needing to do multiple passes over the stream.
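
As a rough illustration of the multiple-pass idea, the sketch below uses only the JDK, not TikaInputStream itself: it spools the stream to a temporary file once, then reads that file twice. The class name TwoPassSketch is hypothetical, and the two passes (a SHA-256 hash, then a byte count standing in for a frame scan) are placeholders chosen only because a hash was one of the cases mentioned earlier in the thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: spool a one-shot stream to disk so it can be read twice.
final class TwoPassSketch {

    static final class Result {
        final String sha256Hex;
        final long length;

        Result(String sha256Hex, long length) {
            this.sha256Hex = sha256Hex;
            this.length = length;
        }
    }

    static Result inspect(InputStream in) {
        Path spool = null;
        try {
            spool = Files.createTempFile("two-pass-", ".bin");
            Files.copy(in, spool, StandardCopyOption.REPLACE_EXISTING);

            byte[] buffer = new byte[8192];
            int read;

            // Pass 1: hash the whole content.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            try (InputStream pass1 = Files.newInputStream(spool)) {
                while ((read = pass1.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            }

            // Pass 2: count bytes, standing in for a frame-by-frame scan.
            long length = 0;
            try (InputStream pass2 = Files.newInputStream(spool)) {
                while ((read = pass2.read(buffer)) != -1) {
                    length += read;
                }
            }

            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return new Result(hex.toString(), length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        } finally {
            if (spool != null) {
                try {
                    Files.deleteIfExists(spool);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the temporary file.
                }
            }
        }
    }
}
```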

For the libvlc vs Java case, you'd probably want something like:
* A libvlc-powered movie parser (a mixture of Java and native code)
* A pure Java "switching" parser, e.g. one that uses the normal Java parser if ParserExtraWorkLevel is NONE or LIMITED, and uses the vlc one for FULL, assuming it loads OK
You could then choose whether to use the switching+vlc one by including or not including the jar.
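
The switching arrangement described above could be sketched roughly as follows. All names here are hypothetical (MediaParser, PureJavaParser, SwitchingParser, and the class name passed to tryLoad); none of them are Tika or libvlc APIs. The sketch assumes the optional parser is looked up reflectively by class name, so that a missing jar or a native library that fails to link simply leaves the pure Java parser in charge.

```java
import java.util.Optional;

// Hypothetical enum from earlier in the thread; not an existing Tika type.
enum ParserExtraWorkLevel { NONE, LIMITED, FULL }

// Hypothetical, deliberately tiny parser interface for this sketch.
interface MediaParser {
    String parse(String resourceName, ParserExtraWorkLevel level);
}

// Stand-in for the normal pure Java parser.
final class PureJavaParser implements MediaParser {
    @Override
    public String parse(String resourceName, ParserExtraWorkLevel level) {
        return "java:" + resourceName;
    }
}

// Delegates to the optional native-backed parser only for FULL, and only if it loaded.
final class SwitchingParser implements MediaParser {
    private final MediaParser fallback;
    private final Optional<MediaParser> nativeParser;

    SwitchingParser(MediaParser fallback, Optional<MediaParser> nativeParser) {
        this.fallback = fallback;
        this.nativeParser = nativeParser;
    }

    // Tries to instantiate an optional parser by class name; returns empty when
    // the class is not on the classpath or its native code fails to link.
    static Optional<MediaParser> tryLoad(String className) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            return instance instanceof MediaParser
                    ? Optional.of((MediaParser) instance)
                    : Optional.empty();
        } catch (ReflectiveOperationException | LinkageError e) {
            return Optional.empty();
        }
    }

    @Override
    public String parse(String resourceName, ParserExtraWorkLevel level) {
        if (level == ParserExtraWorkLevel.FULL && nativeParser.isPresent()) {
            return nativeParser.get().parse(resourceName, level);
        }
        return fallback.parse(resourceName, level);
    }
}
```

With this shape, leaving the native jar off the classpath needs no configuration change: tryLoad returns empty and every call falls through to the pure Java parser.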


