Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2011/12/04 01:32:16 UTC

parsers implementations for media files (mpeg, flv, webm)

 I don't see any implementations of the Parser interface for media
files (mpeg, flv, webm)
~
 http://tika.apache.org/1.0/api/org/apache/tika/parser/Parser.html
~
 Am I wrong, or is this a design decision?
~
 At the very least, you may want to extract their metadata
~
 lbrtchx

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Albretch Mueller <lb...@gmail.com>.

~
 thanks to the ffmpeg developers (especially Stefano Sabatini), tika's media
handling is basically done.
~
 All that is needed now is schema-based xml handling to get the
metadata, which IMHO is the nicest way we could have dreamed of having
metadata served to tika
~
 lbrtchx
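
A minimal sketch of the XML handling mentioned above, using only the JDK's DOM parser. It assumes the XML output has a root element containing `stream` elements that carry their properties as attributes (`codec_type`, `codec_name`); the class name `FfprobeXmlSketch` and the sample document are hypothetical and hand-written, not captured ffprobe output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the attributes of every <stream> element.
final class FfprobeXmlSketch {

    static List<Map<String, String>> streams(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // getElementsByTagName returns the elements in document order
        NodeList nodes = doc.getElementsByTagName("stream");
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            NamedNodeMap attrs = ((Element) nodes.item(i)).getAttributes();
            Map<String, String> stream = new HashMap<>();
            for (int j = 0; j < attrs.getLength(); j++) {
                stream.put(attrs.item(j).getNodeName(), attrs.item(j).getNodeValue());
            }
            result.add(stream);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample in the assumed shape, not real ffprobe output.
        String xml = "<ffprobe><streams>"
                + "<stream index=\"0\" codec_name=\"mpeg2video\" codec_type=\"video\"/>"
                + "<stream index=\"1\" codec_name=\"mp3\" codec_type=\"audio\"/>"
                + "</streams></ffprobe>";
        for (Map<String, String> s : streams(xml)) {
            System.out.println(s.get("codec_type") + ": " + s.get("codec_name"));
        }
    }
}
```

Each map could then be copied into a metadata object key by key; which keys are worth keeping is left open here.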


Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Jukka Zitting <ju...@gmail.com>.

Hi Albert,

On Sat, Dec 31, 2011 at 7:27 PM, Albretch Mueller <lb...@gmail.com> wrote:
>  I think I should work next on making this starting code more kosher
> and align it with tika's base. I notice there is going to be
> "cultural" issues (other than offering/maintaining the code) if the
> decision is made to go ahead and use ffmpeg as our underlying library
> for users of the supported operating systems (how to install it and
> such things)

Do you already have a Tika patch you're working on? The best way to
align your code with Tika is to share it as early as possible for
review and feedback.

> Can you just tell me which class/documentation based on a similar
> library should I use as example?

See the org.apache.tika.parser.external.ExternalParser class in
tika-core for generic code for using an external program as a Tika
parser. You will probably need some customization for interpreting the
output from ffmpeg, but the basic structure of the code should be the
same.

> Is it OK for me to just post the whole code listing here or some
> public site before you decide (or not) to make me one of that code
> base committers?

The best would be for you to file a feature request for this in the
Tika issue tracker at https://issues.apache.org/jira/browse/TIKA. Then
attach a patch (see
http://www.apache.org/dev/contributors.html#patches for instructions)
for review.

See also http://community.apache.org/contributors/index.html for how
the path from a contributor to a committer works at Apache.

BR,

Jukka Zitting

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Albretch Mueller <lb...@gmail.com>.

 guys,
~
 I just wanted to let you know that I haven't forgotten about working on
tika's media branch. The thing is that the ffmpeg developers (especially
Stefano Sabatini) have been nicely receptive and helpful:
~
 http://ffmpeg-users.933282.n4.nabble.com/
~
 : getting some xml-ish dump while ffprobing a media file's metadata?
~
 : standard err or out?
~
 and also, well, you know, we are all busy doing other things ;-), so
I have had to change my starting code a few times. I just went monkey
and did things the way I thought best (without reading your code base),
because I wanted to thoroughly test ffmpeg's output first (to the extent
that java would let me). Basically I used some thread-based code to
capture the standard output and error streams
~
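// __
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
// __
// drains one stream of the child process on its own thread, so that a
// full pipe buffer cannot block the process
class StreamGrabber02 extends Thread{
 private final InputStream IS;
 private volatile String aStringedStream;
 private static final int iSz = 8192;
// __
 StreamGrabber02(InputStream IS, String aThrdNm){
  this.IS = IS;
  setName(aThrdNm + "_" + this);
 }
// __
 public void run(){
  int iL;
  byte[] bAr = new byte[iSz];
  ByteArrayOutputStream BAOS = new ByteArrayOutputStream();
  try{
   // read until end of stream, i.e. until the process closes its side
   while ((iL = IS.read(bAr, 0, iSz)) != -1){ BAOS.write(bAr, 0, iL); }
   // assumption: the output is decoded as UTF-8, like the log files below
   this.aStringedStream = BAOS.toString("UTF-8");
  }catch(IOException IOX){
   System.err.println("// __ thread name: |" + getName() + "|");
   IOX.printStackTrace(System.err);
  }
 }
// __
 // null until the thread has finished (or if reading failed); join() first
 public String getStringedStream(){ return(this.aStringedStream); }
}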
~
 and running ffprobe like this
~
 Process Prx = RTm.exec(new String[]{"ffprobe", "-unit",
   "-show_streams", "-show_format", "-loglevel", "-loglevel",
   <media input file>});
~
 InputStream ISErr = Prx.getErrorStream();
~
 StreamGrabber02 SGrbErr = new StreamGrabber02(ISErr, aThrdNm + ".err");
 SGrbErr.start();
~
 InputStream ISIn = Prx.getInputStream();
 StreamGrabber02 SGrbIn = new StreamGrabber02(ISIn, aThrdNm + ".in");
 SGrbIn.start();
// __
 iExitVal = Prx.waitFor();
 SGrbErr.join();  SGrbIn.join();
 ISIn.close();  ISErr.close();
 Prx.destroy();
~
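 A rough sketch of how the captured standard output could be turned into key/value pairs. It assumes the default ffprobe text output for -show_streams and -show_format, where each block is wrapped in [STREAM] ... [/STREAM] or [FORMAT] ... [/FORMAT] and holds one key=value pair per line. The names `FfprobeSectionSketch` and `Section` are hypothetical, and the sample in `main` is hand-written, not a real log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for "[NAME]" / "key=value" / "[/NAME]" blocks.
final class FfprobeSectionSketch {

    // One parsed block, e.g. a single stream or the format summary.
    static final class Section {
        final String name;
        final Map<String, String> fields = new LinkedHashMap<>();
        Section(String name) { this.name = name; }
    }

    static List<Section> parse(String output) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (String raw : output.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) { continue; }
            if (line.startsWith("[/")) {
                // closing marker: the current block is complete
                if (current != null) { sections.add(current); }
                current = null;
            } else if (line.startsWith("[") && line.endsWith("]")) {
                // opening marker: start a new block named after the tag
                current = new Section(line.substring(1, line.length() - 1));
            } else if (current != null) {
                // split on the first '=' only, values may contain '='
                int eq = line.indexOf('=');
                if (eq > 0) { current.fields.put(line.substring(0, eq), line.substring(eq + 1)); }
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String sample = "[STREAM]\nindex=0\ncodec_type=video\n[/STREAM]\n"
                + "[FORMAT]\nnb_streams=1\n[/FORMAT]\n";
        for (Section s : parse(sample)) {
            System.out.println(s.name + " " + s.fields);
        }
    }
}
```

 With the code above, the input would be the string returned by SGrbIn.getStringedStream() once both grabber threads have been joined.
~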
 then, depending on whether there were errors, on the exit code and on
the extension of the file, I logged the stream outputs to
different folders/files
~
// __
 // both cases write the same log; only the base folder differs
 File ODirBase = (iExitVal != 0) ? ODirErrs : ODirOK;
 ODir = new File(ODirBase, Integer.toString(iExitVal));
 if(!ODir.exists()){ ODir.mkdirs(); }
 OFl = new File(ODir, "_" + aX + ".log.txt");
 // append, so repeated runs accumulate in the same file
 OSWrtr = new OutputStreamWriter(new FileOutputStream(OFl, true), "UTF-8");
 try{ OSWrtr.write(aB.toString()); }
 finally{ OSWrtr.close(); }
// __
 if(iExitVal == 0){ ++lFlsOK; }
~
 In order to do functional and stress testing, I had no option but
to abuse http://samples.mplayerhq.hu/00-README. I won't provide you
with stress test results right now, because the ffmpeg developers have
added a flag to make ffprobe output schema-based xml and I am
again/still streamlining things
~
 I think I should work next on making this starting code more kosher
and aligning it with tika's code base. I notice there are going to be
"cultural" issues (other than offering/maintaining the code) if the
decision is made to go ahead and use ffmpeg as our underlying library
for users of the supported operating systems (how to install it and
such things)
~
 Can you just tell me which class/documentation based on a similar
library I should use as an example?
~
 Is it OK for me to just post the whole code listing here or on some
public site before you decide (or not) to make me one of that code
base's committers?
~
 Any other advice about how to proceed next?
~
 lbrtchx

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Nick Burch <ni...@alfresco.com>.

On 05/12/11 21:41, Albretch Mueller wrote:
>>   If you're interested in helping ...
>
>  Yes, I can and would offer man/mind hours to including movie media
> files parsing (and eventually processing) in tika

Great!

>  I am definitely more inclined to use ffmpeg (your third option) but I
> think we should carefully think about and probably use more than one
> option. There already is a Java port of parts of the FFMPEG project
> (jffmpeg.sourceforge.net) but as you may know already ;-) its
> licensing is messy

Something like ffmpeg (via external) or jffmpeg wouldn't be able to be 
included in the core product anyway, because of licensing reasons. 
They'd have to be maintained at least partly externally, so there would 
be nothing to stop people picking the right one for them.

(Possibly the code to talk to ffmpeg could be included in core, with the 
user responsible for downloading ffmpeg to use it, but code to talk to 
jffmpeg would need to be external)

>   About your second option all the info is in the containers anyway,
> codecs are just encoded data

Alas not really, at least not the way users seem to think of things...

Consider this example. We have an mpeg container. Within it we find 4 mp3 
audio streams (with different languages tagged), and 2 5.1 channel ogg 
vorbis audio streams (same language). We also find 2 subtitle streams, 
and 3 mpeg2 video streams (1 at a higher bitrate than the others).

If you ask the user, that's an mpeg2 video with 2 alternate camera 
angles, high quality English audio, a high quality English director's 
description, and translated audio.

If we don't understand the codecs, we can't figure out what streams are 
at what bitrates, which ones are video and which are audio etc. 
Especially on some container formats (ogg springs to mind) which are 
very general, the container provides framing info but you need to know 
about the codecs to figure out what is in it.

The first step is going to be to make sure we can recognise all the 
different media containers (there's something like 6-10 of them), as 
we'll need that to know if we should handle them or not. Next we'd want 
to understand the basics of the container, to pull out whatever metadata 
we can about it. Finally we'd need to implement basic metadata 
extractors for the key codecs (we already have this for some of the 
audio formats) so we can get info on what's in the container.

>   Could you guide me/us of a running list of what you think needs to be done?

First up, I'd say one thing to do is come up with some (very small!) 
sample files in the different formats. Initially just one per container, 
but ideally also some with different contents too. (For example, both 
mpeg with mpeg2+mp3, and mpeg with mpeg2+mp3+mp3+ogg)

Next, using these sample files, we need to ensure that we have mime 
magic for all the container formats, along with unit tests. We'll also 
need to sort out mimetypes for the common combinations, and maybe also 
think about how to describe some of the cases (do we always call it 
after the biggest video format for example? Do we care about the container?)
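
As a rough illustration of what "mime magic" for containers means, the sketch below checks the first bytes of a file against a few well-known signatures: `FLV` for Flash Video, `OggS` for Ogg, the EBML header `1A 45 DF A3` shared by Matroska and WebM, and the MPEG program stream pack start code `00 00 01 BA`. The class name `ContainerMagicSketch` is hypothetical, and the mimetype returned for each signature is a choice made for this example, not Tika's actual mapping.

```java
// Hypothetical detector: maps leading bytes to a container mimetype.
final class ContainerMagicSketch {

    static String detect(byte[] head) {
        if (startsWith(head, 'F', 'L', 'V')) { return "video/x-flv"; }
        if (startsWith(head, 'O', 'g', 'g', 'S')) { return "application/ogg"; }
        // EBML header: Matroska and WebM both start with it, so the
        // DocType inside the header would be needed to tell them apart
        if (startsWith(head, 0x1A, 0x45, 0xDF, 0xA3)) { return "video/x-matroska"; }
        // MPEG program stream pack start code
        if (startsWith(head, 0x00, 0x00, 0x01, 0xBA)) { return "video/mpeg"; }
        return "application/octet-stream";
    }

    // Compares unsigned byte values against the expected signature.
    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) { return false; }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) { return false; }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[]{'O', 'g', 'g', 'S', 0}));
    }
}
```

Signatures alone only identify the container; deciding between, say, an audio-only and a video Ogg file still needs a look at the streams inside, which is the codec problem described above.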

Now, if we wanted to go down the ffmpeg external processor route, we 
need to do two sets of mappings. One is from "ffmpeg -formats" to 
mimetypes, so that our parser can correctly claim the mimetypes it can 
handle. In addition, we need to work out how to map the output of 
"ffmpeg -i" back to our (often new) mimetypes, so we can have a detector 
based on it
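
The first of those two mappings could be sketched roughly as follows. The code assumes that each format line of "ffmpeg -formats" looks like one space, a `D` or blank (demuxing), an `E` or blank (muxing), the format name or a comma-separated list of names, and a description. The class name `FormatsToMimeSketch` and the small lookup table are hypothetical and only cover a handful of formats for illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mapping from ffmpeg format names to mimetypes.
final class FormatsToMimeSketch {

    private static final Map<String, String> MIME_BY_FORMAT = new LinkedHashMap<>();
    static {
        // illustrative entries only, not a complete or agreed table
        MIME_BY_FORMAT.put("flv", "video/x-flv");
        MIME_BY_FORMAT.put("webm", "video/webm");
        MIME_BY_FORMAT.put("matroska", "video/x-matroska");
        MIME_BY_FORMAT.put("ogg", "application/ogg");
        MIME_BY_FORMAT.put("mpeg", "video/mpeg");
        MIME_BY_FORMAT.put("avi", "video/x-msvideo");
    }

    // assumed line layout: " DE name[,name...]   description"
    private static final Pattern LINE =
            Pattern.compile("^ ([D ])([E ])\\s+(\\S+)\\s+.*$");

    // Returns the mimetypes of all formats that can be read (flag D).
    static Set<String> demuxableTypes(String formatsOutput) {
        Set<String> types = new LinkedHashSet<>();
        for (String line : formatsOutput.split("\\r?\\n")) {
            Matcher m = LINE.matcher(line);
            if (!m.matches() || !"D".equals(m.group(1))) { continue; }
            for (String name : m.group(3).split(",")) {
                String mime = MIME_BY_FORMAT.get(name);
                if (mime != null) { types.add(mime); }
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Hand-written sample lines in the assumed layout.
        String sample = " DE flv             FLV format\n"
                + " D  matroska,webm   Matroska/WebM file format\n";
        System.out.println(demuxableTypes(sample));
    }
}
```

Only formats ffmpeg can demux are kept, since a parser can only claim types it is able to read; formats missing from the table are silently skipped in this sketch.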

> I know there are developers extracting the sequences of images of the
> subtitles and using OCR to change them to text ... Any one could see
> how useful such a thing could be. Could tika reach out to those deep
> waters?

Let's have some more progress on the regular OCR stuff first, then we 
can worry about extracting out subtitles and finally figure out how to 
OCR it... :)

Nick

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Albretch Mueller <lb...@gmail.com>.

>  If you're interested in helping ...
~
 Yes, I can and would offer man/mind hours towards including movie media
file parsing (and eventually processing) in tika
~
 I am definitely more inclined to use ffmpeg (your third option) but I
think we should carefully think about and probably use more than one
option. There already is a Java port of parts of the FFMPEG project
(jffmpeg.sourceforge.net) but as you may know already ;-) its
licensing is messy
~
 About your second option all the info is in the containers anyway,
codecs are just encoded data
~
 the box I am using right now:
~
$ ffmpeg -version
ffmpeg version 0.7.1-4:0.7.1-5, Copyright (c) 2000-2011 the Libav developers
  built on Sep  5 2011 06:18:41 with gcc 4.6.1
ffmpeg 0.7.1-4:0.7.1-5
libavutil    51.  7. 0 / 51.  7. 0
libavcodec   53.  5. 0 / 53.  5. 0
libavformat  53.  2. 0 / 53.  2. 0
libavdevice  53.  0. 0 / 53.  0. 0
libavfilter   2.  4. 0 /  2.  4. 0
libswscale    2.  0. 0 /  2.  0. 0
libpostproc  52.  0. 0 / 52.  0. 0
~
 supports (handling of) subtitles for the following formats:
~
$ ffmpeg -codecs | grep VSD
 DEVSD  ffvhuff         Huffyuv FFmpeg variant
 DEVSD  flv             Flash Video (FLV) / Sorenson Spark / Sorenson H.263
 DEVSDT h263            H.263 / H.263-1996
 D VSD  h263i           Intel H.263
 DEVSD  huffyuv         Huffyuv / HuffYUV
 DEVSDT mpeg1video      MPEG-1 video
 DEVSDT mpeg2video      MPEG-2 video
 DEVSDT mpeg4           MPEG-4 part 2
 D VSDT mpegvideo       MPEG-1 video
 D VSDT mpegvideo_xvmc  MPEG-1/2 video XvMC (X-Video Motion Compensation)
 DEVSD  msmpeg4         MPEG-4 part 2 Microsoft variant version 3
 D VSD  msmpeg4v1       MPEG-4 part 2 Microsoft variant version 1
 DEVSD  msmpeg4v2       MPEG-4 part 2 Microsoft variant version 2
 D VSD  svq3            Sorenson Vector Quantizer 3 / Sorenson Video 3 / SVQ3
 D VSD  theora          Theora
 D VSD  vp3             On2 VP3
 DEVSD  wmv1            Windows Media Video 7
 DEVSD  wmv2            Windows Media Video 8
~
 Could you guide me/us of a running list of what you think needs to be done?
~
 I know there are developers extracting the sequences of images of the
subtitles and using OCR to change them to text ... Any one could see
how useful such a thing could be. Could tika reach out to those deep
waters?
~
 The thing is that virtually anything is offered nowadays in some form of media
~
 lbrtchx

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Nick Burch <ni...@alfresco.com>.

On Sun, 4 Dec 2011, Albretch Mueller wrote:
> I don't see media (mpeg, flv, webm files) parsers implementation of
> the Parser interface

The issue is that for most of these, we'd need one of:
  * A suitably licensed Java library to use
  * To write our own Java code to parse the containers and codecs
  * To call out to something on the command line (eg ffmpeg) to do that
    work for us

For the 1st option, I'm not aware of there being many Java libraries for 
the various containers and codecs of interest, let alone ones under a 
suitable license. Please shout if you know of some though!

For the 2nd option, it could be done, but it'd be a lot of work. (I speak 
as the person who's written the Ogg and FLAC parsers, and worked quite a 
bit on the MP3 one)

The third option is my chosen one, but it's not quite ready. I got much of 
it done in the summer, but haven't had a chance to finish it off.... My 
idea was to extend the ExternalParser's support, then provide config to 
call out to ffmpeg (probably "ffmpeg -i <file>") to get the metadata on 
it. If you're interested in helping, I can let you know some of the tasks 
(mapping from "ffmpeg -formats" to supported mimetypes being one)
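
To give a feel for the metadata side of that idea, here is a small sketch that pulls the duration and the stream types out of text shaped like the banner "ffmpeg -i <file>" prints (ffmpeg writes it to standard error). It assumes lines of the form "Duration: HH:MM:SS.ss, ..." and "Stream #0.0: Video: codec, ..."; the class name `FfmpegInfoSketch`, both regular expressions and the sample text are hypothetical and hand-written for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for a few fields of the "ffmpeg -i" banner.
final class FfmpegInfoSketch {

    private static final Pattern DURATION =
            Pattern.compile("Duration: (\\d+):(\\d{2}):(\\d{2}(?:\\.\\d+)?)");
    // tolerates extras between the stream index and the type, e.g. "(eng)"
    private static final Pattern STREAM =
            Pattern.compile("Stream #\\d+[.:]\\d+.*?: (Video|Audio|Subtitle): (\\w+)");

    // Duration in seconds, or -1 when no duration line is present.
    static double durationSeconds(String text) {
        Matcher m = DURATION.matcher(text);
        if (!m.find()) { return -1; }
        return Integer.parseInt(m.group(1)) * 3600
                + Integer.parseInt(m.group(2)) * 60
                + Double.parseDouble(m.group(3));
    }

    // One "type:codec" entry per stream line, in order of appearance.
    static List<String> streams(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STREAM.matcher(text);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT) + ":" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed shape, not a real log.
        String sample = "  Duration: 00:01:30.50, start: 0.000000, bitrate: 1200 kb/s\n"
                + "    Stream #0.0: Video: mpeg2video, yuv420p, 720x576\n"
                + "    Stream #0.1(eng): Audio: mp3, 44100 Hz, stereo\n";
        System.out.println(durationSeconds(sample) + " " + streams(sample));
    }
}
```

Regular expressions over human-readable output are fragile across ffmpeg versions, so this is a starting point rather than a stable interface.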

Nick

Re: parsers implementations for media files (mpeg, flv, webm)

Posted by Albretch Mueller <lb...@gmail.com>.

 OK I found:
~
 http://tika.apache.org/1.0/api/org/apache/tika/parser/video/FLVParser.html
~
 but where are the implementations for the other video files? ;-)
~
 lbrtchx