Posted to dev@tika.apache.org by Christian Reuschling <re...@dfki.uni-kl.de> on 2013/06/07 13:30:33 UTC

Re: MP4Parser triggers ... something between an exception and endDocument() from the ContentHandler's point of view?

It would be very interesting if somebody had a comment on the principle behind this thread...


On 29.05.2013 14:42, Nick Burch wrote:
> On Wed, 29 May 2013, Christian Reuschling wrote:
>> Nevertheless, in this case either an exception (as in all other parsers) or a Tika body of
>> length zero, indicated at least by handler.endDocument(), would be the appropriate
>> behaviour, wouldn't it? From the ContentHandler's point of view, there is nothing in between.
> 
> I'm not sure whether we have a properly documented policy on what a parser should do if it
> receives a file it can't handle. For files that are invalid (e.g. corrupt), I believe an exception
> is the expected result. For the case where the file seems valid but can't be handled by the
> parser, I'm not sure.
> 
> Does anyone know if we have a policy on this, and/or where we should document it?
> 
> Nick

-- 
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:reuschling@dfki.de  http://www.dfki.uni-kl.de/~reuschling/
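
To make the question above concrete, here is a minimal sketch of what a ContentHandler can observe after a single parse call. It uses only the JDK's org.xml.sax types. SketchParser, Outcome and RecordingHandler are hypothetical names invented for this illustration and are not Tika's Parser API. The "in between" case is modelled, as an assumption, as a parser that returns normally without the handler ever seeing endDocument().

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: classify what a ContentHandler observes after one parse call. */
final class ParseOutcomeSketch {

    /** Hypothetical stand-in for a parser; NOT Tika's actual Parser interface. */
    interface SketchParser {
        void parse(ContentHandler handler) throws Exception;
    }

    enum Outcome { EXCEPTION, COMPLETED, STARTED_NOT_ENDED, NO_EVENTS }

    /** Records only whether the two document-level SAX events were seen. */
    static final class RecordingHandler extends DefaultHandler {
        boolean started;
        boolean ended;

        @Override public void startDocument() { started = true; }
        @Override public void endDocument() { ended = true; }
    }

    /** Runs the parser once and reports the situation from the handler's side. */
    static Outcome observe(SketchParser parser) {
        RecordingHandler handler = new RecordingHandler();
        try {
            parser.parse(handler);
        } catch (Exception e) {
            return Outcome.EXCEPTION;          // the caller is told something went wrong
        }
        if (handler.ended) {
            return Outcome.COMPLETED;          // possibly with an empty body
        }
        // Normal return, but no endDocument(): the ambiguous case.
        return handler.started ? Outcome.STARTED_NOT_ENDED : Outcome.NO_EVENTS;
    }

    public static void main(String[] args) {
        System.out.println(observe(h -> { h.startDocument(); h.endDocument(); }));
        System.out.println(observe(h -> { throw new SAXException("corrupt input"); }));
        System.out.println(observe(h -> h.startDocument()));
        System.out.println(observe(h -> { }));
    }
}
```

Only the first two outcomes give the caller an unambiguous signal. The last two are the states a handler cannot distinguish from a parse that is still in progress.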

Re: MP4Parser triggers ... something between an exception and endDocument() from the ContentHandler's point of view?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 7 Jun 2013, Ray Gauss II wrote:
> I think the Parser interface Javadoc would make sense as a place to 
> document, but I don't know if there is an existing policy.

It might be helpful if some kind soul could take a few hours to review all
the existing parsers and summarise what they do on invalid or empty
documents (e.g. 5 throw a TikaException, 1 throws a SAXException, 8 call
startDocument() then endDocument(), 2 do nothing). I don't know what those
numbers will be, but they may help us work out whether there's almost a
standard we can aim for or not!

Nick
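
A review like the one suggested above could be partly automated. The following is a rough sketch of such a tally, not an existing Tika tool. SketchParser and SketchTikaException are hypothetical stand-ins (a real survey would feed an empty or invalid stream to each actual Tika parser and catch Tika's own exception type), and the parser names in main are made up.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: tally how a set of parsers behaves on a problematic document. */
final class ParserSurveySketch {

    /** Hypothetical stand-in for a parser; NOT Tika's actual Parser interface. */
    interface SketchParser {
        void parse(ContentHandler handler) throws Exception;
    }

    /** Hypothetical stand-in for Tika's own exception type. */
    static final class SketchTikaException extends Exception {
        SketchTikaException(String message) { super(message); }
    }

    enum Behaviour {
        TIKA_EXCEPTION, SAX_EXCEPTION, OTHER_EXCEPTION,
        START_THEN_END, PARTIAL_EVENTS, NOTHING
    }

    static Behaviour classify(SketchParser parser) {
        final boolean[] seen = new boolean[2]; // [0] startDocument, [1] endDocument
        ContentHandler handler = new DefaultHandler() {
            @Override public void startDocument() { seen[0] = true; }
            @Override public void endDocument() { seen[1] = true; }
        };
        try {
            parser.parse(handler);
        } catch (SketchTikaException e) {
            return Behaviour.TIKA_EXCEPTION;
        } catch (SAXException e) {
            return Behaviour.SAX_EXCEPTION;
        } catch (Exception e) {
            return Behaviour.OTHER_EXCEPTION;
        }
        if (seen[0] && seen[1]) return Behaviour.START_THEN_END;
        if (seen[0] || seen[1]) return Behaviour.PARTIAL_EVENTS;
        return Behaviour.NOTHING;
    }

    /** Counts how many of the given parsers show each behaviour. */
    static Map<Behaviour, Integer> tally(Map<String, SketchParser> parsers) {
        Map<Behaviour, Integer> counts = new EnumMap<>(Behaviour.class);
        for (SketchParser parser : parsers.values()) {
            counts.merge(classify(parser), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, SketchParser> parsers = new LinkedHashMap<>();
        parsers.put("alpha", h -> { throw new SketchTikaException("unsupported"); });
        parsers.put("beta", h -> { throw new SAXException("broken handler"); });
        parsers.put("gamma", h -> { h.startDocument(); h.endDocument(); });
        parsers.put("delta", h -> { });
        System.out.println(tally(parsers));
    }
}
```

The resulting counts are the kind of summary asked for above; whether they point towards a common convention depends entirely on what the real parsers do.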

Re: MP4Parser triggers ... something between an exception and endDocument() from the ContentHandler's point of view?

Posted by Ray Gauss II <ra...@alfresco.com>.
I think the Parser interface Javadoc would make sense as a place to document this, but I don't know if there is an existing policy.

We'll certainly need to consider things like DelegatingParsers, which may be using other parsers to do portions of the work.

Not the comment on principle you were looking for, but my 2 cents.

Ray
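
To illustrate why delegation complicates any policy, here is a small sketch of one possible arrangement, using only JDK SAX classes. The outer parser owns the single startDocument()/endDocument() pair and hides the delegate's own document events from the caller's handler. SketchParser, BodyOnlyHandler and delegatingTo are hypothetical names for this example; this is an assumption about one reasonable design, not a description of how Tika's delegating parsers behave.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sketch: an outer parser that delegates work but emits one document pair. */
final class DelegationSketch {

    /** Hypothetical stand-in for a parser; NOT Tika's actual Parser interface. */
    interface SketchParser {
        void parse(ContentHandler handler) throws Exception;
    }

    /** Forwards all events except startDocument()/endDocument() to the target. */
    static final class BodyOnlyHandler extends XMLFilterImpl {
        BodyOnlyHandler(ContentHandler target) { setContentHandler(target); }

        @Override public void startDocument() { /* swallowed on purpose */ }
        @Override public void endDocument() { /* swallowed on purpose */ }
    }

    /** Wraps a delegate so that the caller's handler sees exactly one pair. */
    static SketchParser delegatingTo(SketchParser inner) {
        return handler -> {
            handler.startDocument();
            inner.parse(new BodyOnlyHandler(handler));
            handler.endDocument(); // skipped if the delegate throws
        };
    }

    /** Counts document events and collects character data. */
    static final class CountingHandler extends DefaultHandler {
        int starts;
        int ends;
        final StringBuilder text = new StringBuilder();

        @Override public void startDocument() { starts++; }
        @Override public void endDocument() { ends++; }
        @Override public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        SketchParser inner = h -> {
            h.startDocument();
            char[] chars = "hello".toCharArray();
            h.characters(chars, 0, chars.length);
            h.endDocument();
        };
        CountingHandler out = new CountingHandler();
        delegatingTo(inner).parse(out);
        System.out.println(out.starts + " start, " + out.ends + " end, text=" + out.text);
    }
}
```

Note that in this sketch an exception from the delegate propagates without endDocument() being called. Whether the outer parser should instead close the document, or translate the exception, is exactly the kind of decision a documented policy would have to settle.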
