Posted to dev@tika.apache.org by Nick Lothian <nl...@educationau.edu.au> on 2009/02/18 07:22:48 UTC

Reading metadata without downloading entire file

I'm trying to get MP3 Metadata without downloading an entire MP3.

I've set up a FilterInputStream which throws an InterruptedIOException after a given amount of a file is downloaded.

If I point this at an HTML page it works - I can get the title from the metadata.

If I point it at an MP3 file it doesn't give me any metadata at all (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set the download length to be just less than the length of the file. If I download the whole file it works.

(JPGs don't seem to work either)

Why is this so? My understanding was that Tika would work with streams?


Code:

                CountingInputStream stream = new CountingInputStream(method.getResponseBodyAsStream(), new CountingListener() {
                        public void transferred(long amount, InputStream theStream) throws InterruptedIOException {
                                if (amount > 20000L) {
                                        throw new InterruptedIOException();
                                }
                        }
                });


                Metadata metadata = new Metadata();
                metadata.set(Metadata.RESOURCE_NAME_KEY, address);
                try {
                        parser.parse(stream, getXmlContentHandler(), metadata);
                } catch (Exception e) {
                        e.printStackTrace();
                } finally {
                        System.out.println("size = " + stream.getTransferred());
                        stream.close();
                }
                System.out.println(Arrays.toString(metadata.names()));
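
The CountingInputStream and CountingListener classes used above are not shown in the post. As a rough, self-contained sketch of the same idea, a FilterInputStream subclass could count bytes itself and throw once a limit is passed. The class name LimitedInputStream and its constructor are hypothetical, not part of Tika or HttpClient, and skip() is deliberately left uncounted to keep the example short.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

/** Hypothetical stand-in for the CountingInputStream in the post. */
final class LimitedInputStream extends FilterInputStream {
    private final long limit;   // maximum number of bytes allowed through
    private long transferred;   // bytes read so far

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count(1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // read(byte[]) in FilterInputStream delegates here, so it is counted too
        int n = super.read(buf, off, len);
        if (n > 0) {
            count(n);
        }
        return n;
    }

    private void count(long n) throws InterruptedIOException {
        transferred += n;
        if (transferred > limit) {
            throw new InterruptedIOException("read limit of " + limit + " bytes exceeded");
        }
    }

    long getTransferred() {
        return transferred;
    }
}
```

Wrapping the response body as new LimitedInputStream(method.getResponseBodyAsStream(), 20000L) would then play the role of the stream in the snippet above.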


Regards
  Nick Lothian

IMPORTANT: This e-mail, including any attachments, may contain private or confidential information. If you think you may not be the intended recipient, or if you have received this e-mail in error, please contact the sender immediately and delete all copies of this e-mail. If you are not the intended recipient, you must not reproduce any part of this e-mail or disclose its contents to any other party. This email represents the views of the individual sender, which do not necessarily reflect those of Education.au except where the sender expressly states otherwise. It is your responsibility to scan this email and any files transmitted with it for viruses or any other defects. education.au limited will not be liable for any loss, damage or consequence caused directly or indirectly by this email.

RE: Reading metadata without downloading entire file

Posted by Nick Lothian <nl...@educationau.edu.au>.

ID3v2 would be great - it appears ID3v1 is widely used in music MP3 files, but not in podcast MP3s.

Anyway, if anyone is having a similar problem, here's some code which appears to work using Apache HttpClient.

HTTP Range requests for MP3 metadata:

                HttpClient httpClient = new HttpClient();
                httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(10000);
                httpClient.getHttpConnectionManager().getParams().setSoTimeout(10000);

                String address = "http://address of mp3 file here";

                HttpMethod method = new HeadMethod();
                method.setURI(new URI(address,true));

                Header contentLengthHeader = null;
                Header acceptHeader = null;

                httpClient.executeMethod(method);
                try {
                        //System.out.println(Arrays.toString(method.getResponseHeaders()));
                        contentLengthHeader = method.getResponseHeader("Content-Length");
                        acceptHeader = method.getResponseHeader("Accept-Ranges");
                } finally {
                        method.releaseConnection();
                }

                if ((contentLengthHeader != null) && (acceptHeader != null) && "bytes".equals(acceptHeader.getValue())) {
                        long contentLength = Long.parseLong(contentLengthHeader.getValue());
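                        // An ID3v1 tag occupies the last 128 bytes of the file
                        long metaDataStartRange = contentLength - 128;
                        if (metaDataStartRange > 0) {
                                method = new GetMethod();
                                method.setURI(new URI(address,true));
                                // HTTP byte ranges are inclusive, so the last byte is at contentLength - 1
                                method.addRequestHeader("Range", "bytes=" + metaDataStartRange + "-" + (contentLength - 1));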
                                System.out.println(Arrays.toString(method.getRequestHeaders()));
                                httpClient.executeMethod(method);
                                try {
                                        Parser parser = new AutoDetectParser();

                                        Metadata metadata = new Metadata();
                                        metadata.set(Metadata.RESOURCE_NAME_KEY, address);
                                        InputStream stream = method.getResponseBodyAsStream();
                                        try {
                                                parser.parse(stream, new DefaultHandler(), metadata);
                                        } catch (Exception e) {
                                                e.printStackTrace();
                                        } finally {
                                                stream.close();
                                        }
                                        System.out.println(Arrays.toString(metadata.names()));
                                        System.out.println("Title: " + metadata.get("title"));
                                        System.out.println("Author: " + metadata.get("Author"));
                                } finally {
                                        method.releaseConnection();
                                }
                        }
                } else {
                        System.err.println("Range not supported. Headers were: ");
                        System.err.println(Arrays.toString(method.getResponseHeaders()));
                }
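
HTTP byte ranges are inclusive at both ends, so the last byte of an N-byte resource sits at position N - 1. A small sketch of how the Range value for the 128-byte ID3v1 tag could be computed on its own is given here. The class and method names (Id3v1Range, headerFor, suffixHeader) are made up for illustration. The suffix form "bytes=-128" asks for the last 128 bytes without knowing the length up front, which would make the HEAD request optional on servers that honour it.

```java
/** Hypothetical helper for building Range header values for an ID3v1 tag. */
final class Id3v1Range {
    static final int ID3V1_TAG_LENGTH = 128;

    private Id3v1Range() {
    }

    /**
     * Returns a Range value covering the last 128 bytes of a resource of the
     * given length, or null if the resource is too short to hold a tag.
     */
    static String headerFor(long contentLength) {
        if (contentLength < ID3V1_TAG_LENGTH) {
            return null;
        }
        long first = contentLength - ID3V1_TAG_LENGTH;
        long last = contentLength - 1; // inclusive end position
        return "bytes=" + first + "-" + last;
    }

    /** Suffix range: the last 128 bytes, whatever the total length is. */
    static String suffixHeader() {
        return "bytes=-" + ID3V1_TAG_LENGTH;
    }
}
```

For a 1000-byte file, headerFor(1000) yields "bytes=872-999".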


-----Original Message-----
From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
Sent: Thursday, 19 February 2009 8:44 AM
To: tika-dev@lucene.apache.org
Subject: Re: Reading metadata without downloading entire file

id3v1 is exactly 128 bytes [ http://en.wikipedia.org/wiki/ID3#Layout ]
In my copious free time, I might add id3v2 support, unless of course
someone else does.

On Feb 18, 2009, at 2:04 PM, Nick Lothian wrote:

> Well that would explain it then!
>
> Has anyone had any experience with using HTTP range requests for the
> metadata? How many bytes from the end does the metadata start?
>
> Nick
>
> -----Original Message-----
> From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
> Sent: Wednesday, 18 February 2009 5:30 PM
> To: tika-dev@lucene.apache.org
> Subject: Re: Reading metadata without downloading entire file
>
>
> You're closing the stream before the metadata arrives.
>
> Tika supports ID3v1 which is at the end of the file, not the
> beginning.
>
> On Feb 17, 2009, at 10:22 PM, Nick Lothian wrote:
>
>> I'm trying to get MP3 Metadata without downloading an entire MP3.
>>
>> I've set up a FilterInputStream which throws an
>> InterruptedIOException after a given amount of a file is downloaded.
>>
>> If I point this at an HTML page it works - I can get the title from
>> the metadata.
>>
>> If I point it at an MP3 file it doesn't give me any metadata at all
>> (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set
>> the download length to be just less than the length of the file. If
>> I download the whole file it works.
>>
>> (JPGs don't seem to work either)
>>
>> Why is this so? My understanding was that Tika would work with
>> streams?
>
>
>
> --
> Jonathan Koren
> jonathan@soe.ucsc.edu
> http://www.soe.ucsc.edu/~jonathan/
>
>
>

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/




Re: Reading metadata without downloading entire file

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.

id3v1 is exactly 128 bytes [ http://en.wikipedia.org/wiki/ID3#Layout ]
In my copious free time, I might add id3v2 support, unless of course
someone else does.
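
For reference, the 128-byte ID3v1 layout is: the marker "TAG" (3 bytes), title (30), artist (30), album (30), year (4), comment (30) and a genre byte (1). A minimal sketch of decoding such a block by hand follows. It is not Tika's parser; the class name Id3v1Tag is hypothetical, and ISO-8859-1 is assumed as the text encoding because ID3v1 does not specify one.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical decoder for a raw 128-byte ID3v1 block (not Tika's implementation). */
final class Id3v1Tag {
    final String title;
    final String artist;
    final String album;
    final String year;
    final String comment;
    final int genre;

    private Id3v1Tag(String title, String artist, String album,
                     String year, String comment, int genre) {
        this.title = title;
        this.artist = artist;
        this.album = album;
        this.year = year;
        this.comment = comment;
        this.genre = genre;
    }

    /** Returns the decoded tag, or null if the block is not a valid ID3v1 tag. */
    static Id3v1Tag parse(byte[] tail) {
        if (tail == null || tail.length != 128) {
            return null;
        }
        if (tail[0] != 'T' || tail[1] != 'A' || tail[2] != 'G') {
            return null; // marker missing: no ID3v1 tag here
        }
        return new Id3v1Tag(
                field(tail, 3, 30),   // title
                field(tail, 33, 30),  // artist
                field(tail, 63, 30),  // album
                field(tail, 93, 4),   // year
                field(tail, 97, 30),  // comment
                tail[127] & 0xFF);    // genre index as an unsigned byte
    }

    /** Reads a fixed-width field, stopping at the first NUL and trimming space padding. */
    private static String field(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) {
            end++;
        }
        // Assumption: ISO-8859-1, since ID3v1 does not define an encoding
        return new String(b, off, end - off, StandardCharsets.ISO_8859_1).trim();
    }
}
```

The byte array passed to parse would be the final 128 bytes of the file, for example the body of a ranged HTTP response.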

On Feb 18, 2009, at 2:04 PM, Nick Lothian wrote:

> Well that would explain it then!
>
> Has anyone had any experience with using HTTP range requests for the
> metadata? How many bytes from the end does the metadata start?
>
> Nick
>
> -----Original Message-----
> From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
> Sent: Wednesday, 18 February 2009 5:30 PM
> To: tika-dev@lucene.apache.org
> Subject: Re: Reading metadata without downloading entire file
>
>
> You're closing the stream before the metadata arrives.
>
> Tika supports ID3v1 which is at the end of the file, not the  
> beginning.
>
> On Feb 17, 2009, at 10:22 PM, Nick Lothian wrote:
>
>> I'm trying to get MP3 Metadata without downloading an entire MP3.
>>
>> I've set up a FilterInputStream which throws an
>> InterruptedIOException after a given amount of a file is downloaded.
>>
>> If I point this at an HTML page it works - I can get the title from
>> the metadata.
>>
>> If I point it at an MP3 file it doesn't give me any metadata at all
>> (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set
>> the download length to be just less than the length of the file. If
>> I download the whole file it works.
>>
>> (JPGs don't seem to work either)
>>
>> Why is this so? My understanding was that Tika would work with
>> streams?
>
>
>
> --
> Jonathan Koren
> jonathan@soe.ucsc.edu
> http://www.soe.ucsc.edu/~jonathan/
>
>
>

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

RE: Reading metadata without downloading entire file

Posted by Nick Lothian <nl...@educationau.edu.au>.

Well that would explain it then!

Has anyone had any experience with using HTTP range requests for the metadata? How many bytes from the end does the metadata start?

Nick

-----Original Message-----
From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
Sent: Wednesday, 18 February 2009 5:30 PM
To: tika-dev@lucene.apache.org
Subject: Re: Reading metadata without downloading entire file

You're closing the stream before the metadata arrives.

Tika supports ID3v1 which is at the end of the file, not the beginning.

On Feb 17, 2009, at 10:22 PM, Nick Lothian wrote:

> I'm trying to get MP3 Metadata without downloading an entire MP3.
>
> I've set up a FilterInputStream which throws an
> InterruptedIOException after a given amount of a file is downloaded.
>
> If I point this at an HTML page it works - I can get the title from
> the metadata.
>
> If I point it at an MP3 file it doesn't give me any metadata at all
> (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set
> the download length to be just less than the length of the file. If
> I download the whole file it works.
>
> (JPGs don't seem to work either)
>
> Why is this so? My understanding was that Tika would work with
> streams?

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/


Re: Reading metadata without downloading entire file

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.

You're closing the stream before the metadata arrives.

Tika supports ID3v1 which is at the end of the file, not the beginning.

On Feb 17, 2009, at 10:22 PM, Nick Lothian wrote:

> I'm trying to get MP3 Metadata without downloading an entire MP3.
>
> I've set up a FilterInputStream which throws an
> InterruptedIOException after a given amount of a file is downloaded.
>
> If I point this at an HTML page it works - I can get the title from
> the metadata.
>
> If I point it at an MP3 file it doesn't give me any metadata at all
> (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set
> the download length to be just less than the length of the file. If
> I download the whole file it works.
>
> (JPGs don't seem to work either)
>
> Why is this so? My understanding was that Tika would work with  
> streams?



--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/