Posted to user@nutch.apache.org by ma...@Automationdirect.com on 2013/03/04 22:29:01 UTC

Parsing error for video wmv files

Hi,

I am using Nutch 1.5.1 and am trying to crawl and parse video/mp4 and video/x-ms-wmv files. I do not see any mp4 files being fetched or parsed, and I am getting the following error in the logs for a wmv file:

Error parsing: http://www.server-abc.com/Darpa_Video_Final.wmv: failed(2,0): Can't retrieve Tika parser for mime-type video/x-ms-wmv

Here is my regex-urlfilter.txt configuration file:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
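
One way to check whether this exclusion rule is what keeps the mp4 and wmv URLs out is to run the pattern against a few sample URLs. The sketch below uses Python's re module rather than Nutch's own Java-based filter, and it assumes the rule is meant to begin with an escaped dot before the extension group. Apart from the URL taken from the log line, the sample URLs are made up, and is_excluded is a hypothetical helper.

```python
import re

# Suffix-exclusion rule from regex-urlfilter.txt, without the leading "-"
# (the "-" marks a rule whose matches are rejected).
# Assumption: the rule starts with an escaped dot before the extension group.
EXCLUDED_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|"
    r"jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_excluded(url):
    """Return True if the suffix rule would reject this URL (hypothetical helper)."""
    return EXCLUDED_SUFFIXES.search(url) is not None


sample_urls = [
    "http://www.server-abc.com/Darpa_Video_Final.wmv",  # from the log line
    "http://www.server-abc.com/example.mp4",            # made-up example
    "http://www.server-abc.com/example.mov",            # made-up example
    "http://www.server-abc.com/logo.png",               # made-up example
]

for url in sample_urls:
    print("rejected" if is_excluded(url) else "passes", url)
```

Under these assumptions the .wmv and .mp4 URLs pass this particular rule, so it would not by itself explain the missing mp4 fetches. Other rules in the file, or other URL filters, could still apply.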

parse-plugins.xml has the following:

<mimeType name="video/x-ms-wmv">
   <plugin id="parse-tika" />
</mimeType>

<mimeType name="video/mp4">
   <plugin id="parse-tika" />
</mimeType>
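
For illustration, the mapping these entries express can be read back with a few lines of Python. This is only a sketch of the lookup idea, not how Nutch loads the file. The parse-plugins wrapper element is added here so that the fragment parses as a single document, and load_mappings and plugins_for are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# The two entries above, wrapped in a root element so they parse as one document.
FRAGMENT = """
<parse-plugins>
  <mimeType name="video/x-ms-wmv">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="video/mp4">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""


def load_mappings(xml_text):
    """Build a dict of MIME type -> list of plugin ids (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        mime.get("name"): [plugin.get("id") for plugin in mime.findall("plugin")]
        for mime in root.findall("mimeType")
    }


def plugins_for(mappings, mime_type):
    """Return the plugin ids mapped to a MIME type, or an empty list."""
    return mappings.get(mime_type, [])


mappings = load_mappings(FRAGMENT)
print(plugins_for(mappings, "video/x-ms-wmv"))
print(plugins_for(mappings, "video/quicktime"))
```

The entries only route a MIME type to the parse-tika plugin. Whether Tika itself has a parser for that type is a separate question, which is what the "Can't retrieve Tika parser" message appears to point at.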

Is there anything else I need to check, or anything I am missing? Does the http.accept property need to list all the MIME types that can be accepted? I am going to try adding them once my current crawl finishes. Any help will be greatly appreciated.

Thanks,
Madhvi
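
On the http.accept question: as far as I know, that property supplies the value of the HTTP Accept request header, and a header that contains a wildcard range such as */* already covers video types without listing each one. The sketch below is a simplified matcher for illustration only. The header value is an example, not necessarily what a given Nutch configuration contains, and accepted is a hypothetical helper.

```python
# Example Accept header value (an assumption for illustration; check the
# actual http.accept value in your own configuration).
EXAMPLE_ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def accepted(accept_header, mime_type):
    """Return True if any media range in the header covers mime_type with q > 0.

    Simplified: ignores precedence between specific and wildcard ranges.
    """
    wanted_type, wanted_subtype = mime_type.lower().split("/", 1)
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if "/" not in media_range:
            continue  # skip malformed entries
        q = 1.0  # default quality value when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q <= 0:
            continue  # q=0 means "not acceptable"
        range_type, range_subtype = media_range.split("/", 1)
        if range_type in ("*", wanted_type) and range_subtype in ("*", wanted_subtype):
            return True
    return False


print(accepted(EXAMPLE_ACCEPT, "video/x-ms-wmv"))
print(accepted("text/html", "video/mp4"))
```

The Accept header only tells the server what the client prefers, so it is a hint for content negotiation and is separate from whether a parser exists for the fetched content.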



Re: Parsing error for video wmv files

Posted by ma...@Automationdirect.com.
Thanks, Tejas. I am trying with the tags in the video files.

On 3/6/13 12:27 PM, "Tejas Patil" <te...@gmail.com> wrote:

>I am not aware of any Java library that you can use for parsing wmv.
>Nutch currently has a parser for swf and mostly delegates parsing to Tika.
>Typically, video files are not crawled by search engines; only their meta
>information is useful.


Re: Parsing error for video wmv files

Posted by Tejas Patil <te...@gmail.com>.
I am not aware of any Java library that you can use for parsing wmv. Nutch
currently has a parser for swf and mostly delegates parsing to Tika.
Typically, video files are not crawled by search engines; only their meta
information is useful.


On Wed, Mar 6, 2013 at 7:51 AM, <ma...@automationdirect.com> wrote:

> Thank you so much, Tejas. That explains the wmv parsing error. I thought
> that video/mp4 could be handled as Adobe Flash content, but I am not sure.
> I am checking with our company's media expert. Since Tika only parses
> Flash files, is there any other plugin available that we can use?

Re: Parsing error for video wmv files

Posted by ma...@Automationdirect.com.
Thank you so much, Tejas. That explains the wmv parsing error. I thought
that video/mp4 could be handled as Adobe Flash content, but I am not sure.
I am checking with our company's media expert. Since Tika only parses
Flash files, is there any other plugin available that we can use?

On 3/4/13 11:04 PM, "Tejas Patil" <te...@gmail.com> wrote:

>[0] says that Tika 1.2 can parse only Flash video and no other video file
>formats.
>
>[0] : http://tika.apache.org/1.2/formats.html#Video_formats


Re: Parsing error for video wmv files

Posted by Tejas Patil <te...@gmail.com>.
[0] says that Tika 1.2 can parse only Flash video and no other video file
formats.

[0] : http://tika.apache.org/1.2/formats.html#Video_formats

