Posted to dev@tika.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/06/29 21:19:50 UTC
[jira] Commented: (TIKA-448) Tika FLVParser hangs

    [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660 ] 

Julien Nioche commented on TIKA-448:
------------------------------------

I have seen similar cases with FLV when the content fetched by Nutch had been trimmed. Setting the log level to debug should give you more information about which URL is problematic.
One simple workaround for cases like these (apart from filtering on *.flv, of course) is to use the skip record options in Hadoop:

{code}
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  ./nutch parse $commonOptions $skipRecordsOptions $hdfspath/segments/$SEGMENT
{code}

This will skip the problematic entries after a couple of retries.

Of course, preventing the FLV parser from looping would be even better. I'll see if I can reproduce the problem later.
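
As a rough illustration of how the skip loop quoted in the report below could be made to terminate, here is a minimal sketch. It assumes the hang comes from InputStream.skip() returning 0 on truncated content, which has not been confirmed. The class and method names (SkipFully, skipFully) are hypothetical and are not part of Tika; this is not the actual fix. When skip() makes no progress, the sketch falls back to a single read() to tell "end of stream" apart from "skip could not advance right now".

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: a skip loop that cannot spin forever.
public class SkipFully {

    /**
     * Skips exactly n bytes of the stream, or throws EOFException
     * if the stream ends before n bytes could be skipped.
     */
    public static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms end of stream
                throw new EOFException(remaining + " bytes left to skip at end of stream");
            } else {
                // read() consumed one byte, so progress was made
                remaining--;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate truncated content: 10 bytes available, 100 announced.
        InputStream in = new ByteArrayInputStream(new byte[10]);
        try {
            skipFully(in, 100);
        } catch (EOFException e) {
            System.out.println("Truncated input detected: " + e.getMessage());
        }
    }
}
```

With a helper like this, a truncated file would surface as an exception that the caller can handle or log, rather than as a parse thread stuck in RUNNABLE state.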

> Tika FLVParser hangs
> --------------------
>
>                 Key: TIKA-448
>                 URL: https://issues.apache.org/jira/browse/TIKA-448
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Linux JDK 1.6u13, Nutch 1.1
>            Reporter: Jeroen van Vianen
>
> I am crawling a site with Nutch and creating an index using SOLR.
> After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:
> "Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.FilterInputStream.skip(FilterInputStream.java:125)
>         at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> The only reason I see why the code might be stuck there is when skip(datalen - skiplen) returns 0 for whatever reason in org.apache.tika.parser.video.FLVParser.parse around line 246:
>                 // Tag was not metadata, skip over data we cannot handle
>                 for (int skiplen = 0; skiplen < datalen;) {
>                     long currentSkipLen = datainput.skip(datalen - skiplen);
>                     skiplen += currentSkipLen;
>                 }
> As I don't know which FLV was downloaded that caused the problem I cannot easily create a testcase.
