Posted to dev@tika.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/29 19:23:51 UTC

[jira] Created: (TIKA-448) Tika FLVParser hangs

Tika FLVParser hangs
--------------------

                 Key: TIKA-448
                 URL: https://issues.apache.org/jira/browse/TIKA-448
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Linux JDK 1.6u13, Nutch 1.1
            Reporter: Jeroen van Vianen


I am crawling a site with Nutch and creating an index using SOLR.

After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:

"Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FilterInputStream.skip(FilterInputStream.java:125)
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The only explanation I can see for the code being stuck there is that skip(datalen - skiplen) keeps returning 0, for whatever reason, in org.apache.tika.parser.video.FLVParser.parse around line 246:

                // Tag was not metadata, skip over data we cannot handle
                for (int skiplen = 0; skiplen < datalen;) {
                    long currentSkipLen = datainput.skip(datalen - skiplen);
                    skiplen += currentSkipLen;
                }

As I don't know which FLV is downloaded that caused the problem, I cannot easily create a test case.
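
A minimal sketch of the suspected failure mode, assuming the FLV content was cut short so that the tag header announces more data than the stream holds. The class name SkipStallDemo, the method boundedSkip and the maxZeroSkips limit are hypothetical names for illustration and are not part of Tika. The loop has the same shape as the one quoted above, except that it gives up after a few consecutive zero-byte skips instead of spinning forever:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SkipStallDemo {

    /**
     * Same skip loop as in the report, but bounded: stops after
     * maxZeroSkips consecutive skip() calls that made no progress.
     * Returns the number of bytes actually skipped.
     */
    static long boundedSkip(InputStream in, int datalen, int maxZeroSkips)
            throws IOException {
        long skiplen = 0;
        int zeroSkips = 0;
        while (skiplen < datalen && zeroSkips < maxZeroSkips) {
            long currentSkipLen = in.skip(datalen - skiplen);
            if (currentSkipLen > 0) {
                skiplen += currentSkipLen;
                zeroSkips = 0;
            } else {
                // No progress; the unbounded loop would retry forever here.
                zeroSkips++;
            }
        }
        return skiplen;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a tag that claims 100 bytes of data, of which only 10 exist.
        InputStream datainput =
                new DataInputStream(new ByteArrayInputStream(new byte[10]));
        long skipped = boundedSkip(datainput, 100, 3);
        System.out.println("skipped " + skipped + " of 100 bytes");
    }
}
```

With an in-memory stream like this one, skip() returns 0 once the data is exhausted, so the unbounded version of the loop would never terminate. Whether the stream Nutch hands to Tika behaves the same way is an assumption.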

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-448) Tika FLVParser hangs

Posted by "Jeroen van Vianen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen van Vianen updated TIKA-448:
-----------------------------------

    Description: 
I am crawling a site with Nutch and creating an index using SOLR.

After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:

"Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FilterInputStream.skip(FilterInputStream.java:125)
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The only explanation I can see for the code being stuck there is that skip(datalen - skiplen) keeps returning 0, for whatever reason, in org.apache.tika.parser.video.FLVParser.parse around line 246:

                // Tag was not metadata, skip over data we cannot handle
                for (int skiplen = 0; skiplen < datalen;) {
                    long currentSkipLen = datainput.skip(datalen - skiplen);
                    skiplen += currentSkipLen;
                }

As I don't know which FLV was downloaded that caused the problem, I cannot easily create a test case.



[jira] Updated: (TIKA-448) Tika FLVParser hangs

Posted by "Jeroen van Vianen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen van Vianen updated TIKA-448:
-----------------------------------

    Attachment: FLVParser.patch

I patched my tika-parsers.jar with the attached patch. The hang no longer occurs, but I'm not sure whether that is because the patch fixes the problem or because the offending (corrupt?) FLV was simply not fetched during my current Nutch run.

@Julien: thanks for the tip


[jira] Commented: (TIKA-448) Tika FLVParser hangs

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883858#action_12883858 ] 

Jukka Zitting commented on TIKA-448:
------------------------------------

The InputStream.skip() method is allowed to return 0 at any time; see IO-203 for related discussion.

It might be easiest to simply always read() the tag content into memory instead of trying to skip() it. The performance and memory overhead shouldn't be too high.
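
A rough sketch of that idea, assuming a small fixed buffer is enough so that the whole tag does not have to be held in memory at once. The class name TagDiscardDemo, the method discardFully and the 4096-byte buffer size are hypothetical and not taken from Tika or from the attached patch. Because read() with a non-zero length either returns at least one byte or signals end of stream with -1, the loop below cannot spin the way the skip() loop can:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class TagDiscardDemo {

    /**
     * Consumes exactly datalen bytes by reading them into a scratch buffer.
     * Throws EOFException if the stream ends before datalen bytes were read.
     */
    static void discardFully(InputStream in, int datalen) throws IOException {
        byte[] buffer = new byte[4096];
        int remaining = datalen;
        while (remaining > 0) {
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                // Truncated input: fail fast instead of looping.
                throw new EOFException(
                        "Stream ended with " + remaining + " bytes of tag data left");
            }
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Complete tag data: returns normally.
        discardFully(new ByteArrayInputStream(new byte[10]), 10);

        // Assumption: truncated tag data, 100 bytes announced but only 10 present.
        try {
            discardFully(new ByteArrayInputStream(new byte[10]), 100);
        } catch (EOFException e) {
            System.out.println("truncated input detected: " + e.getMessage());
        }
    }
}
```

The caller can then decide whether a truncated tag should abort parsing or simply end metadata extraction early.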


[jira] Commented: (TIKA-448) Tika FLVParser hangs

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660 ] 

Julien Nioche commented on TIKA-448:
------------------------------------

I have seen similar cases with FLV when the content fetched by Nutch had been trimmed. Setting the log level to debug should give you more information about which URL is problematic.
One simple workaround for cases like these (apart from filtering on *.flv, of course) is to use the skip record options in Hadoop:

{code}
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  ./nutch parse $commonOptions $skipRecordsOptions $hdfspath/segments/$SEGMENT
{code}

This will skip the problematic entries after a couple of retries.

Of course, preventing the FLV parser from looping would be even better. I'll see if I can reproduce the problem later.
