Posted to dev@tika.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/29 19:23:51 UTC
[jira] Created: (TIKA-448) Tika FLVParser hangs
Tika FLVParser hangs
--------------------
Key: TIKA-448
URL: https://issues.apache.org/jira/browse/TIKA-448
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.7
Environment: Linux JDK 1.6u13, Nutch 1.1
Reporter: Jeroen van Vianen
I am crawling a site with Nutch and creating an index using SOLR.
After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:
"Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FilterInputStream.skip(FilterInputStream.java:125)
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
The only way I can see the code getting stuck there is if skip(datalen - skiplen) keeps returning 0, for whatever reason, so that the loop in org.apache.tika.parser.video.FLVParser.parse (around line 246) never terminates:
// Tag was not metadata, skip over data we cannot handle
for (int skiplen = 0; skiplen < datalen;) {
    long currentSkipLen = datainput.skip(datalen - skiplen);
    skiplen += currentSkipLen;
}
As I don't know which FLV is downloaded that caused the problem I cannot easily create a testcase.
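To illustrate the suspected failure mode, here is a minimal Java sketch of a skip loop that cannot spin forever. It is not Tika's code: the class SkipFullySketch and the method skipFully are hypothetical names, and the 10-byte stream merely stands in for a truncated FLV. The idea is that when skip() reports no progress, a single read() tells a slow stream apart from end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Tika.
final class SkipFullySketch {

    // Skips exactly n bytes, or throws EOFException if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() >= 0) {
                // skip() made no progress, but one byte could still be read.
                remaining--;
            } else {
                // read() returned -1: end of stream, so stop instead of looping.
                throw new EOFException(
                        "stream ended with " + remaining + " bytes left to skip");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a truncated file: 10 bytes available, 100 requested.
        InputStream truncated = new ByteArrayInputStream(new byte[10]);
        try {
            skipFully(truncated, 100);
        } catch (EOFException e) {
            System.out.println("truncated input detected: " + e.getMessage());
        }
    }
}
```

With the original loop, the same truncated input would keep calling skip() and adding 0 to skiplen indefinitely.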
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-448) Tika FLVParser hangs
Posted by "Jeroen van Vianen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeroen van Vianen updated TIKA-448:
-----------------------------------
Description:
I am crawling a site with Nutch and creating an index using SOLR.
After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:
"Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FilterInputStream.skip(FilterInputStream.java:125)
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
The only way I can see the code getting stuck there is if skip(datalen - skiplen) keeps returning 0, for whatever reason, so that the loop in org.apache.tika.parser.video.FLVParser.parse (around line 246) never terminates:
// Tag was not metadata, skip over data we cannot handle
for (int skiplen = 0; skiplen < datalen;) {
    long currentSkipLen = datainput.skip(datalen - skiplen);
    skiplen += currentSkipLen;
}
As I don't know which FLV was downloaded that caused the problem I cannot easily create a testcase.
[jira] Updated: (TIKA-448) Tika FLVParser hangs
Posted by "Jeroen van Vianen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeroen van Vianen updated TIKA-448:
-----------------------------------
Attachment: FLVParser.patch
I patched my tika-parsers.jar with the attached patch. The hang has not recurred, but I'm not sure whether that is because the patch fixes the problem or because the offending (corrupt?) FLV was simply not fetched during my current Nutch run.
@Julien: thanks for the tip
[jira] Commented: (TIKA-448) Tika FLVParser hangs
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883858#action_12883858 ]
Jukka Zitting commented on TIKA-448:
------------------------------------
InputStream.skip() is allowed to return 0 at any time, so a loop that waits for it to make progress can spin forever; see IO-203 for related discussion.
It might be easiest to simply always read() the tag content into memory instead of trying to skip() it. The performance and memory overhead shouldn't be too high.
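A minimal Java sketch of that read-instead-of-skip idea follows. It is an illustration, not the committed fix: ReadTagSketch and readTagData are hypothetical names, and it assumes the parser's datainput is a java.io.DataInputStream (consistent with FilterInputStream.skip in the stack trace, but still an assumption). DataInputStream.readFully either fills the whole buffer or throws EOFException, so a truncated file fails fast instead of hanging.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical helper for illustration only; not part of Tika.
final class ReadTagSketch {

    // Reads exactly datalen bytes of tag content into memory.
    static byte[] readTagData(DataInputStream in, int datalen) throws IOException {
        if (datalen < 0) {
            throw new IOException("negative tag length: " + datalen);
        }
        byte[] data = new byte[datalen];
        // readFully blocks until the buffer is full and throws EOFException
        // if the stream ends first, so it never loops without progress.
        in.readFully(data);
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a truncated tag: 2 bytes available, 5 declared.
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(new byte[2]));
        try {
            readTagData(in, 5);
        } catch (EOFException e) {
            System.out.println("tag content is truncated");
        }
    }
}
```

The buffer is allocated per tag and can be discarded immediately for tags the parser does not handle, which is the memory cost the comment above refers to.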
[jira] Commented: (TIKA-448) Tika FLVParser hangs
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660 ]
Julien Nioche commented on TIKA-448:
------------------------------------
I have seen similar cases with FLV when the content fetched by Nutch had been trimmed. Setting the log level to debug should give you more information about which URL is problematic.
One simple workaround for cases like these (apart from filtering out *.flv, of course) is to use the skip-record options in Hadoop:
{code}
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
./nutch parse $commonOptions $skipRecordsOptions $hdfspath/segments/$SEGMENT
{code}
This will skip the problematic entries after a couple of retries.
Of course, preventing the FLV parser from looping would be even better. I'll see if I can reproduce the problem later.