Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2016/05/21 05:59:12 UTC
[jira] [Comment Edited] (CONNECTORS-1317) Hang crawling job on some ZIP documents
[ https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294748#comment-15294748 ]
Karl Wright edited comment on CONNECTORS-1317 at 5/21/16 5:58 AM:
------------------------------------------------------------------
I have verified that the attached file is successfully extracted by Tika on a trunk build:
{code}
Start Time              | Activity                   | Identifier                      | Result Code | Bytes | Time | Result Description
05-21-2016 01:56:01.173 | output notification (null) |                                 | OK          | 0     | 1    |
05-21-2016 01:55:51.176 | job end                    | 1463810108510(test)             |             | 0     | 1    |
05-21-2016 01:55:43.736 | document ingest (null)     | file:/C:/testdata/something.zip | OK          | 364   | 1    |
05-21-2016 01:55:43.504 | extract [tika]             | file:/C:/testdata/something.zip | OK          | 364   | 216  |
05-21-2016 01:55:43.233 | read document              | C:\testdata\something.zip       | OK          | 19806 | 507  |
05-21-2016 01:55:41.211 | read document              | C:\testdata                     | OK          | 0     | 1    |
05-21-2016 01:55:41.133 | job start                  | 1463810108510(test)             |             | 0     | 1    |
{code}
I therefore strongly suggest you check out trunk and build it. Instructions are provided on the "how to build and deploy" page on the web site. Please let me know if this works for you.
> Hang crawling job on some ZIP documents
> ---------------------------------------
>
> Key: CONNECTORS-1317
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector
> Affects Versions: ManifoldCF 2.3
> Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
> Reporter: Mr.Keuz
> Assignee: Karl Wright
>
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some ZIP files, although other files are parsed normally.
> Steps:
> 1. Run ManifoldCF with "example/start.sh", using Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the ZIP file (attached below) in the crawled folder
> 4. Run the job
> Here is a ZIP file that should reproduce the bug:
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG
> Note:
> As far as I could tell from strace, the crawler process tries to open and parse the same ZIP file again and again (apparently from different worker threads). It also seems that the document is not removed from the queue.
> I am a newbie with ManifoldCF, so it is hard for me to find the problem in the source code.
> I can send additional info if needed.
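
For anyone who wants to confirm the "same ZIP file opened again and again" observation from the strace note, here is a minimal sketch in Python. It is not part of ManifoldCF. The function names, the sample paths, and the assumption that the captured log contains lines of the form open("path", ...) or openat(AT_FDCWD, "path", ...) are all illustrative, so adjust the pattern if your strace output looks different.

```python
import re
from collections import Counter

# Assumed strace line shapes (illustrative, adjust for your own capture):
#   [pid 101] open("/data/a.zip", O_RDONLY) = 5
#   [pid 102] openat(AT_FDCWD, "/data/a.zip", O_RDONLY) = 6
OPEN_CALL = re.compile(r'\bopen(?:at)?\((?:[^,"]*,\s*)?"([^"]+)"')


def count_opens(strace_lines):
    """Return a Counter mapping each opened path to its number of open calls."""
    counts = Counter()
    for line in strace_lines:
        match = OPEN_CALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_opens(strace_lines, threshold=2):
    """Return only the paths opened at least `threshold` times."""
    return {
        path: n
        for path, n in count_opens(strace_lines).items()
        if n >= threshold
    }


if __name__ == "__main__":
    # Hypothetical sample lines; replace with the lines of a real strace log.
    sample = [
        '[pid 101] open("/data/a.zip", O_RDONLY) = 5',
        '[pid 102] openat(AT_FDCWD, "/data/a.zip", O_RDONLY) = 6',
        '[pid 101] open("/etc/hosts", O_RDONLY) = 7',
    ]
    for path, n in repeated_opens(sample).items():
        print(path, n)
```

A path that keeps climbing in this count while the job never finishes would be consistent with the document being retried rather than removed from the queue, though the count alone does not show why.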