Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2016/05/21 05:59:12 UTC
[jira] [Comment Edited] (CONNECTORS-1317) Hang crawling job on some ZIP documents

    [ https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294748#comment-15294748 ] 

Karl Wright edited comment on CONNECTORS-1317 at 5/21/16 5:58 AM:
------------------------------------------------------------------

I have verified that the attached file is successfully extracted by Tika on a trunk build:

{code}
Start Time	Activity	Identifier	Result Code	Bytes	Time	Result Description
05-21-2016 01:56:01.173	output notification (null)		OK	0	1	
05-21-2016 01:55:51.176	job end	1463810108510(test)		0	1
05-21-2016 01:55:43.736	document ingest (null)	file:/C:/testdata/something.zip	OK	364	1
05-21-2016 01:55:43.504	extract [tika]	file:/C:/testdata/something.zip	OK	364	216
05-21-2016 01:55:43.233	read document	C:\testdata\something.zip	OK	19806	507
05-21-2016 01:55:41.211	read document	C:\testdata	OK	0	1
05-21-2016 01:55:41.133	job start	1463810108510(test)		0	1
{code}

I therefore strongly suggest you check out trunk and build it.  Instructions are provided on the "how to build and deploy" page on the web site.  Please let me know if this works for you.





was (Author: kwright@metacarta.com):
I have verified that the attached file is successfully extracted by Tika on a trunk build:

{code}
Start Time	Activity	Identifier	Result Code	Bytes	Time	Result Description
05-21-2016 01:56:01.173	output notification (null)		OK	0	1	
05-21-2016 01:55:51.176	job end	1463810108510(test)		0	1
05-21-2016 01:55:43.736	document ingest (null)	file:/C:/testdata/something.zip	OK	364	1
05-21-2016 01:55:43.504	extract [tika]	file:/C:/testdata/something.zip	OK	364	216
05-21-2016 01:55:43.233	read document	C:\testdata\something.zip	OK	19806	507
05-21-2016 01:55:41.211	read document	C:\testdata	OK	0	1
05-21-2016 01:55:41.133	job start	1463810108510(test)		0	1
{code}

I therefore strongly suggest you check out trunk and build it.  Instructions are provided on the "how to build and deploy" page on the web site.




> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some ZIP files, although other files are parsed normally.
> Steps:
> 1. Run ManifoldCF with "example/start.sh" and Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the ZIP file (attached below) in the folder
> 4. Run the job
> Here is the ZIP file that should reproduce the bug:
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As far as I could tell (using strace), the crawler process tries to open and parse the same ZIP file again and again (apparently from different worker threads). It also seems that the document is not removed from the queue.
> I am a newbie with ManifoldCF, so it is hard for me to find the problem in the source code.
> I can send additional info if needed.
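
To help separate "the archive itself cannot be read" from "the crawler keeps re-queuing the document", the ZIP file from the report can be read to the end outside ManifoldCF. The following is only a minimal sketch under stated assumptions: it uses java.util.zip from the Java standard library, not the Tika parser that the pipeline above uses, so a clean run here does not prove that Tika extraction succeeds. The class name ZipReadCheck, the default file name and the 30-second timeout are hypothetical choices for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of ManifoldCF or Tika.
public class ZipReadCheck {

    /** Reads every entry of the archive to the end and returns the number of entries. */
    public static int countEntries(Path zipPath) throws IOException {
        int entries = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(zipPath);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Drain the entry so that corrupt or truncated data is actually read.
                while (zip.read(buffer) != -1) {
                    // discard the bytes; only completion matters here
                }
                entries++;
            }
        }
        return entries;
    }

    /** Runs countEntries on a daemon thread and gives up after timeoutSeconds. */
    public static int countEntriesWithTimeout(Path zipPath, long timeoutSeconds) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // do not keep the JVM alive if the read never returns
            return t;
        });
        try {
            Future<Integer> result = executor.submit(() -> countEntries(zipPath));
            return result.get(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // The default file name is a placeholder; pass the real path as the first argument.
        Path zipPath = Paths.get(args.length > 0 ? args[0] : "something.zip");
        try {
            int n = countEntriesWithTimeout(zipPath, 30);
            System.out.println("Read " + n + " entries from " + zipPath);
        } catch (TimeoutException e) {
            System.out.println("Timed out reading " + zipPath);
        }
    }
}
```

If this check completes quickly, the repeated opens seen in strace would more likely point at the extraction or queuing stage than at basic ZIP decoding, though that remains an inference rather than a confirmed diagnosis.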


