Posted to dev@nutch.apache.org by "Mark Aragon (Jira)" <ji...@apache.org> on 2019/10/06 13:17:00 UTC
[jira] [Updated] (NUTCH-2742) Unable to parse specific pdf file
[ https://issues.apache.org/jira/browse/NUTCH-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Aragon updated NUTCH-2742:
-------------------------------
Description:
It appears that the Tika plugin is not parsing some PDF files.
An example is "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"
When I completed a dump of the segment data, there was no content:
```
Recno:: 0
URL:: [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Oct 07 00:00:37 AEDT 2019
Modified time: Thu Jan 01 10:00:00 AEST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
_ngt_=1570366841510
```
was:
It appears that the Tika plugin is not parsing some PDF files.
An example is "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"
When I completed a dump of the
```
Recno:: 0
URL:: https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Oct 07 00:00:37 AEDT 2019
Modified time: Thu Jan 01 10:00:00 AEST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
_ngt_=1570366841510
```
> Unable to parse specific pdf file
> ---------------------------------
>
> Key: NUTCH-2742
> URL: https://issues.apache.org/jira/browse/NUTCH-2742
> Project: Nutch
> Issue Type: Bug
> Components: nutchNewbie, parser
> Affects Versions: 1.15
> Reporter: Mark Aragon
> Priority: Minor
>
> It appears that the Tika plugin is not parsing some PDF files.
> An example is "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"
> When I completed a dump of the segment data, there was no content:
>
> ```
> Recno:: 0
> URL:: [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Mon Oct 07 00:00:37 AEDT 2019
> Modified time: Thu Jan 01 10:00:00 AEST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1570366841510
> ```
--
This message was sent by Atlassian Jira
(v8.3.4#803005)