Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/28 15:35:35 UTC
[jira] Created: (NUTCH-290) parse-pdf: Garbage (?) indexed when
text-extraction now allowed
parse-pdf: Garbage (?) indexed when text-extraction now allowed
---------------------------------------------------------------
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.
Example-PDF:
http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ]
Stefan Neufeind commented on NUTCH-290:
---------------------------------------
The plugin itself, imho, works fine now: it no longer throws an exception and, if extraction is allowed, outputs the text correctly.
However, I still get the "garbage output" from a PDF. Could that be because, when no extraction is allowed (empty parse text returned), the parser still falls back to indexing the raw text?
What I did was delete crawl_parse and parse_* from the segments directory, run "nutch parse", and reindex everything. However, the raw chars in the search output (summary) remain. :-((
> parse-pdf: Garbage indexed when text-extraction not allowed
> -----------------------------------------------------------
>
> Key: NUTCH-290
> URL: http://issues.apache.org/jira/browse/NUTCH-290
> Project: Nutch
> Type: Bug
> Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]
Stefan Groschupf commented on NUTCH-290:
----------------------------------------
If a parser throws an exception:
Fetcher, 261:
    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }
then we use the empty parse object, and an empty parse contains no text at all; see getText:
    private static class EmptyParseImpl implements Parse {
      private ParseData data = null;

      public EmptyParseImpl(ParseStatus status, Configuration conf) {
        data = new ParseData(status, "", new Outlink[0],
            new Metadata(), new Metadata());
        data.setConf(conf);
      }

      public ParseData getData() {
        return data;
      }

      public String getText() {
        return "";
      }
    }
So the problem should be somewhere else.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ]
Stefan Neufeind commented on NUTCH-290:
---------------------------------------
But as I understand the plugin, it still extracts as much as possible (metadata) from the PDF. So if text extraction is not allowed but the document is a PDF, then returning empty text as the document body should be fine, shouldn't it? Nothing except a PDF plugin will be able to handle a PDF correctly in this case.
Stefan G., can you point out why I see binary data as the summary for a PDF, and whether there is a possible fix for it in the context of this bug?
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ]
Stefan Groschupf commented on NUTCH-290:
----------------------------------------
As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all.
So to solve this problem, the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?
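A minimal sketch of the behaviour described in this comment, not the actual Nutch code. All names here (ParserFallbackSketch, Result, Parser, parseWithFallback, textForIndex) are hypothetical stand-ins, and the sketch assumes the fallback chain works as described above: a failed status hands the content to the next parser, while an exception skips the chain and ends in an empty parse.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the Nutch classes.
class ParserFallbackSketch {

    /** Outcome of one parse attempt: a success flag plus the extracted text. */
    static final class Result {
        final boolean success;
        final String text;

        Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    interface Parser {
        Result parse(byte[] content) throws Exception;
    }

    // A PDF parser that reports failure through its returned status.
    static final Parser PDF_REPORTS_FAILURE = c -> new Result(false, "");

    // A PDF parser that throws instead of returning a failed status.
    static final Parser PDF_THROWS = c -> {
        throw new IOException("You do not have permission to extract text");
    };

    // A last-resort parser that treats the raw bytes as text.
    static final Parser RAW_TEXT =
        c -> new Result(true, new String(c, StandardCharsets.ISO_8859_1));

    /** Tries each parser in order; moves on only when a parser RETURNS a failed result. */
    static Result parseWithFallback(List<Parser> parsers, byte[] content) throws Exception {
        Result last = new Result(false, "");
        for (Parser p : parsers) {
            last = p.parse(content); // an exception propagates to the caller from here
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    /** Caller in the style of the Fetcher snippet: any exception becomes an empty parse. */
    static String textForIndex(List<Parser> parsers, byte[] content) {
        try {
            Result r = parseWithFallback(parsers, content);
            return r.success ? r.text : "";
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] raw = "%PDF-1.4 raw bytes".getBytes(StandardCharsets.ISO_8859_1);
        // Failed status: the raw-text parser runs and the raw bytes end up as text.
        System.out.println("[" + textForIndex(Arrays.asList(PDF_REPORTS_FAILURE, RAW_TEXT), raw) + "]");
        // Exception: the chain is skipped and the text stays empty.
        System.out.println("[" + textForIndex(Arrays.asList(PDF_THROWS, RAW_TEXT), raw) + "]");
    }
}
```

If the real chain behaves like this sketch, a PDF parser that returns a FAILED status would let a text parser index the raw PDF bytes, which would match the garbage seen in the summary.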
[jira] Updated: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]
Stefan Neufeind updated NUTCH-290:
----------------------------------
Summary: parse-pdf: Garbage indexed when text-extraction not allowed (was: parse-pdf: Garbage (?) indexed when text-extraction now allowed)
[jira] Updated: (NUTCH-290) parse-pdf: Garbage (?) indexed when
text-extraction now allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]
Stefan Neufeind updated NUTCH-290:
----------------------------------
Attachment: NUTCH-290-canExtractContent.patch
This patch adds a check to first see whether text extraction is allowed, and only in that case tries to extract text (this prevents the above-mentioned exception and a parse failure).
Note: The line
    ((PDStandardEncryption) encDict).setCanExtractContent(true);
is imho open to discussion. It only sets a bit on "encrypted" documents. Since I've read in several places that many people seem to set this to "false" for no good reason, I believe we don't really "break encryption" with this line, and as such should try to index as much data as possible.
Does anybody have "problems" with this line? If yes, maybe it could be a config option that's false by default?
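A minimal sketch of the idea, not the patch itself: check the permission before extracting, and apply the override only when a config option asks for it. FakePdf and the overrideFlag parameter are hypothetical stand-ins (this is not the PDFBox API), and the option is assumed to be false by default, as suggested above.

```java
// Hypothetical stand-ins for illustration; this is NOT the PDFBox API or the attached patch.
class ExtractionPermissionSketch {

    /** In-memory stand-in for a PDF that carries a "may extract text" bit. */
    static final class FakePdf {
        private final String title;
        private final String body;
        private boolean canExtract;

        FakePdf(String title, String body, boolean canExtract) {
            this.title = title;
            this.body = body;
            this.canExtract = canExtract;
        }

        boolean canExtractContent() { return canExtract; }

        void setCanExtractContent(boolean value) { canExtract = value; }

        String title() { return title; }

        String extractText() {
            if (!canExtract) {
                throw new IllegalStateException("You do not have permission to extract text");
            }
            return body;
        }
    }

    /**
     * Returns {title, text}. Metadata is always returned; text is extracted only
     * when permitted. overrideFlag models the proposed config option.
     */
    static String[] parse(FakePdf doc, boolean overrideFlag) {
        if (overrideFlag && !doc.canExtractContent()) {
            doc.setCanExtractContent(true); // the line that is open to discussion
        }
        // Checking first avoids the exception and the resulting parse failure.
        String text = doc.canExtractContent() ? doc.extractText() : "";
        return new String[] { doc.title(), text };
    }

    public static void main(String[] args) {
        String[] strict = parse(new FakePdf("Example", "body text", false), false);
        System.out.println(strict[0] + " / [" + strict[1] + "]");
        String[] lenient = parse(new FakePdf("Example", "body text", false), true);
        System.out.println(lenient[0] + " / [" + lenient[1] + "]");
    }
}
```

With the option off, a restricted document still yields its title but an empty body; with it on, the body is extracted as well.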
[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when
text-extraction now allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ]
Stefan Neufeind commented on NUTCH-290:
---------------------------------------
This catch block fires in the PDF parser:
    } catch (Exception e) { // run time exception
      LOG.warning("General exception in PDF parser: "+e.getMessage());
      e.printStackTrace();
      return new ParseStatus(ParseStatus.FAILED,
          "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
    }
The exception is:
060522 001010 General exception in PDF parser: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
    at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)
Could it be that, as a fallback, when the document can't be parsed and no "description" is returned, the document itself is used as the "description" in the search output? If yes: for binary files this seems to lead to problems.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when
text-extraction not allowed
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ]
Stefan Neufeind commented on NUTCH-290:
---------------------------------------
But if one plugin fails in 0.8-dev, isn't the next one used? I understand that in the default config the text parser would be used as the last-resort fallback.
Also, I'm not sure where the summary text comes from if I use the patch above to avoid the exception but return empty parse data.