
Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/28 15:35:35 UTC

[jira] Created: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

parse-pdf: Garbage (?) indexed when text-extraction now allowed
---------------------------------------------------------------

         Key: NUTCH-290
         URL: http://issues.apache.org/jira/browse/NUTCH-290
     Project: Nutch
        Type: Bug

  Components: indexer  
    Versions: 0.8-dev    
    Reporter: Stefan Neufeind


It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF:
http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ] 

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

The plugin itself works fine now, imho: it no longer throws an exception and, if extraction is allowed, outputs the text correctly.
However, I still get the "garbage output" from a PDF. Could that be because, when no extraction is allowed (empty parse text returned), the parser still falls back to indexing the raw text?

What I did was delete crawl_parse and parse_* from the segments directory, run "nutch parse" and reindex everything. However, the raw chars in the search output (summary) remain. :-((


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:
----------------------------------------

If a parser throws an exception, see Fetcher, line 261:

        try {
          parse = this.parseUtil.parse(content);
          parseStatus = parse.getData().getStatus();
        } catch (Exception e) {
          parseStatus = new ParseStatus(e);
        }
        if (!parseStatus.isSuccess()) {
          LOG.warning("Error parsing: " + key + ": " + parseStatus);
          parse = parseStatus.getEmptyParse(getConf());
        }

then we use the empty parse object, and an empty parse simply contains no text (see getText):

  private static class EmptyParseImpl implements Parse {

    private ParseData data = null;

    public EmptyParseImpl(ParseStatus status, Configuration conf) {
      data = new ParseData(status, "", new Outlink[0],
                           new Metadata(), new Metadata());
      data.setConf(conf);
    }

    public ParseData getData() {
      return data;
    }

    public String getText() {
      return "";
    }
  }

So the problem should be somewhere else.


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ] 

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

But as I understand the plugin, it still extracts as much as possible (metadata) from the PDF. So if indexing is not allowed but this is a PDF, then returning empty text as the document body should be fine, shouldn't it? Nothing except a PDF plugin will be able to handle a PDF correctly in this case.

Stefan G., can you point out why I see binary data as the summary for a PDF, and whether there is a possible fix for it in the context of this bug?


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:
----------------------------------------

As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all.
So to solve this problem, the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?
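
To make the two cases concrete, here is a minimal sketch of the behaviour described above. It is not the Nutch code: ParserChainSketch, Result, Parser and textToIndex are hypothetical stand-ins, and the chain logic only models the reading given in this comment (a failed status falls through to the next parser, an exception does not).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Nutch's classes.
final class ParserChainSketch {

    /** Minimal parse result: a success flag plus the extracted text. */
    static final class Result {
        final boolean success;
        final String text;

        Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** A parser either returns a Result or throws. */
    interface Parser {
        Result parse(byte[] content) throws Exception;
    }

    /**
     * Tries the parsers in order and moves on to the next one only when the
     * previous one returned an unsuccessful result. Exceptions are not caught
     * here, so they end the chain and propagate to the caller.
     */
    static Result parse(List<Parser> parsers, byte[] content) throws Exception {
        Result last = new Result(false, "");
        for (Parser parser : parsers) {
            last = parser.parse(content);
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    /** Caller modelled on the Fetcher snippet: any failure ends in empty text. */
    static String textToIndex(List<Parser> parsers, byte[] content) {
        try {
            Result result = parse(parsers, content);
            return result.success ? result.text : "";
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] raw = "%PDF-1.4 raw bytes".getBytes(StandardCharsets.ISO_8859_1);
        Parser pdfReportsFailure = content -> new Result(false, "");
        Parser pdfThrows = content -> {
            throw new IOException("You do not have permission to extract text");
        };
        Parser textFallback =
            content -> new Result(true, new String(content, StandardCharsets.ISO_8859_1));

        // Failed status: the text fallback runs, so the raw bytes become the text.
        System.out.println("[" + textToIndex(Arrays.asList(pdfReportsFailure, textFallback), raw) + "]");
        // Exception: the chain stops before the fallback, so the text stays empty.
        System.out.println("[" + textToIndex(Arrays.asList(pdfThrows, textFallback), raw) + "]");
    }
}
```

If this reading is right, a PDF parser that reports a failed status hands the raw bytes to the text fallback, while one that throws leaves the document with an empty body.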



[jira] Updated: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]

Stefan Neufeind updated NUTCH-290:
----------------------------------

    Summary: parse-pdf: Garbage indexed when text-extraction not allowed  (was: parse-pdf: Garbage (?) indexed when text-extraction now allowed)


[jira] Updated: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]

Stefan Neufeind updated NUTCH-290:
----------------------------------

    Attachment: NUTCH-290-canExtractContent.patch

This patch adds a check to first see whether text extraction is allowed, and only tries to extract text in that case (this prevents the above-mentioned exception and the parse failure).

Note: The line

  ((PDStandardEncryption) encDict).setCanExtractContent(true);

is imho up for discussion. It only sets a bit on "encrypted" documents. Since I've read in several places that many people seem to set this to "false" for no good reason, I believe we don't really "break encryption" with this line, and as such should try to index as much data as possible.
Does anybody have "problems" with this line? If yes, maybe it could be a config option that is false by default?
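
A rough sketch of how such a config option might gate the override, assuming nothing about the attached patch itself. PdfPermissions, extractBody and the overrideAllowed flag are hypothetical names made up for this illustration; they are not PDFBox or Nutch API, and the real patch may look quite different.

```java
import java.util.function.Supplier;

// Hypothetical sketch; these types are NOT the PDFBox or Nutch API.
final class ExtractionPolicySketch {

    /** Stand-in for the "can extract content" bit of an encrypted PDF. */
    static final class PdfPermissions {
        private boolean canExtractContent;

        PdfPermissions(boolean canExtractContent) {
            this.canExtractContent = canExtractContent;
        }

        boolean canExtractContent() {
            return canExtractContent;
        }

        void setCanExtractContent(boolean allowed) {
            this.canExtractContent = allowed;
        }
    }

    /**
     * Returns the body text to index.
     *
     * @param overrideAllowed hypothetical config option; when false (the
     *        proposed default) the document's own permission bit is respected
     * @param extractor stands in for the actual text extraction step
     */
    static String extractBody(PdfPermissions perms, boolean overrideAllowed,
                              Supplier<String> extractor) {
        if (!perms.canExtractContent() && overrideAllowed) {
            // Only flip the bit when the operator explicitly opted in.
            perms.setCanExtractContent(true);
        }
        if (!perms.canExtractContent()) {
            // Skip extraction entirely instead of provoking an exception.
            return "";
        }
        return extractor.get();
    }

    public static void main(String[] args) {
        Supplier<String> extractor = () -> "document body";
        System.out.println("[" + extractBody(new PdfPermissions(false), false, extractor) + "]");
        System.out.println("[" + extractBody(new PdfPermissions(false), true, extractor) + "]");
    }
}
```

With the flag off, a protected document yields an empty body but can still contribute its metadata; with the flag on, the permission bit is overridden before extraction.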


[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ] 

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

this one here fires in the PDF-parser:

    } catch (Exception e) { // run time exception
      LOG.warning("General exception in PDF parser: " + e.getMessage());
      e.printStackTrace();
      return new ParseStatus(ParseStatus.FAILED,
              "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
    }

The exception is:

060522 001010 General exception in PDF parser: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
        at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)


Could it be that, as a fallback when the document can't be parsed and no "description" is returned, the document itself is used as the "description" in the search output? If so, this seems to cause problems for binary files.


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ] 

Stefan Neufeind commented on NUTCH-290:
---------------------------------------

But if one plugin fails in 0.8-dev, isn't the next one used? As I understand it, in the default config the text parser would be used as the last-resort fallback.

Also, I'm not sure where the summary text comes from if I use the patch above, which avoids the exception but returns empty parse data.
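
As an illustration of why a text-parser fallback would produce this kind of output, here is a small sketch of what happens when raw PDF bytes are simply decoded as characters. The class and method names are made up for this example, and whether Nutch's text parser actually runs on this document is exactly the open question here, not something the sketch confirms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from the Nutch text parser.
final class RawTextFallbackSketch {

    /** What a naive plain-text parser does: every byte becomes one character. */
    static String decodeAsText(byte[] content) {
        return new String(content, StandardCharsets.ISO_8859_1);
    }

    /** Fraction of characters that are control characters other than tab, CR and LF. */
    static double controlCharRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int control = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                control++;
            }
        }
        return (double) control / text.length();
    }

    public static void main(String[] args) {
        // A PDF header followed by a few bytes typical of a compressed stream.
        byte[] pdfLike = {'%', 'P', 'D', 'F', 0x00, 0x01, (byte) 0x9c};
        String decoded = decodeAsText(pdfLike);
        System.out.println("starts with %PDF: " + decoded.startsWith("%PDF"));
        System.out.println("control char ratio: " + controlCharRatio(decoded));
    }
}
```

Text decoded this way is still a valid string, so nothing downstream would reject it; it would just show up as raw characters in a summary.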
