Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/12/06 23:04:09 UTC

[jira] Created: (NUTCH-133) ParserFactory does not work as expected

ParserFactory does not work as expected
---------------------------------------

         Key: NUTCH-133
         URL: http://issues.apache.org/jira/browse/NUTCH-133
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev, 0.7.1, 0.7.2-dev    
    Reporter: Stefan Groschupf
    Priority: Blocker


Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify their source.
From our point of view, the problems described here could be the cause of many other problems reported daily on the mailing lists.
A summary of the problems follows.

Problem:
Some servers return correct but mixed-case header keys like 'Content-type' or 'content-Length' in the http response header.
That's why, for example, a get("Content-Type") lookup fails and a page is detected as zip by the magic content type detection mechanism.
We also noted that this is a common reason why pdf parsing fails, since the Content-Length lookup does not return the correct value.

Sample:
A server returns "text/HTML" or "application/PDF" as the content type, or "Content-length" as a header key,
or this url:
http://www.lanka.info/dictionary/EnglishToSinhala.jsp

Solution:
First, write only lower-case keys into the properties, and also convert all keys used to query the metadata to lower case.
e.g.:
HttpResponse.java, line 353:
Use lower case here and for all keys used to query header properties (including content-length). Change:
  String key = line.substring(0, colonIndex);
to
  String key = line.substring(0, colonIndex).toLowerCase();
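An alternative to lower-casing keys at every write and read site would be to store the headers in a case-insensitive map, so a canonical get("Content-Type") keeps working no matter how the server cased the key. A minimal sketch, not part of the patch; the class and method names here are illustrative only:

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {
  // A TreeMap built with CASE_INSENSITIVE_ORDER ignores key case on
  // lookup, so neither the write side nor the read side needs toLowerCase().
  private final Map<String, String> headers =
      new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

  public void put(String key, String value) {
    headers.put(key, value);
  }

  public String get(String key) {
    return headers.get(key);
  }

  public static void main(String[] args) {
    CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
    h.put("Content-type", "text/html");        // mixed-case key from a server
    System.out.println(h.get("Content-Type")); // canonical-case lookup still works
  }
}
```

This keeps the normalization in one place instead of spreading toLowerCase() calls across many classes.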


Problem:
MimeTypes-based discovery (magic- and url-based) is only done when no content type was delivered by the web server. This does not happen very often; mostly the problem was mixed-case keys in the header.

see:
 public Content toContent() {
    String contentType = getHeader("Content-Type");
    if (contentType == null) {
      MimeType type = null;
      if (MAGIC) {
        type = MIME.getMimeType(orig, content);
      } else {
        type = MIME.getMimeType(orig);
      }
      if (type != null) {
          contentType = type.getName();
      } else {
          contentType = "";
      }
    }
    return new Content(orig, base, content, contentType, headers);
  }
Solution:
Use the content-type information exactly as delivered by the web server, and move content type detection from the Protocol plugins to the component where parsing is done - the ParserFactory.
Then create a list of parsers for both the content type returned by the server and the independently detected content type. Finally, iterate over all parsers until we get a successful parse status.
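The proposed fallback loop could look roughly like this. This is only a sketch: Parser and the candidate list are simplified stand-ins for Nutch's real plugin interfaces, and a null return stands in for a failed ParseStatus.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Nutch's Parser extension point:
// returns parsed text, or null if the parser cannot handle the content.
interface Parser {
  String parse(byte[] content);
}

public class ParserChain {
  // Try each candidate parser in order (e.g. first the ones registered for
  // the server-declared type, then the ones for the detected type) until
  // one succeeds.
  static String parseWithFallback(List<Parser> candidates, byte[] content) {
    for (Parser p : candidates) {
      String result = p.parse(content);
      if (result != null) {
        return result;
      }
    }
    return null; // no parser succeeded
  }

  public static void main(String[] args) {
    List<Parser> parsers = new ArrayList<Parser>();
    parsers.add(c -> null); // parser for the header type fails
    parsers.add(c -> "parsed " + c.length + " bytes"); // detected type succeeds
    System.out.println(parseWithFallback(parsers, new byte[4])); // parsed 4 bytes
  }
}
```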



Problem:
Content is parsed even if the protocol reports an exception and a non-successful status; in that case the content is always new byte[0].

Solution:
Fetcher.java, line 243.
Change:   if (!Fetcher.this.parsing ) { .. to
 if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
       // TODO: we should perhaps not write out empty parse text and parse
       // data here; I suggest giving outputPage a parameter parsed true/false
          outputPage(new FetcherOutput(fle, hash, protocolStatus),
                content, new ParseText(""),
                new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
        return null;
      }


Problem:
Currently, parser configuration is based on plugin ids, but one plugin can have several extensions, so a single plugin can normally provide several parsers. The mechanism itself is not the limitation; the wrong identifiers are simply used in the configuration process.

Solution:
Change plugin ids to extension ids in the parser configuration file, and also change the code in the ParserFactory to use extension ids everywhere.


Problem:
There is no clear differentiation between content type and mime type.
I noticed that some plugins call metaData.get("Content-Type") while others call content.getContentType();
in theory these can return different values, since the content type may have been detected by the MimeTypes util and thus differ from the one delivered in the http response header.
As mentioned, the content type is currently only detected by the MimeTypes util when the header does not contain any content type information or had problems with mixed-case keys.

Solution:
Take the content type property out of the meta data and restrict access to it to a dedicated getter method.
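A minimal sketch of that separation (hypothetical class and field names, not the actual patch): the raw server header stays in the metadata, while the trusted, possibly magic-detected type is reachable only through its own getter.

```java
import java.util.Properties;

public class ContentSketch {
  private final Properties metadata = new Properties();
  private final String contentType; // detected/verified type, set once

  ContentSketch(String serverType, String detectedType) {
    if (serverType != null) {
      // The server-declared value is kept, but only as raw header metadata.
      metadata.setProperty("content-type", serverType);
    }
    this.contentType = detectedType;
  }

  // The only sanctioned way to ask "what type is this content?".
  public String getContentType() {
    return contentType;
  }

  // Raw header access, deliberately separate from the trusted type.
  public String getHeader(String name) {
    return metadata.getProperty(name);
  }

  public static void main(String[] args) {
    ContentSketch c = new ContentSketch("text/HTML", "text/html");
    System.out.println(c.getContentType()); // text/html
  }
}
```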



Problem:
Most protocol plugins check whether the content type is null, and only in that case use the MimeTypes util. Since my patch moves mime type detection to the ParserFactory - which, from my point of view, is the right place - this code is now unnecessary and can be removed from the protocol plugins. I never found a case where no content type was returned at all; only mixed-case keys were used.

Solution:
Remove this detection code, since it now lives in the ParserFactory.
I didn't change this myself: the more code I change, the less chance the patch has of getting into the sources. I suggest we open a low-priority issue and remove the code once we next change the plugins.


Problem:
This is not a problem, but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType, e.g.
  /** Test of <code>getExtensions</code> method. */
    public void testGetExtensions() {
    }
Solution:
Implement these tests or remove the test methods.
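As a sketch of what implementing such a test could look like - the MimeTypeStub below is illustrative only, and the real MimeType API may differ:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stub; the real Nutch MimeType class's API may differ.
class MimeTypeStub {
  private final List<String> extensions = new ArrayList<String>();

  void addExtension(String ext) {
    extensions.add(ext);
  }

  List<String> getExtensions() {
    return extensions;
  }
}

public class TestGetExtensionsSketch {
  public static void main(String[] args) {
    MimeTypeStub html = new MimeTypeStub();
    html.addExtension("html");
    html.addExtension("htm");
    // A real testGetExtensions() should at least verify that registered
    // extensions round-trip through the getter.
    if (!html.getExtensions().contains("htm")) {
      throw new AssertionError("extension 'htm' not returned");
    }
    System.out.println(html.getExtensions());
  }
}
```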


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359725 ] 

Stefan Groschupf commented on NUTCH-133:
----------------------------------------

Doug, 
ok, I will split things in different patches and open a set of new bugs. 
Jerome: 
If you take a careful look at my patch you will notice that I already provide a solution for handling the server header content type, url extension, and content type detection in a sensible way - just take a look at the new ParserFactory. Don't waste your time; I will get things fixed, maybe with Doug's hints. I would be very happy if you could focus on MimeTypes, since it is the heart of the mime type guessing and also needs a lot of improvement, and as Chris mentioned you are the 'expert' in this area.
 Chris,
for sure the rewritten classes have the same names and the same methods, since anything else would break the api, wouldn't it?




[jira] Updated: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]

Stefan Groschupf updated NUTCH-133:
-----------------------------------

    Attachment: ParserFactoryPatch_nutch.0.7_patch.txt

A patch that solves the described problems for Nutch 0.7.
MimeTypes detection is now REALLY used; before, it was only used in some exceptional cases (mixed-case header keys).
So credit for this still goes to the old crew; however, ParserFactory and ParseUtil are rewritten, now with less code and perhaps better structured.
I also changed the code so that only lower-case keys are used for meta data extracted from the header, which means smaller changes in a lot of classes. This is just a fix to get (in my tests) more pages parsed than before. It is definitely not the final solution, since for example we still need to change the plugin-id-based parser configuration to an extension-id-based one.
The Nutch junit tests pass on my box, but give it a try and let me know if it improves your results as well.




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359653 ] 

Doug Cutting commented on NUTCH-133:
------------------------------------

I think we should distinguish between the value of content.getContentType() and metaData.get("Content-Type").  The former should be trusted and the latter should be what was declared by the server.

So, for HTTP we could do something like:

private String getContentType(HashMap headers, byte[] data) {
  String typeName = (String) headers.get("Content-Type");
  MimeType type = typeName == null ? null : MimeTypes.getMimeType(typeName);
  if (typeName == null ||
      type == null ||
      (type.hasMagic() && !type.matches(data))) {
    MimeType mimeType = MimeTypes.getMimeType(data);
    typeName = mimeType.toString();
  }
  return typeName;
}

This always double-checks that types match their magic (if any is defined) and only tries all magic strings when the declared type's magic is mismatched.  Perhaps this checking could even be moved to the Content constructor.  Thoughts?




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

Stefan:
Taking a closer look at the ParserFactory patch:

1. You can use the MimeType.clean(String) static method to clean the content-type
2. In the actual MimeTypes implementation, the getMimeType(String, byte[]) method returns the MimeType based on the document name if one matches (without guessing from magic). So use getMimeType(byte[]) if you want to guess the content type from magic.
3. Your patch doesn't really try to guess the content-type; instead it tries to parse the content using the parsers declared for the header content-type AND then the ones declared for the content-type detected from the file extension. That implies you consider the header content-type more reliable... no?
4. There are too many calls to the .toLowerCase() and .equalsIgnoreCase() methods in your code. One of the major Java bottlenecks is String manipulation, so the basic idea is to use as little of it as you can.
5. Looking at http://www.w3.org/TR/REC-html40/types.html#h-6.7 it seems that content-types are case-insensitive, so the way to deal with content-type case sensitivity is simply to patch the MimeType.clean(String) method so that it lower-cases the mime-type.
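A sketch of point 5, assuming clean() should strip parameters and normalize case; the real MimeType.clean(String) may behave differently:

```java
public class MimeTypeCleanSketch {
  // Normalize a content-type header value: drop any parameters
  // (e.g. "; charset=UTF-8"), trim whitespace, and lower-case the
  // result, since MIME types are case-insensitive.
  static String clean(String type) {
    if (type == null) {
      return null;
    }
    int semi = type.indexOf(';');
    if (semi != -1) {
      type = type.substring(0, semi);
    }
    return type.trim().toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(clean("Text/HTML; charset=UTF-8")); // text/html
  }
}
```

With normalization centralized here, callers could compare cleaned types with plain equals() instead of scattering equalsIgnoreCase() calls, which also addresses point 4.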


> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detect a set of problems until working with different content and parser types, we worked together to identify the problem source.
> From our point of view this described problems could be the source for many other problems daily described in the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers returns mixed case but correct header keys like 'Content-type' or 'content-Length'  in the http response header.
> That's why for example a get("Content-Type") fails and a page is detected as zip using the magic content type detection mechanism. 
> Also we note that this a common reason why pdf parsing fails since Content-Length does return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First just write only lower case keys into the properties and later convert all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower cases here and for all keys used to query header properties (also content-length) change:  String key = line.substring(0, colonIndex); to  String key = line.substring(0, colonIndex) .toLowerCase();
> Problem:
> MimeTypes based discovery (magic and url based) is only done in case the content type was not delivered by the web server, this happens not that often, mostly this was a problem with mixed case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as it is from the webserver and move the content type discovering from Protocol plugins to the Component where the parsing is done - to the ParseFactory.
> Than just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parser until we got a successfully parsed status.
> Problem:
> Content will be parsed also if the protocol reports a exception and has a non successful status, in such a case the content is new byte[0] in any case.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO we may should not write out here emthy parse text and parse date, i suggest give outputpage a parameter parsed true / false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Parser configuration is currently based on plugin IDs, but one plugin can declare several extensions, so a single plugin can normally provide several parsers. This is not a limitation in itself; the wrong values are simply used during the configuration process.
> Solution:
> Change plugin IDs to extension IDs in the parser configuration file, and change the ParserFactory code to use extension IDs everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I notice that some plugins call metaData.get("Content-Type") and others content.getContentType().
> In theory these can return different values, since the content type may have been detected by the MimeTypes util and thus differ from the one delivered in the http response header.
> As mentioned, the content type is currently only detected by the MimeTypes util when the header contains no content-type information at all, or when mixed-case keys caused problems.
> Solution:
> Take the content-type property out of the metadata and restrict access to it through its own getter method.
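That separation could look roughly like this. The field and class names are illustrative only, not the real Nutch Content class:

```java
import java.util.Properties;

// Sketch: the content type lives in its own field with a dedicated getter,
// instead of being fished out of the metadata Properties by key.
public class ContentSketch {
  private final String contentType;   // single authoritative value
  private final Properties metadata;  // header metadata, without content type

  public ContentSketch(String contentType, Properties metadata) {
    this.contentType = contentType;
    this.metadata = metadata;
  }

  /** The only way to read the content type. */
  public String getContentType() {
    return contentType;
  }

  public String getMetadata(String key) {
    return metadata.getProperty(key);
  }
}
```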
> Problem:
> Most protocol plugins check whether the content type is null, and only in that case use the MimeTypes util. Since my patch moves mime-type detection to the ParserFactory - which is, from my point of view, the right place - this code is now unnecessary and can be removed from the protocol plugins. I never found a case where no content type was returned; only mixed-case keys were used.
> Solution:
> Remove this detection code, since it now lives in the ParserFactory.
> I didn't change this myself: the more code I change, the less chance the patch has of getting into the sources. I suggest we open a low-priority issue and remove the code once we revise the plugins.
> Problem:
> This is not a bug, but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType, e.g.
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Closed: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]
     
Stefan Groschupf closed NUTCH-133:
----------------------------------

    Resolution: Won't Fix

We will split the problems described here into a set of bugs to fix things step by step.

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>



[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359634 ] 

Doug Cutting commented on NUTCH-133:
------------------------------------

Stefan, sorry I missed the test case.

If others agree that these cases should pass, then we should commit the test case alone as a start.  Then we can separately decide how to fix things.  From your description, with six "solutions", perhaps this should really be six separate patches to six separate bugs.

For example, dealing with case in content-type is a separable issue.  Should all metadata keys be case-insensitive?  If so, then we should instead probably implement this with a case-insensitive Properties replacement: a TreeMap using String.CASE_INSENSITIVE_ORDER.
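The case-insensitive map Doug suggests is a one-liner in standard Java; a minimal sketch (illustrative wrapper name, and note the real fix would need a Properties-compatible type):

```java
import java.util.Map;
import java.util.TreeMap;

// A TreeMap ordered by String.CASE_INSENSITIVE_ORDER looks up header
// keys regardless of case, so get("Content-Type") also finds a value
// stored under "content-type".
public class CaseInsensitiveHeaders {
  public static Map<String, String> newHeaderMap() {
    return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
  }
}
```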





[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Lutischán Ferenc (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359564 ] 

Lutischán Ferenc commented on NUTCH-133:
----------------------------------------

Dear Stefan,

Please see http://issues.apache.org/jira/browse/NUTCH-123.
This problem also shows up in cached.jsp.

Regards,
            Ferenc




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359647 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

Stefan:

1. URL extensions and magic content-type detection are both used. They are the only way protocol-file and protocol-ftp can guess the content type of a document (see FileResponse.java and FtpResponse.java).
So the problem is "only" for HTTP.
As an "ASAP" solution, I suggest patching the HTTP-related plugins to systematically use the mime-type resolver. But what policy should apply when you have both a mime type from the protocol layer and another one from the mime-type resolver? Which one should win (we have not yet decided on this)? What do you think?

2. I agree with Doug. This issue should be split into six separate issues.

3. Unit tests: I'm fine with committing the tests Stefan provided about content-type case. But I'm not sure TestParseUtil is the right place for them: they don't test ParseUtil itself, but the way metadata keys are stored in Nutch.

4. I think we can use case-insensitive metadata keys. I don't know any protocol where case sensitivity is actually significant for headers or metadata keys (even if the specification says they are case-sensitive).






[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359610 ] 

Stefan Groschupf commented on NUTCH-133:
----------------------------------------

Jerome:
For about three months now, URL extensions and magic content-type detection have not been used at all. I suggest we apply this patch ASAP: many people work with the latest nightly builds, and we have a kind of responsibility, so we should focus on this issue today.
I also suggest we make a maintenance release of 0.7 ASAP, since this problem is also part of the latest release. The current code took us two steps back: the previous solution first used URL extensions, and then tried all parsers in the list, taking the one that was able to parse the content. Today we just use the content type returned by the server (which is very often wrong) and we use just one parser.
So today no content-type detection is used at all.
If you can improve and work on MimeTypes.java, that would be great; I see plenty of room for improvement.
For example, when only the character encoding differs from UTF-8, the mime-type util detects an HTML page as a gzip application.
I didn't change anything in your MimeType classes, so you can improve them directly and use them with my new ParserFactory.
So please focus on these MimeTypes.java improvements! When you have a version, please attach it as a patch to the JIRA so all developers can review it before you commit it to the SVN.


Chris:
Thanks for taking the time to look at this.
First, I only rewrote classes where it was absolutely necessary; you will notice that only rewritten classes have a new author (my IDE sets the author name when creating a new file). Second, only classes that were unused were removed. The code does exactly what you described a ParserFactory should do - not one part of your concept was changed - it just works now and may have fewer lines of code.

The REAL bug is that today I can parse less than 80% of the pages in my test set, and in an intranet project with a misconfigured IIS we could not parse anything at all. So don't mix things up - this is a serious blocker bug! First we should discuss the fact that parser selection today uses no detection mechanism at all, and then we can also talk about improvements.
Regarding your last comment: using an extension ID instead of a plugin ID in the parser configuration file would make it possible to order parsers even when they ship inside the same plugin, since extension IDs are unique within a plugin and across all plugins.

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detect a set of problems until working with different content and parser types, we worked together to identify the problem source.
> From our point of view this described problems could be the source for many other problems daily described in the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers return mixed-case but otherwise correct header keys, like 'Content-type' or 'content-Length', in the HTTP response header.
> That is why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content-type detection mechanism. 
> We also noted that this is a common reason why PDF parsing fails, since Content-Length does not return the correct value. 
> Sample:
> A server returns "text/HTML" or "application/PDF", or a "Content-length" key;
> see, for example, this URL:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and convert all keys used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (including content-length); change: String key = line.substring(0, colonIndex); to: String key = line.substring(0, colonIndex).toLowerCase();
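An alternative sketch of the same idea: instead of lower-casing the key at every call site, the headers could be stored in a map whose lookups ignore key case. This is not the actual Nutch HttpResponse code, just a minimal illustration (class and method names are hypothetical) of why mixed-case keys from a server then stop mattering:

```java
import java.util.Map;
import java.util.TreeMap;

public class HeaderExample {
    // Build a header map whose key lookups ignore case, so a server
    // sending 'Content-type' or 'content-Length' still matches queries
    // made with the canonical casing.
    static Map<String, String> caseInsensitiveHeaders() {
        return new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = caseInsensitiveHeaders();
        headers.put("Content-type", "text/html");  // mixed-case keys as sent by the server
        headers.put("content-Length", "1024");
        System.out.println(headers.get("Content-Type"));   // text/html
        System.out.println(headers.get("Content-Length")); // 1024
    }
}
```

Either approach (lower-casing on write and read, or a case-insensitive map) fixes the lookup failures described above.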
> Problem:
> MimeTypes-based discovery (magic and URL based) is only done when the content type was not delivered by the web server. This does not happen very often; mostly the problem was mixed-case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as-is from the web server and move the content-type discovery from the protocol plugins to the component where the parsing is done: the ParserFactory.
> Then just create a list of parsers for the content type returned by the server and for the custom-detected content type. Finally, iterate over all parsers until we get a successfully parsed status.
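The fallback loop proposed above can be sketched as follows. This is not the actual ParserFactory code; Parser and the method names here are hypothetical stand-ins that only illustrate trying each candidate parser (e.g. the one matching the server-reported type first, then the one matching the detected type) until one succeeds:

```java
import java.util.Arrays;
import java.util.List;

public class ParserFallback {
    // Hypothetical stand-in for a parser plugin: returns true on success.
    interface Parser {
        boolean parse(byte[] content);
    }

    // Try each candidate in order; return the index of the first parser
    // that reports a successful parse status, or -1 if none succeeds.
    static int parseWithFallback(List<Parser> candidates, byte[] content) {
        for (int i = 0; i < candidates.size(); i++) {
            if (candidates.get(i).parse(content)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Parser failing = c -> false; // e.g. parser chosen from a wrong header type
        Parser working = c -> true;  // e.g. parser chosen from magic detection
        int winner = parseWithFallback(Arrays.asList(failing, working), new byte[0]);
        System.out.println("winning parser index: " + winner); // 1
    }
}
```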
> Problem:
> Content is parsed even if the protocol reports an exception and a non-successful status; in such a case the content is new byte[0] anyway.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Currently the configuration of parsers is based on plugin ids, but one plugin can have several extensions, so a plugin can normally provide several parsers. This is not limited by design; the wrong values are simply used in the configuration process. 
> Solution:
> Change plugin id to extension id in the parser configuration file, and also change the code in the parser factory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type. 
> I noticed that some plugins call metaData.get("Content-Type") or content.getContentType();
> in theory these can return different values, since the content type could be detected by the MimeTypes util and then differ from the value delivered in the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes util when the header does not contain any content-type information, or when there were problems with mixed-case keys.
> Solution:
> Take the content-type property out of the metadata and clearly restrict access to this metadata to its own getter method.
> Problem:
> Most protocol plugins check whether the content type is null, and only in that case is the MimeTypes util used. Since my patch moves the mime-type detection to the parser factory (where, from my point of view, it belongs), this code is now unnecessary and can be removed from the protocol plugins. I never found a case where no content type was returned; just mixed-case keys were used. 
> Solution: 
> Remove this detection code, since it is now in the parser factory.
> I didn't change this because the more code I change, the less chance there is to get the patch into the sources; I suggest we open a low-priority issue, and once we change the plugins we can remove it.
> Problem:
> This is not a problem but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359770 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

Doug,

Oh, yes, I understand what you mean. Yes, that really makes sense.
I will commit a patch of Content for this (and remove all other MimeType resolution spread over the code) in the next few days.




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359627 ] 

Stefan Groschupf commented on NUTCH-133:
----------------------------------------

Doug, I already attached a unit test that calls ParseUtil.parse(Content) and simulates the different scenarios.
I can extend the test if you tell me what you would like to see beyond what I already test.




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359645 ] 

Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

Hi Stefan,

  Thanks for your reply. I actually like a lot of your proposed changes having to do with the MimeType cleansing and the detection mechanism. I think that you should continue to work with Jerome on that as he seems to be the guy who is the MimeType guru on the project.

With regards to your comment that "..First, I only rewrote classes where it was absolutely necessary...", I respectfully disagree with this assertion. Take the TestParserFactory test case class. Your patch almost completely removes its methods; however, all of the methods in the TestParserFactory class currently committed in Nutch are in my mind extremely relevant tests to ensure proper operation of the ParserFactory. Continuing on, your patch completely removes the ParsePluginsReader class and replaces it with a class, "ParsePluginConf", that does the exact same thing (i.e., reading the parse-plugins.xml file). Your changes in the ParseUtil class are equally perplexing. You remove the public final static Parse parse(Content content) throws ParseException method and replace it with a method public static Parse parse(Content content) throws ParseException that seemingly does the exact same thing as the previous one. The same is true for the public final static Parse parseByParserId(String parserId, Content content) in the same class; it is replaced in your patch with seemingly the exact same method, minus the "final" declaration. 

These are the things I was talking about. This patch file, rather than just updating the current sources with the true bits and pieces of your new contribution in places, completely removes methods, files, etc., and then replaces them with much of the same code, at least in the ParseUtil, ParsePluginsReader, etc. classes that I have looked at. 

This is not to say that the rest of your suggestions and bugs aren't worthwhile, nor your suggestion to use the extensionId rather than the pluginId in the parse-plugins.xml file. These ideas are definitely interesting and worth exploring amongst the whole group working on these issues.

Thanks,
  Chris




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359679 ] 

Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

Hi Doug,

 I like this idea for the getContentType method. In general, I completely agree that the server provided content type cannot really be trusted.

As for moving this to the Content class constructor, I think that the move makes sense. Jerome, et al, thoughts?
Just to note, this creates an explicit dependency on the mimeType resolver for constructing new Content objects, which may or may not affect performance.

Thanks,
  Chris


> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detect a set of problems until working with different content and parser types, we worked together to identify the problem source.
> From our point of view this described problems could be the source for many other problems daily described in the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers returns mixed case but correct header keys like 'Content-type' or 'content-Length'  in the http response header.
> That's why for example a get("Content-Type") fails and a page is detected as zip using the magic content type detection mechanism. 
> Also we note that this a common reason why pdf parsing fails since Content-Length does return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First just write only lower case keys into the properties and later convert all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower cases here and for all keys used to query header properties (also content-length) change:  String key = line.substring(0, colonIndex); to  String key = line.substring(0, colonIndex) .toLowerCase();
> Problem:
> MimeTypes based discovery (magic and url based) is only done in case the content type was not delivered by the web server, this happens not that often, mostly this was a problem with mixed case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as it is from the webserver and move the content type discovering from Protocol plugins to the Component where the parsing is done - to the ParseFactory.
> Than just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parser until we got a successfully parsed status.
> Problem:
> Content will be parsed also if the protocol reports a exception and has a non successful status, in such a case the content is new byte[0] in any case.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO we may should not write out here emthy parse text and parse date, i suggest give outputpage a parameter parsed true / false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Actually the configuration of parser is done based on plugin id's, but one plugin can have several extentions, so  normally a plugin can provide several parser, but this is no limited just wrong values are used in the configuration process. 
> Solution:
> Change plugin id to  extension id in the parser configuration file and also change this code in the parser factory to use extension id's everywhere.
> Problem:
> there is not a clear differentiation between content type and mime type. 
> I'm notice that some plugins call metaData.get("Content-Type) or content.getContentType();
> Actually in theory this can return different values, since the content type could be detected by the MimesTypes util and is not the same as delivered in the http response header.
> As mentioned actually content type is only detected by the MimeTypes util in case the header does not contains any content type informations or had problems with mixed case keys.
> Solution:
> Take the content type property out of the meta data and clearly restrict the access of this meta data into the own getter method.
> Problem:
> Most protocol plugins  checking if content type is null only in this case the MimeTypes util is used. Since my patch move the mime type detection to the parser factory - where from my point of view - is the right place, it is now unneccary code we can remove from the protocol plugins. I never found a case where no content type was returned just mixed case keys was used. 
> Solution. 
> Remove this detection code, since it is now in the parser factory.
> I didn't change this since more code I  change, I guess there is a  less chance to get the patch into the sources, I suggest we open a low priority issue and once we change the plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359717 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

I think Doug's proposal is the right way of solving this content-type issue.
This solution just misses the "guess mime-type from file extension" step.
In fact, we have 3 clues to guess the content-type of a document:
1. The Content-Type header
2. The file name extension
3. The file content (magic bytes).

We must use all of this information. But dealing with 3 clues is harder than dealing with only 2.
What I suggest is:
1. Guess the content-type from the file content (magic).  [only if the mime.type.magic property is true]
2. If a content-type is found, then return it (magic information is the most reliable, so I suggest there is no need for a double/triple check with the file extension and header).
3. Otherwise:
3.1  Use a double check (as in Doug's piece of code) between the file-extension content-type and the header content-type. But which is the more important clue, the header or the file extension? On the server side, for static files, the content-type is generally set based on the file extension => the file extension is then the more reliable information (it provides a way to deal with bad configuration on the server side). For dynamic pages, the content-type is set by the application server, by the servlet / cgi / ..., or by the page itself, and the file name can have any (or no) extension => in such a case, the header is the more reliable clue we have... So what is the best way of resolving the content-type when no magic-bytes resolution is available?
What I suggest as a first approach is to use the double check suggested by Doug, but on the file extension instead of on the magic bytes.
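The resolution order suggested above (magic first, then a file-extension/header double check) could be sketched roughly as follows. All class and method names here, and the tiny lookup tables, are illustrative stand-ins, not actual Nutch APIs:

```java
// Hypothetical sketch of the 3-clue content-type resolution discussed above.
public class ContentTypeResolver {

    /** Clue 3: magic bytes, or null if nothing matched (tiny example table). */
    static String fromMagic(byte[] content) {
        if (content.length >= 4 && content[0] == '%' && content[1] == 'P'
                && content[2] == 'D' && content[3] == 'F') {
            return "application/pdf";
        }
        return null;
    }

    /** Clue 2: the file name extension, or null (tiny example table). */
    static String fromExtension(String url) {
        if (url.endsWith(".html") || url.endsWith(".htm")) return "text/html";
        if (url.endsWith(".pdf")) return "application/pdf";
        return null;
    }

    /** 1. trust magic; 2. double-check extension vs. header; 3. fall back. */
    static String resolve(String url, byte[] content, String headerType) {
        String magic = fromMagic(content);
        if (magic != null) return magic;              // most reliable clue
        String ext = fromExtension(url);
        if (ext != null && ext.equalsIgnoreCase(headerType)) return ext;
        if (ext != null) return ext;                  // static-file heuristic
        return headerType;                            // dynamic page: trust the header
    }

    public static void main(String[] args) {
        // A mislabeled PDF: the magic bytes win over the wrong header.
        System.out.println(resolve("a.pdf", "%PDF-1.4".getBytes(), "text/html"));
    }
}
```

Whether the extension or the header should win when magic is unavailable is exactly the open question above; the sketch prefers the extension, per the static-file argument.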

If you are ok with this, I can provide a patch for the Content.java file by the end of the week.
(Chris, here is my feeling about Nutch performance: the more "intelligence" (features) we add to Nutch, the more we decrease its performance. True, performance is a very sensitive point for Nutch, but Nutch's global performance is more a matter of scalability (designed by architecture) than of the raw performance of a piece of code.)


> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different content and parser types, and we worked together to identify the source of the problems.
> From our point of view, the problems described here could be the source of many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return mixed-case but otherwise correct header keys, like 'Content-type' or 'content-Length', in the http response header.
> That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content type detection mechanism. 
> We also noted that this is a common reason why pdf parsing fails, since Content-Length does not return the correct value. 
> Sample:
> a server returns "text/HTML" or "application/PDF" or Content-length,
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> Write only lower-case keys into the properties, and convert all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (also content-length); change:  String key = line.substring(0, colonIndex); to  String key = line.substring(0, colonIndex).toLowerCase();
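The proposed change can be sketched as follows; the class and field names here are illustrative stand-ins for the real HttpResponse code, not the exact Nutch source:

```java
import java.util.Properties;

// Minimal sketch: store and query header keys in lower case, so mixed-case
// server keys like "Content-type" can still be found.
public class HeaderDemo {
    static Properties headers = new Properties();

    // Store every key lower-cased, as the patch suggests for HttpResponse.
    static void putHeader(String line) {
        int colonIndex = line.indexOf(':');
        String key = line.substring(0, colonIndex).toLowerCase();
        String value = line.substring(colonIndex + 1).trim();
        headers.setProperty(key, value);
    }

    // Query with lower-cased keys too, so "Content-type" from the server
    // and "Content-Type" from the caller meet in the middle.
    static String getHeader(String name) {
        return headers.getProperty(name.toLowerCase());
    }

    public static void main(String[] args) {
        putHeader("Content-type: application/pdf");    // mixed-case server key
        System.out.println(getHeader("Content-Type")); // found anyway
    }
}
```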
> Problem:
> MimeTypes-based discovery (magic and url based) is only done when the content type was not delivered by the web server. This does not happen very often; mostly the problem was mixed-case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as-is from the webserver, and move the content type discovery from the Protocol plugins to the component where the parsing is done - the ParserFactory.
> Then just create a list of parsers for the content type returned by the server and for the custom-detected content type. We can then iterate over all parsers until we get a successful parse status.
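The iteration described above could look roughly like this; the Parser interface here is a simplified stand-in, not the real org.apache.nutch.parse.Parser:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the "try each candidate parser until one succeeds" loop.
public class ParserLoopDemo {

    interface Parser {
        /** Returns parsed text, or null on failure (simplified stand-in). */
        String parse(byte[] content);
    }

    // Candidates: first the parsers for the server-reported type,
    // then those for the custom-detected type.
    static String parseWithFallback(List<Parser> candidates, byte[] content) {
        for (Parser p : candidates) {
            String result = p.parse(content);
            if (result != null) {
                return result;   // stop at the first successful parse
            }
        }
        return null;             // every candidate failed
    }

    public static void main(String[] args) {
        List<Parser> parsers = new ArrayList<>();
        parsers.add(c -> null);              // wrong parser for this content: fails
        parsers.add(c -> new String(c));     // next candidate succeeds
        System.out.println(parseWithFallback(parsers, "hello".getBytes()));
    }
}
```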
> Problem:
> Content is parsed even if the protocol reports an exception and has a non-successful status; in such a case the content is new byte[0] anyway.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO we probably should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Currently the parser configuration is based on plugin ids, but one plugin can have several extensions, so a plugin can normally provide several parsers. This is not limited by design; the wrong values are simply used in the configuration process.
> Solution:
> Change plugin ids to extension ids in the parser configuration file, and also change the code in the ParserFactory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I noticed that some plugins call metaData.get("Content-Type") or content.getContentType();
> In theory these can return different values, since the content type could have been detected by the MimeTypes util and may differ from the value delivered in the http response header.
> As mentioned, the content type is currently only detected by the MimeTypes util when the header does not contain any content type information or had problems with mixed-case keys.
> Solution:
> Take the content type property out of the meta data and restrict access to it to its own getter method.
> Problem:
> Most protocol plugins check whether the content type is null and only in that case use the MimeTypes util. Since my patch moves the mime type detection to the parser factory - which, from my point of view, is the right place - this is now unnecessary code that we can remove from the protocol plugins. I never found a case where no content type was returned; only mixed-case keys were used.
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I didn't change this myself, since the more code I change, the less chance the patch has of getting into the sources. I suggest we open a low-priority issue, and once we change the plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.
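For the first option, the empty method could assert something concrete. The sketch below uses a stand-in MimeType class, since the real class's constructor and getExtensions() signature are not shown in this issue:

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of what an implemented testGetExtensions could check.
public class TestGetExtensionsSketch {

    // Stand-in for the real MimeType class, just to make the idea concrete.
    static class MimeType {
        private final List<String> extensions;
        MimeType(String... exts) { this.extensions = Arrays.asList(exts); }
        List<String> getExtensions() { return extensions; }
    }

    /** What the empty test method could verify instead of doing nothing. */
    public static void testGetExtensions() {
        MimeType html = new MimeType("html", "htm");
        if (!html.getExtensions().contains("html")) throw new AssertionError();
        if (!html.getExtensions().contains("htm")) throw new AssertionError();
        if (html.getExtensions().size() != 2) throw new AssertionError();
    }

    public static void main(String[] args) {
        testGetExtensions();   // throws if any expectation fails
        System.out.println("ok");
    }
}
```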



[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359502 ] 

Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

From an initial quick glance at much of the patch, I see that many of the existing working classes are simply rewritten, given different names, and with slightly different code. A lot of the patch completely removes entire files and then replaces them with somewhat similar files, instead of just patching the parts of the existing files with the new contributions. Additionally, authors are renamed on some of the files, which isn't a very friendly practice IMHO. I think a lot of this patch will need some further review to determine what in fact are actually new contributions of the patch, and what is still working and doesn't need to be replaced, no? Also, in general, it's bad practice to remove classes unless they are absolutely unnecessary, and this patch removes several classes that are seemingly still necessary (e.g., ParsePluginList, ParsePluginReader, etc.). 

That's just my two cents.



[jira] Updated: (NUTCH-133) ParserFactory does not work as expected

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]

Stefan Groschupf updated NUTCH-133:
-----------------------------------

    Attachment: Parserutil_test_patch.txt

A test that reproduces most of the problems; see the real-world sample url in the summary above.



[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359487 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

Thanks for this really very good description.
Just a quick note: I'm currently in the final steps of a new mime-type repository implementation (compliant with the freedesktop specification). So, I suggest not focusing on the mime-type issues for now.

About the MimeResolution being moved to the parser factory: +1.
(As you probably noticed from the comments in the code, it was planned... for when the new mime type repository becomes available. But unfortunately, it is taking more time than expected.)




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359603 ] 

Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

Just another comment on the issue, regarding the reported "bug" quoted below:

Problem: 
Currently the parser configuration is based on plugin ids, but one plugin can have several extensions, so a plugin can normally provide several parsers. This is not limited by design; the wrong values are simply used in the configuration process. 

Solution: 
Change plugin ids to extension ids in the parser configuration file, and also change the code in the parser factory to use extension ids everywhere. 


in my mind is not really a "bug" at all. It's intended behavior. While it is true that we based the redesign of the ParserFactory in NUTCH-88 on having the "pluginId" used as the key to map parsing plugins to mimeTypes, rather than the "extensionId", and while it is also true that a particular pluginId may have several extensions, the way that we have implemented NUTCH-88 is by no means limiting. Consider the following situation, in which I write a plugin "parse-foo", a "parsing only" plugin that provides several different parser implementations for handling the contentType "foo". Okay, so here's my question. Say I provided parser implementations A, B, C, and D. How do you know which one should get called, and in which order, for the content type foo? Should A get called first, because it's first in the plugin.xml file? Should D get called last because it is last? Neither of these is the only correct answer; they may be to some people, and they may not be to others.

Thus, the situation that you describe as a "problem" was, in our eyes, never really a problem per se. It all has to do with the way that parsing plugins are implemented in Nutch. To me, parsing plugins are really a special class of plugins. If you take a look at all the parse-xxx plugins in the $NUTCH_HOME/src/plugin directory, you see the following situation with respect to the parser implementations that each parsing plugin provides:

parse-ext - provides 2, however they are clearly separated based on the contentType (or mimeType)
parse-html - only provides 1
parse-js - only provides 1
parse-mp3 - only provides 1
parse-mspowerpoint - only provides 1
parse-msword - only provides 1
parse-pdf - only provides 1
parse-rss - only provides 1
parse-rtf - only provides 1
parse-text - only provides 1
parse-zip - only provides 1

Thus, all but one of the existing parser plugins provide only 1 parsing implementation. Furthermore, even where a parsing plugin provides 2 implementations, as with parse-ext, the way that NUTCH-88 works right now is still able to deal with that situation, as long as the 2 parsing implementations are different classes and handle different "mimeTypes" (or, as they are described in the plugin.xml, "contentTypes"); on the other hand, you can see why this wouldn't be an issue if both parsing implementations used the same class to handle different mimeTypes, as in the case of parse-ext.

Say we encounter the mimeType "foo", and we have a parse-foo plugin which provides two parsing implementation classes (i.e., classes that implement the org.apache.nutch.parse.Parser extension point interface), A and B, which are different classes; A handles the mimeType "foo2" and B handles the mimeType "foo". Consider also that in the parse-plugins.xml file we have mapped the mimeType "foo" to the "parse-foo" plugin. What happens now, after NUTCH-88, is that when "foo" is encountered by the protocol plugins and the ParserFactory is called to get a parser for the content returned from protocol land, the factory obtains a prioritized list of all the Parser implementation classes that belong to the parsing plugins mapped to the mimeType "foo" AND claim they can handle "foo". So, even though parse-foo in our example provides both the A and B parser implementations as a plugin, and even though we mapped the plugin "parse-foo" to the mimeType "foo" via parse-plugins.xml, the only parser implementation that will get returned is "B", because "B" is the only one that actually claims it can deal with "foo". Thus, NUTCH-88 still provides the intended behavior, in my mind, to deal with the issue that you claim is a bug.
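The matching rule in the parse-foo example boils down to a filter like the one below. The class and field names are hypothetical illustrations of the rule, not the actual NUTCH-88 code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a plugin may be mapped to a mimeType in parse-plugins.xml, but only
// the parser implementations that also *claim* that mimeType are returned.
public class ParserMatchDemo {

    static class ParserImpl {
        final String className;
        final String claimedType;   // the "contentType" declared in plugin.xml
        ParserImpl(String className, String claimedType) {
            this.className = className;
            this.claimedType = claimedType;
        }
    }

    /** From the mapped plugin's parsers, keep those that claim the mimeType. */
    static List<ParserImpl> parsersFor(List<ParserImpl> pluginParsers,
                                       String mimeType) {
        List<ParserImpl> matches = new ArrayList<>();
        for (ParserImpl p : pluginParsers) {
            if (p.claimedType.equals(mimeType)) {
                matches.add(p);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<ParserImpl> parseFoo = new ArrayList<>();
        parseFoo.add(new ParserImpl("A", "foo2"));
        parseFoo.add(new ParserImpl("B", "foo"));
        // Only B claims "foo", so only B comes back even though both A and B
        // belong to the mapped plugin.
        System.out.println(parsersFor(parseFoo, "foo").get(0).className);
    }
}
```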

Or, am I missing something here?

Thanks,
  Chris



> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify their source.
> From our point of view, the problems described here could be the cause of many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return correct but mixed-case header keys, such as 'Content-type' or 'content-Length', in the HTTP response header.
> That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content-type detection mechanism.
> We also noted that this is a common reason why PDF parsing fails, since Content-Length does not return the correct value.
> Sample:
> A server returns "text/HTML" or "application/PDF", or a key such as "Content-length"; see for example this URL:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and later convert all keys used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> Use lower case here and for all keys used to query header properties (including content-length); change: String key = line.substring(0, colonIndex); to String key = line.substring(0, colonIndex).toLowerCase();
> Problem:
> MimeTypes-based discovery (magic- and URL-based) is only done when the content type was not delivered by the web server. This happens rarely; mostly the problem was mixed-case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as delivered by the web server, and move content-type discovery from the protocol plugins to the component where parsing is done - the ParserFactory.
> Then just create a list of parsers for the content type returned by the server and for the custom-detected content type, and iterate over all parsers until we get a successful parse status.
> Problem:
> Content will be parsed even if the protocol reports an exception and a non-successful status; in such a case the content is new byte[0] anyway.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Currently, parser configuration is done based on plugin ids, but one plugin can have several extensions, so a plugin can normally provide several parsers. This is not a real limitation; the wrong values are simply used in the configuration process.
> Solution:
> Change plugin id to extension id in the parser configuration file, and also change the code in the parser factory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I noticed that some plugins call metaData.get("Content-Type") or content.getContentType();
> In theory these can return different values, since the content type could be detected by the MimeTypes util and thus differ from the value delivered in the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes util when the header does not contain any content-type information, or when there are problems with mixed-case keys.
> Solution:
> Take the content-type property out of the metadata and restrict access to this metadata to its own getter method.
> Problem:
> Most protocol plugins check whether the content type is null, and only in that case use the MimeTypes util. Since my patch moves mime-type detection to the parser factory - where, from my point of view, it belongs - this is now unnecessary code that we can remove from the protocol plugins. I never found a case where no content type was returned; only mixed-case keys were used.
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I didn't change this, since the more code I change, the less chance there is of getting the patch into the sources; I suggest we open a low-priority issue, and once we change the plugins we can remove the code.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler): there are empty test methods in TestMimeType:
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.
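The header-key fix proposed in the first Solution above can be sketched as follows. Rather than lower-casing every key at write time, an alternative with the same effect is to store the headers in a map ordered by String.CASE_INSENSITIVE_ORDER; the class and method names here are illustrative, not Nutch's actual HttpResponse code:

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {

    // Lookups succeed regardless of the case the server used for the key.
    private final Map<String, String> headers =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Parse one raw "Key: value" header line into the map. */
    void add(String line) {
        int colonIndex = line.indexOf(':');
        if (colonIndex < 0) {
            return; // malformed header line, ignore
        }
        String key = line.substring(0, colonIndex).trim();
        String value = line.substring(colonIndex + 1).trim();
        headers.put(key, value);
    }

    String get(String key) {
        return headers.get(key);
    }

    public static void main(String[] args) {
        CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
        h.add("Content-type: application/PDF");
        // Mixed-case server key, canonical-case lookup: still found.
        System.out.println(h.get("Content-Type")); // application/PDF
    }
}
```

This also matches RFC 2616, which defines HTTP header field names as case-insensitive, so no information is lost by normalizing the comparison.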

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359753 ] 

Doug Cutting commented on NUTCH-133:
------------------------------------

Stefan,

The primary reason to keep classes and method names the same is to simplify the evaluation of your patch.  A good patch should solve only one problem, and should change nothing unrelated to that problem.  Changes in indentation, etc. just make it harder for others to see what's really changed.  Cosmetic changes should be separate patches.

Jerome,

The extension and the declared content-type should both be used as hints to direct checking of magic.  If we have a known extension or content-type then we do not have to scan the entire list of mime types, but can rather first check the type(s) named by the extension and the content-type.  If these match the content then we're done.  This is an important optimization.  Only if those matches fail should we ever try matching all magic.  Does that make sense?
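The hint-first check described above might look roughly like this. MimeType and detect() are toy stand-ins for Nutch's MimeTypes util, and the magic prefixes are illustrative; the point is the two-phase order, hinted types before the full scan:

```java
import java.util.List;

public class HintFirstDetection {

    /** Illustrative mime type with a magic-byte prefix (not Nutch's real API). */
    static class MimeType {
        final String name;
        final byte[] magic;
        MimeType(String name, byte[] magic) {
            this.name = name;
            this.magic = magic;
        }
        boolean matches(byte[] content) {
            if (content.length < magic.length) return false;
            for (int i = 0; i < magic.length; i++) {
                if (content[i] != magic[i]) return false;
            }
            return true;
        }
    }

    static String detect(byte[] content, List<MimeType> hinted, List<MimeType> all) {
        // 1. Cheap path: check only the types hinted by the extension
        //    and the declared content-type.
        for (MimeType t : hinted) {
            if (t.matches(content)) return t.name;
        }
        // 2. Only if the hints fail, scan the full magic list.
        for (MimeType t : all) {
            if (t.matches(content)) return t.name;
        }
        return null;
    }

    public static void main(String[] args) {
        MimeType pdf = new MimeType("application/pdf", "%PDF".getBytes());
        MimeType html = new MimeType("text/html", "<htm".getBytes());
        byte[] content = "%PDF-1.4 ...".getBytes();
        // The declared type hints pdf; magic confirms it without a full scan.
        System.out.println(detect(content, List.of(pdf), List.of(html, pdf)));
    }
}
```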



[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359624 ] 

Doug Cutting commented on NUTCH-133:
------------------------------------

It would be great to have some junit tests which illustrate these problems.  If we can first all agree on the desired behaviour, then we can work on the appropriate fixes.  For example, we should have some tests which call ParseUtil.parse(Content) with various Content instances and check that these are parsed as we feel they should be.  Can you look at the failure cases from your test set and convert these to unit tests?  That way in the future we can be more certain that changes to the parser selection algorithm don't hurt the percentage of content that we can parse.
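A regression test of the kind suggested here could take roughly this shape. Content and chooseParser() below are toy stand-ins, since a real test would build org.apache.nutch.protocol.Content instances and call ParseUtil.parse(Content); the point is the structure - construct content with an awkward header and assert on the parser selected:

```java
public class ParserSelectionTest {

    /** Hypothetical minimal stand-in for Nutch's Content. */
    static class Content {
        final String url;
        final String contentTypeHeader;
        Content(String url, String contentTypeHeader) {
            this.url = url;
            this.contentTypeHeader = contentTypeHeader;
        }
    }

    /** Toy "parser selection": normalize the header case before matching. */
    static String chooseParser(Content c) {
        String type = c.contentTypeHeader == null
                ? "" : c.contentTypeHeader.toLowerCase();
        if (type.startsWith("text/html")) return "parse-html";
        if (type.startsWith("application/pdf")) return "parse-pdf";
        return "parse-text";
    }

    public static void main(String[] args) {
        // A mixed-case header must still select the pdf parser.
        Content pdf = new Content("http://example.com/a.pdf", "application/PDF");
        if (!chooseParser(pdf).equals("parse-pdf")) {
            throw new AssertionError("mixed-case pdf header not routed to parse-pdf");
        }
        Content html = new Content("http://example.com/b.html", "text/HTML");
        if (!chooseParser(html).equals("parse-html")) {
            throw new AssertionError("mixed-case html header not routed to parse-html");
        }
        System.out.println("all checks passed");
    }
}
```

Each failure case from the test set would become one such assertion, so future changes to the parser selection algorithm are checked against the known-bad inputs.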
