Posted to dev@nutch.apache.org by "wangxu (JIRA)" <ji...@apache.org> on 2007/05/30 02:05:15 UTC

[jira] Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get

contentType parse not correctly,,,,got empty content using readseg -get
-----------------------------------------------------------------------

                 Key: NUTCH-493
                 URL: https://issues.apache.org/jira/browse/NUTCH-493
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: java version "1.5.0_04"

Linux localhost 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
            Reporter: wangxu


I am using Nutch 0.9.
I found that the contents of many of my crawled pages are empty.
I then checked the log and found the corresponding warning: the ContentType is reported as "url=http://......", and no
suitable parser can be found for the page:


parser not found for contentType=
url=http://product.dangdang.com/product.aspx?product_id=490321


The contents of most pages of this kind are empty,
but I did not find any warning or error other than "timeout" in the fetcher log.

Can somebody explain why?
Many thanks!



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-493.
-------------------------------

    Resolution: Invalid
      Assignee: Doğacan Güney

This is not a bug. When the fetcher was unable to fetch a page, it created empty content for it. Such empty content is not parseable, hence the warnings you are seeing in your log.

After NUTCH-443, the fetcher will not create empty content for such pages, so you should not see these warnings in your log anymore.

Also, please use the nutch-user mailing list to ask questions.
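
To illustrate the explanation above, here is a minimal Java sketch of how an empty fetch result could end up producing the warning from the report. It is not Nutch's actual code: the FetchedPage class, the PARSERS registry, and the parse method are hypothetical names made up for this example, and the URL is a placeholder. Only the shape of the warning text is taken from the log in the report. Because the content type of a failed fetch is empty, the message reads "contentType=" followed directly by "url=...", which can look as if the URL were the content type.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EmptyContentSketch {

    /** Hypothetical stand-in for a fetch result; this is not Nutch's own Content class. */
    public static final class FetchedPage {
        final String url;
        final String contentType; // empty string when the fetch failed (assumption)
        final byte[] body;        // zero-length when the fetch failed (assumption)

        public FetchedPage(String url, String contentType, byte[] body) {
            this.url = url;
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Hypothetical parser registry keyed by content type. */
    private static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", "html-parser");
    }

    /** Looks up a parser by content type and reports what happened. */
    public static String parse(FetchedPage page) {
        String parser = PARSERS.get(page.contentType);
        if (parser == null) {
            // An empty content type matches no parser, so the warning
            // shows nothing between "contentType=" and "url=".
            return "parser not found for contentType=" + page.contentType
                    + " url=" + page.url;
        }
        return "parsed " + page.body.length + " bytes with " + parser;
    }

    public static void main(String[] args) {
        // A fetch that timed out: no content type and no body.
        FetchedPage timedOut =
                new FetchedPage("http://example.com/page", "", new byte[0]);
        // A successful fetch of an HTML page.
        FetchedPage ok = new FetchedPage("http://example.com/page", "text/html",
                "<html></html>".getBytes(StandardCharsets.UTF_8));

        System.out.println(parse(timedOut));
        System.out.println(parse(ok));
    }
}
```

In this sketch the warning only goes away if the empty result is never handed to the parser, which is the behaviour described for the fetcher after NUTCH-443.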

