Posted to dev@nutch.apache.org by "Marek Bachmann (JIRA)" <ji...@apache.org> on 2011/08/24 16:10:29 UTC

[jira] [Created] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

LinkDb (invertlinks) should inform the user when it ignores internal links
--------------------------------------------------------------------------

                 Key: NUTCH-1090
                 URL: https://issues.apache.org/jira/browse/NUTCH-1090
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 1.3
            Reporter: Marek Bachmann
            Priority: Trivial
             Fix For: 1.3


I used Nutch to crawl sites on a single domain. After the crawl was complete I tried to build a LinkDb, but the LinkDb was empty.
It turns out that this happens because the invertlinks command ignores internal links (links within the same domain) by default.

Unfortunately, the LinkDb class doesn't report this anywhere, so it was hard to find out why the LinkDb was empty.

I suggest adding a message that informs the user when the invertlinks command is ignoring internal links.
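
To illustrate why a single-domain crawl can produce an empty LinkDb, here is a minimal, self-contained sketch of an internal-link filter. It is not the Nutch code: the class {{InternalLinkFilter}} and its methods are hypothetical names, and it assumes "internal" means that the source and target URLs share the same host.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the filtering step; not part of Nutch.
public class InternalLinkFilter {

    /** Returns the lower-cased host of a URL, or null if it cannot be determined. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: treat as having no host
        }
    }

    /** Returns the link targets that survive the filter for one source page. */
    static List<String> keptTargets(String fromUrl, List<String> toUrls,
                                    boolean ignoreInternalLinks) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String to : toUrls) {
            // Assumption: a link is "internal" when both hosts are equal.
            boolean internal = fromHost != null && fromHost.equals(hostOf(to));
            if (ignoreInternalLinks && internal) {
                continue; // silently dropped, which is what made the empty LinkDb confusing
            }
            kept.add(to);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/a",
                "http://example.org/b",
                "http://other.example.com/");
        // With the filter on, only the link to the other host is kept.
        System.out.println(keptTargets("http://example.org/", outlinks, true));
        // With the filter off, all three links are kept.
        System.out.println(keptTargets("http://example.org/", outlinks, false));
    }
}
```

If every crawled page lives on one host, the filtered list is empty for every page, so nothing is left to invert.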

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090257#comment-13090257 ] 

Markus Jelsma commented on NUTCH-1090:
--------------------------------------

You can patch o.a.n.crawl.LinkDb.configure() to log this information.

> LinkDb (invertlinks) should inform the user when it ignores internal links
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1090
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1090
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.3
>            Reporter: Marek Bachmann
>            Priority: Trivial
>              Labels: configuration, information, log
>             Fix For: 1.3
>
>         Attachments: LinkDb.patch
>
>
> I used Nutch to crawl sites on a single domain. After the crawl was complete I tried to build a LinkDb, but the LinkDb was empty.
> It turns out that this happens because the invertlinks command ignores internal links (links within the same domain) by default.
> Unfortunately, the LinkDb class doesn't report this anywhere, so it was hard to find out why the LinkDb was empty.
> I suggest adding a message that informs the user when the invertlinks command is ignoring internal links.


[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Marek Bachmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marek Bachmann updated NUTCH-1090:
----------------------------------

    Attachment: LinkDb.patch

Inserted a {{LOG.info}} statement in the {{invert}} method that fires when db.ignore.internal.links is set to true.
Added a constant {{IGNORE_INTERNAL_LINKS}} for the {{"db.ignore.internal.links"}} string.
Moved the creation of the {{JobConf}} object to the top of the {{invert}} method.
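
The three changes above can be sketched roughly as follows. This is not the attached patch: {{java.util.Properties}} stands in for the Hadoop {{JobConf}}, {{java.util.logging}} stands in for the logger used by LinkDb, and the class name {{InvertLoggingSketch}} and the message text are hypothetical. The default of true follows the issue description, which says internal links are ignored by default.

```java
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical sketch; Properties replaces JobConf so the example runs without Hadoop.
public class InvertLoggingSketch {

    private static final Logger LOG =
            Logger.getLogger(InvertLoggingSketch.class.getName());

    /** Constant for the property name, so the string literal appears only once. */
    public static final String IGNORE_INTERNAL_LINKS = "db.ignore.internal.links";

    /**
     * Simulates the start of an invert run.
     * Returns the message that was logged, or null if nothing was logged.
     */
    static String invert(Properties conf) {
        // The configuration is available at the top of the method, so the
        // setting can be read before any work is scheduled.
        boolean ignoreInternalLinks =
                Boolean.parseBoolean(conf.getProperty(IGNORE_INTERNAL_LINKS, "true"));

        String message = null;
        if (ignoreInternalLinks) {
            message = "LinkDb: internal links will be ignored."; // hypothetical wording
            LOG.info(message);
        }
        // ... the actual link inversion would follow here ...
        return message;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        invert(defaults); // logs the notice, because the assumed default is true

        Properties keepInternal = new Properties();
        keepInternal.setProperty(IGNORE_INTERNAL_LINKS, "false");
        invert(keepInternal); // logs nothing
    }
}
```

Logging once per run in {{invert}}, rather than in {{configure}}, avoids repeating the message for every task that is configured.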


[jira] [Assigned] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1090:
------------------------------------

    Assignee: Markus Jelsma
    

[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Julien Nioche (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1090:
---------------------------------

    Fix Version/s:     (was: 1.3)
                   1.5
    

[jira] [Issue Comment Edited] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090267#comment-13090267 ] 

Markus Jelsma edited comment on NUTCH-1090 at 8/24/11 2:48 PM:
---------------------------------------------------------------

Yes, the job object is created there. The configuration can then be read as in the 
configure method.


      was (Author: markus17):
    Yes, the job object is created there. The can then be read like in the 
configure method.


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

  

[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Marek Bachmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marek Bachmann updated NUTCH-1090:
----------------------------------

    Attachment:     (was: LinkDb.patch)


[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Marek Bachmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090260#comment-13090260 ] 

Marek Bachmann commented on NUTCH-1090:
---------------------------------------

Then I did it right. Thanks


[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150414#comment-13150414 ] 

Hudson commented on NUTCH-1090:
-------------------------------

Integrated in nutch-trunk-maven #26 (See [https://builds.apache.org/job/nutch-trunk-maven/26/])
    NUTCH-1090 InvertLinks should inform when ignoring internal links

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1202143
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

                

[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Marek Bachmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marek Bachmann updated NUTCH-1090:
----------------------------------

    Attachment: LinkDb.patch

I inserted a {{LOG.info}} message in the {{configure}} method.
I don't think this is the best place, but the {{ignoreInternalLinks}} variable isn't set before this method is called.

Hope the format of the patch file is correct. I never posted one before :)


[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090267#comment-13090267 ] 

Markus Jelsma commented on NUTCH-1090:
--------------------------------------

Yes, the job object is created there. The configuration can then be read as in the 
configure method.


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



[jira] [Resolved] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1090.
----------------------------------

    Resolution: Fixed

Committed for 1.4 in rev. 1202143.
Thanks!
                

[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151040#comment-13151040 ] 

Hudson commented on NUTCH-1090:
-------------------------------

Integrated in Nutch-trunk #1665 (See [https://builds.apache.org/job/Nutch-trunk/1665/])
    NUTCH-1090 InvertLinks should inform when ignoring internal links

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1202143
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

                

Re: [jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, the job object is created there. The configuration can then be read as in the 
configure method.

On Wednesday 24 August 2011 16:40:29 Marek Bachmann (JIRA) wrote:
>     [
> https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.p
> lugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090264#comm
> ent-13090264 ]
> 
> Marek Bachmann commented on NUTCH-1090:
> ---------------------------------------
> 
> Ok, I thought so too. But I was unsure that it is possible and how to read
> the conf from there. Will have a look at it.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Marek Bachmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090264#comment-13090264 ] 

Marek Bachmann commented on NUTCH-1090:
---------------------------------------

OK, I thought so too, but I was unsure whether it is possible and how to read the conf from there. I will have a look at it.


[jira] [Commented] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090261#comment-13090261 ] 

Markus Jelsma commented on NUTCH-1090:
--------------------------------------

Looking at it, I feel that writing the message in the invert method is cleaner. You can read the configuration setting there as well.
