Posted to dev@nutch.apache.org by "Thorsten Scherler (JIRA)" <ji...@apache.org> on 2006/11/24 14:24:01 UTC

[jira] Created: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Make Nutch crawling parent directories for file protocol configurable
---------------------------------------------------------------------

                 Key: NUTCH-407
                 URL: http://issues.apache.org/jira/browse/NUTCH-407
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.8
            Reporter: Thorsten Scherler


http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html

I am looking into fixing some very weird behavior of the file protocol.
I am using 0.8.

Researching this topic I found 
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
and
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu, but I have the same problem: Nutch is going down the
tree (including parents) and not up (including children of the root
URL).

Further, I would vote to make fetching parents optional, controlled by
a property, so that I can choose whether I want this not very intuitive "feature".


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453523 ] 
            
Andrzej Bialecki  commented on NUTCH-407:
-----------------------------------------

As far as I understand it, the original issue that you refer to (and your issue) both come from misconfigured URLFilters - I don't understand why this fix is needed if you configure them properly.

First, let's establish the names for directions - normally "up" refers to a parent directory, and "down" refers to a child directory.

The current behavior is to collect ANY URLs that we find pointing out from the current URL, unless prohibited by filters. When crawling a local FS, unless your URLFilters prohibit collecting parent dirs, it will collect those URLs too - that's why it behaved the way it did. This behavior is consistent with HTTP and FTP crawling.

So, instead of your "special case" fix, you should simply put the root directory in your URLFilters configuration. E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt:

+^file:///c:/top/directory/
-.
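
To see why these two rules keep the crawl below the root directory, here is a minimal Python sketch of first-match-wins filtering. It assumes that rules are tried in file order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The names parse_rules and accepts are hypothetical helpers, not Nutch APIs, and Python's re module only approximates the Java regex engine that Nutch uses.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):  # unanchored search; '^' in a rule anchors it
            return accept
    return False  # assumption: a URL that matches no rule is rejected

# The two rules from the comment above.
RULES = parse_rules("""
+^file:///c:/top/directory/
-.
""")

print(accepts(RULES, "file:///c:/top/directory/sub/a.txt"))  # child of the root
print(accepts(RULES, "file:///c:/top/"))                     # parent of the root
```

Under these assumptions the child URL is accepted by the first rule, while the parent URL falls through to the catch-all "-." rule and is dropped.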


[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Godmar Back (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797398#action_12797398 ] 

Godmar Back commented on NUTCH-407:
-----------------------------------

Hardwiring a directory such as "file:///c:/top/directory" in urlfilter-regex is a bizarre idea that does not solve this problem.

It would tie the 'urls' file specified in 'nutch crawl' to the urlfilter-regex in conf/, which is obviously bad. (Alternatively, if each 'urls' file required a separate conf/ directory, then 'urls' should be made part of conf/.)

Please accept the original patch or find a better solution.




[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453934 ] 
            
Chris A. Mattmann commented on NUTCH-407:
-----------------------------------------

I'm not entirely sure what the right answer to this is. One thing that I do know is that a colleague at my own work ran into this exact same issue while first attempting to use Nutch on his enterprise search application. It confused the heck out of him, and he ended up including in the urlfilter-regex what Andrzej mentions above, i.e., only crawl from the top level down. He mentioned to me that he thought this was a "kludge", and I can't say that I disagreed with him. My +1 for figuring out a better way to solve this problem...


[jira] Updated: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-407?page=all ]

Thorsten Scherler updated NUTCH-407:
------------------------------------

    Attachment: 407.fix.diff

This patch fixes the issue by letting the user decide whether or not to crawl parent dirs. The new property is file.crawl.parent, defined in nutch-default.xml.

I did not change the default behavior, meaning by default nutch will crawl the parents for the file protocol. 
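
As an illustration of what such a switch controls, here is a minimal Python sketch, not the code from 407.fix.diff. The function directory_outlinks and its crawl_parent argument are hypothetical; crawl_parent stands in for the file.crawl.parent property, and the sketch assumes that a directory URL starts with "file://" and that its outlinks are its children plus, optionally, its parent.

```python
import posixpath

def directory_outlinks(dir_url, child_names, crawl_parent=True):
    """List the URLs that a file: directory page would link to.

    crawl_parent is a stand-in for the file.crawl.parent property; True
    mirrors the default described above, where the parent is included.
    """
    prefix = "file://"
    base = dir_url if dir_url.endswith("/") else dir_url + "/"
    links = [base + name for name in child_names]
    path = base[len(prefix):]
    # The filesystem root has no parent, so never emit one for "/".
    if crawl_parent and path != "/":
        parent = posixpath.dirname(path.rstrip("/"))
        if not parent.endswith("/"):
            parent += "/"
        links.append(prefix + parent)
    return links

print(directory_outlinks("file:///data/docs/", ["a.txt", "sub/"]))
print(directory_outlinks("file:///data/docs/", ["a.txt", "sub/"], crawl_parent=False))
```

With crawl_parent left at True the list ends with the parent URL file:///data/, which is how a crawl seeded at a subdirectory can climb towards the filesystem root; with False only the children remain.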


[jira] Closed: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-407?page=all ]

Andrzej Bialecki  closed NUTCH-407.
-----------------------------------

    Resolution: Invalid
      Assignee: Andrzej Bialecki 

Intended behavior can be realized using URLFilters.


[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453530 ] 
            
Thorsten Scherler commented on NUTCH-407:
-----------------------------------------

Hi Andrzej, thanks for your answer.
http://wiki.apache.org/nutch/FAQ#head-f64e7589b2f12792d6d781f3db23840a8f3a1e10
I made a note in the FAQ. I was about to close this issue as "won't fix" but do not have the rights to do so.
Can someone close it?


[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

Posted by "Alan Tanaman (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 
            
Alan Tanaman commented on NUTCH-407:
------------------------------------

In our team we feel that this patch would have been beneficial in practical terms.  In the context of the enterprise intelligence solution which we are gradually porting over to Nutch, the emphasis is on ease of configuration.  We try to avoid exposing features such as the regex filter, which, although very powerful for a more experienced user, are perhaps confusing to the novice.  This is because we are primarily focused on the enterprise and less on the WWW.

This is why we preconfigure the db.ignore.external.links property to "true", and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for specific scenarios -- e.g. Enterprise-XML, Enterprise-Documents, Enterprise-Database, Internet-News etc.  We have a script that generates multiple crawlers, each one with different sources to be crawled, and although possible, it isn't the most practical to change the filters for each one manually based on the individual user requirements.
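
For comparison, the filter-based alternative could in principle be automated per crawler rather than edited by hand. The following Python sketch is only an assumption about how that might look: rules_for_seeds is a hypothetical helper, it treats every seed URL as a directory, and it assumes that the output of re.escape is also acceptable to the Java regex engine for typical paths.

```python
import re

def rules_for_seeds(seed_urls):
    """Build regex-urlfilter style text that accepts only URLs under the seeds."""
    lines = []
    for url in seed_urls:
        root = url if url.endswith("/") else url + "/"  # assumption: seeds are directories
        lines.append("+^" + re.escape(root))
    lines.append("-.")  # reject everything else
    return "\n".join(lines) + "\n"

print(rules_for_seeds(["file:///data/docs", "file:///srv/reports/"]))
```

Each generated crawler would then get its own filter file derived from its own urls file, which is exactly the coupling between seeds and configuration discussed in this issue.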

I realise this patch is closed, but how about another approach in which FileResponse.java looks at db.ignore.external.links and decides, based on this, whether to go up the tree?

Obviously, this would also prevent you from crawling outlinks to the WWW embedded in documents, but when crawling an enterprise file system, you usually don't want to go all over the place anyway.  As I see it, file systems are different to the web in that they are inherently hierarchical whereas the web is, as its name implies, non-hierarchical.  Therefore, when crawling a file system, "going up" the tree is just as much an external URI (so to speak) as a link to a web site.
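
A minimal Python sketch of that idea, under stated assumptions: is_external is a hypothetical function, not part of FileResponse.java or of the existing db.ignore.external.links handling. For file: URLs it treats anything outside the seed directory as external; for other schemes it falls back to comparing scheme and host. Paths are compared as plain strings, so ".." segments are not resolved here.

```python
from urllib.parse import urlparse

def is_external(seed_url, link_url):
    """Decide whether link_url counts as external relative to seed_url."""
    seed, link = urlparse(seed_url), urlparse(link_url)
    if seed.scheme == "file" and link.scheme == "file":
        # Hierarchical rule: only paths below the seed directory are internal.
        seed_dir = seed.path if seed.path.endswith("/") else seed.path + "/"
        return not link.path.startswith(seed_dir)
    # Fallback for other schemes: a different scheme or host is external.
    return (seed.scheme, seed.netloc.lower()) != (link.scheme, link.netloc.lower())

seed = "file:///data/docs/"
print(is_external(seed, "file:///data/docs/sub/a.txt"))  # below the seed
print(is_external(seed, "file:///data/"))                # parent directory
print(is_external(seed, "http://example.com/"))          # web outlink
```

A host-only comparison cannot make this distinction for file: URLs, because they normally carry an empty host, which is why the sketch compares paths instead.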

*Ducks for cover*

Alan
