Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/09/07 17:31:00 UTC

Subcollection Plugin issue - Branch 1.2

Hi all,

 

I am having a small issue with the subcollection plugin.  I am using the 1.2
branch, but I believe I have seen this in 1.1 as well.

I have the following XML for this plugin.

 

<subcollections>

<subcollection>

<name>stylebook</name>

<id>sb</id>

<whitelist>http://mysite.mydomain.com/guidance/</whitelist>

<whitelist>http://mysite/guidance/</whitelist>

<blacklist></blacklist>

</subcollection>

<subcollection>

<name>briefingbooks</name>

<id>bbs</id>

<whitelist>http://mysite.mydomain.com/research/briefing_books</whitelist>

<whitelist>http://mysite/research/briefing_books</whitelist>

<whitelist>http://mysite.mydomain.com/wx</whitelist>

<whitelist>http://mysite/wx</whitelist>

<whitelist>http://mysite.mydomain.com/wx/inthenews</whitelist>

<whitelist>http://mysite/wx/inthenews</whitelist>

<blacklist></blacklist>

</subcollection>

</subcollections>

 

Here is the relevant part of my urlregex filter file:

 

# allow urls in foofactory.fi domain (or lucidimagination.com...)

#+^http://([a-z0-9\-A-Z]*\.)*turner.com/

+^http://mysite\.mydomain\.com/guidance/

+^http://mysite/guidance/

+^http://mysite\.mydomain\.com/research/briefing_books/

+^http://mysite/research/briefing_books/

+^http://mysite/wx/

+^http://mysite\.mydomain\.com/wx/

# deny anything else 

-.

                                          

My seed URLs are:

 

http://mysite.mydomain.com/guidance/wiki/index.php/Stylebook

http://mysite.mydomain.com/research/briefing_books/index.htm
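
A minimal sketch for checking the seed URLs against the rules above. It assumes the filter applies the rules top to bottom, lets the first matching rule decide, and rejects a URL that matches no rule. The stray spaces in the posted hostnames are removed on the assumption that they are anonymization residue. The class and method names are hypothetical and this is not the Nutch filter implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a first-match-wins +/- regex URL filter; not Nutch code. */
final class RegexRuleSketch {

    /** One rule line: a sign ('+' accept, '-' reject) followed by a regular expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and '#' comment lines are skipped. */
    RegexRuleSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means reject (assumption). */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** The rules from the post, with the stray spaces in the hostnames removed. */
    static RegexRuleSketch rulesFromPost() {
        return new RegexRuleSketch(
            "# allow the listed paths, deny anything else",
            "+^http://mysite\\.mydomain\\.com/guidance/",
            "+^http://mysite/guidance/",
            "+^http://mysite\\.mydomain\\.com/research/briefing_books/",
            "+^http://mysite/research/briefing_books/",
            "+^http://mysite/wx/",
            "+^http://mysite\\.mydomain\\.com/wx/",
            "-.");
    }

    public static void main(String[] args) {
        RegexRuleSketch filter = rulesFromPost();
        String[] urls = {
            "http://mysite.mydomain.com/guidance/wiki/index.php/Stylebook",
            "http://mysite.mydomain.com/research/briefing_books/index.htm",
            "http://mysite.mydomain.com/wx/inthenews/page.htm", // hypothetical URL
            "http://example.org/"                                // hypothetical URL
        };
        for (String url : urls) {
            System.out.println((filter.accepts(url) ? "accept " : "reject ") + url);
        }
    }
}
```

Under these assumptions both seed URLs and a /wx/inthenews URL pass the rules, while an unrelated host falls through to the final deny rule. If that holds for the real filter, the URL filter is unlikely to be what keeps the /wx/inthenews pages from being tagged.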

 

The problem I have is with the second subcollection.  Any URLs that
start with http://mysite.mydomain.com/wx/inthenews or
http://mysite/wx/inthenews are not getting the subcollection name
"briefingbooks" in the subcollection field.

On the other hand, URLs matching the other three whitelisted patterns, namely
http://mysite.mydomain.com/research/briefing_books,
http://mysite/research/briefing_books and http://mysite/wx, are getting the
subcollection name in the subcollection field properly.

 

Any idea why this could be happening?  I appreciate your help.

 


RE: Subcollection Plugin issue - Branch 1.2

Posted by "Nemani, Raj" <Ra...@turner.com>.
First of all, sorry for the long email.  I hope the detail helps somebody spot the issue right away before I have to fumble through it :).  It is entirely possible that I am missing something.

Now to Markus' question: yes, the documents are in the index (both Nutch and Solr).  I am trying to debug this, but what I am seeing does not make sense.  I put log statements in Subcollection.java where it iterates through the whitelist, and I do not see the code iterating through all the whitelisted patterns for a given collection.  Whatever it does, it also does twice.  That may be because I have two subcollections, but the whitelist iterated for each collection is nowhere near complete.  I expected more than one "Pattern" log statement, because one collection has two patterns and the other has five or six (see the XML in my original message).

Here is the log for one URL:

2010-09-07 14:52:11,074 INFO  collection.Subcollection - URL trying to match:http://mysite.mydomain.com/guidance/wiki/index.php/STYLE-BOOK_H
2010-09-07 14:52:11,074 INFO  collection.Subcollection - URL trying to match:http:// mysite.mydomain.com /guidance/wiki/index.php/STYLE-BOOK_H

2010-09-07 14:52:11,074 INFO  collection.Subcollection - Pattern:http:// mysite.mydomain.com /research/briefing_books/
2010-09-07 14:52:11,074 INFO  collection.Subcollection - Pattern:http:// mysite.mydomain.com /research/briefing_books/

2010-09-07 14:52:11,074 INFO  collection.Subcollection - Url not matched
2010-09-07 14:52:11,074 INFO  collection.Subcollection - Url not matched

Below is the snippet of code I am talking about, from Subcollection.java in the subcollection plugin.  Basically, SubcollectionIndexingFilter.java calls the getSubCollections method in CollectionManager, which in turn calls the following filter method in Subcollection.java for each subcollection.  I just added log statements to it.  I have Eclipse set up for 1.1 but not for 1.2 yet.  Eclipse does not come naturally to me because I come from a C# and Visual Studio background; I can just about get by in Java.  Any help is appreciated.  I have a feeling that the whitelists and blacklists are not being fully populated as the code goes through each subcollection to check whether a URL matches a pattern.

public String filter(String urlString) {
  // first the blacklist: any blacklisted substring rejects the URL
  Iterator i = blackList.iterator();
  while (i.hasNext()) {
    String row = (String) i.next();
    if (urlString.indexOf(row) != -1)
      return null;
  }

  // then the whitelist: the first whitelisted substring accepts the URL
  i = whiteList.iterator();
  if (LOG.isInfoEnabled()) { LOG.info("URL trying to match:" + urlString); }
  while (i.hasNext()) {
    String row = (String) i.next();
    if (LOG.isInfoEnabled()) { LOG.info("Pattern:" + row); }
    if (urlString.indexOf(row) != -1) {
      if (LOG.isInfoEnabled()) { LOG.info("matched"); }
      return urlString;
    }
  }
  if (LOG.isInfoEnabled()) { LOG.info("Url not matched"); }
  return null;
}
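
The filter method above matches with indexOf, so each list entry is a plain substring and not a regular expression. The following sketch illustrates what that implies for the posted whitelist. The class name, method name and test URL are hypothetical, and the "first pattern only" list is just one possible reading of the single Pattern line in the log.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical demo of substring-based whitelist matching; not the plugin's code. */
final class SubstringMatchSketch {

    /** Returns the first whitelist entry contained in the URL, or null if none is. */
    static String firstMatch(List<String> whiteList, String url) {
        for (String pattern : whiteList) {
            if (url.indexOf(pattern) != -1) {
                return pattern;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://mysite/wx/inthenews/story.htm"; // hypothetical URL

        // The six patterns intended for the "briefingbooks" subcollection.
        List<String> full = Arrays.asList(
            "http://mysite.mydomain.com/research/briefing_books",
            "http://mysite/research/briefing_books",
            "http://mysite.mydomain.com/wx",
            "http://mysite/wx",
            "http://mysite.mydomain.com/wx/inthenews",
            "http://mysite/wx/inthenews");

        // A list that ended up holding only the first pattern.
        List<String> firstOnly = Collections.singletonList(
            "http://mysite.mydomain.com/research/briefing_books");

        System.out.println(firstMatch(full, url));      // http://mysite/wx
        System.out.println(firstMatch(firstOnly, url)); // null
    }
}
```

With the full list, the shorter entry http://mysite/wx already covers every /wx/inthenews URL, so the two inthenews entries add nothing under substring matching. A URL of that form going untagged therefore points at how the list is populated, not at the matching loop.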


RE: Subcollection Plugin issue - Branch 1.2

Posted by "Nemani, Raj" <Ra...@turner.com>.
I think I resolved the issue.
The way to set up subcollections.xml is NOT this:

<subcollections>
<subcollection>
<name>stylebook</name>
<id>sb</id>
<whitelist>http://mysite.mydomain.com/guidance/</whitelist>
<whitelist>http://mysite/guidance/</whitelist>
<blacklist></blacklist>
</subcollection>
</subcollections>

It needs to be set up the following way:

<?xml version="1.0"?>
<subcollections>
<subcollection>
<name>stylebook</name>
<id>sb</id>
<whitelist>http://cnnlibrary.turner.com/guidance/
http://cnnlibrary/guidance/</whitelist>
<blacklist></blacklist>
</subcollection>
</subcollections>

Each pattern needs to be separated by a \n (line feed or similar, depending on the OS you are on).  This is not how the example in package.html in the plugin's source folder is written, which led to confusion on my part.  The subcollection template XML that comes with the source will work as long as you realize that the line feed after each pattern in either the white or black list is there for a reason; otherwise the tokenizer code in the plugin will not work.  If you don't know this, you might remove the line feed without realizing it.  If you have just one pattern in the list, you will obviously not see the issue.

I wonder whether a line feed is a good delimiter in general anyway.  Multiple whitelist/blacklist elements would have been more appropriate, given that it is an XML file already.  It would be great if a few lines of clarification could be added to package.html and/or the subcollection template XML file in the 1.2 branch.
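
The line-feed behaviour described above can be sketched as follows. This assumes the text content of a single whitelist or blacklist element is split on line breaks, with each non-empty line becoming one substring pattern. The class and method names are hypothetical and this is not the plugin's actual parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical sketch of line-based pattern parsing; not the plugin's code. */
final class PatternListSketch {

    /** Splits an element's text into trimmed, non-empty patterns, one per line. */
    static List<String> parsePatterns(String elementText) {
        List<String> patterns = new ArrayList<>();
        StringTokenizer lines = new StringTokenizer(elementText, "\r\n");
        while (lines.hasMoreTokens()) {
            String line = lines.nextToken().trim();
            if (!line.isEmpty()) {
                patterns.add(line);
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Two patterns separated by a line feed inside one <whitelist> element.
        String withLineFeed = "http://mysite.mydomain.com/guidance/\nhttp://mysite/guidance/";
        // The same two patterns with the line feed replaced by a space.
        String withoutLineFeed = "http://mysite.mydomain.com/guidance/ http://mysite/guidance/";

        System.out.println(parsePatterns(withLineFeed));    // two separate patterns
        System.out.println(parsePatterns(withoutLineFeed)); // one pattern that contains a space
    }
}
```

Under this assumption, removing the line feed merges the entries into a single pattern that contains a space, and a substring check against a real URL would then never succeed.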

Thank you
Raj


RE: Subcollection Plugin issue - Branch 1.2

Posted by Markus Jelsma <ma...@buyways.nl>.
Just a guess; did the fetcher really fetch documents from those urls?
 