Posted to user@nutch.apache.org by claudiuchis <cl...@gmail.com> on 2013/07/30 20:10:47 UTC

SolrClean not available in nutch 2.x

Hi folks,

I have recently upgraded from nutch 1.6 to 2.2.1, and realized that
SolrClean is not shipped with the newer version of nutch.
I want to be able to delete links with status = 3 (404's) from Solr, and
SolrClean did the job in nutch 1.x. 
Is there a way I can achieve the same result with nutch 2.x?

Any help is much appreciated.

Thanks,
Claudiu.




--
View this message in context: http://lucene.472066.n3.nabble.com/SolrClean-not-available-in-nutch-2-x-tp4081385.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks.
Great job.


On Wed, Jul 31, 2013 at 5:45 PM, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis,
>
> I've created patch NUTCH-1294-v3.patch.
> Here are the steps I followed:
>
> $ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1
> $ cd release-2.2.1
> $ patch -p0 < NUTCH-1294-v2.patch
> # manually patched "src/bin/nutch" and "conf/log4j.properties"
> $ ant
> $ svn diff > NUTCH-1294-v3.patch
> # attached the new patch up on jira
>
> With this patch, all the files in the patch are deployed successfully. In
> the previous patch (v2), "src/bin/nutch" and "conf/log4j.properties" had to
> be patched manually.
>
> As I said, the task is working fine, i.e. documents with status = 3 are
> removed from Solr.
> The only caveat is that you need to set storage.crawl.id in nutch-site.xml
> if the crawling was done with a crawl_id, otherwise the solr clean task
> will not do anything.
>
> Thanks,
> Claudiu.



-- 
*Lewis*

Re: SolrClean not available in nutch 2.x

Posted by Julien Nioche <li...@gmail.com>.
Hi Claudiu,

You definitely got the idea, but it is usually better to generate a patch
against the 'live' branch rather than a release tag, as the code may have
changed since the latest release. The live branch for 2.x is at

https://svn.apache.org/repos/asf/nutch/branches/2.x

Julien

On 1 August 2013 01:45, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis,
>
> I've created patch NUTCH-1294-v3.patch.
> Here are the steps I followed:
>
> $ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1
> $ cd release-2.2.1
> $ patch -p0 < NUTCH-1294-v2.patch
> # manually patched "src/bin/nutch" and "conf/log4j.properties"
> $ ant
> $ svn diff > NUTCH-1294-v3.patch
> # attached the new patch up on jira
>
> With this patch, all the files in the patch are deployed successfully. In
> the previous patch (v2), "src/bin/nutch" and "conf/log4j.properties" had to
> be patched manually.
>
> As I said, the task is working fine, i.e. documents with status = 3 are
> removed from Solr.
> The only caveat is that you need to set storage.crawl.id in nutch-site.xml
> if the crawling was done with a crawl_id, otherwise the solr clean task
> will not do anything.
>
> Thanks,
> Claudiu.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis,

I've created patch NUTCH-1294-v3.patch. 
Here are the steps I followed:

$ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1
$ cd release-2.2.1
$ patch -p0 < NUTCH-1294-v2.patch
# manually patched "src/bin/nutch" and "conf/log4j.properties"
$ ant
$ svn diff > NUTCH-1294-v3.patch
# attached the new patch up on jira

With this patch, all the files it covers are patched successfully. With
the previous patch (v2), "src/bin/nutch" and "conf/log4j.properties" had to
be patched manually.

As I said, the task is working fine, i.e. documents with status = 3 are
removed from Solr.
The only caveat is that you need to set storage.crawl.id in nutch-site.xml
if the crawl was done with a crawl_id; otherwise the solrclean task will
not do anything.

Thanks,
Claudiu.







Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Claudiu,
Could you please attach your new patch to the issue, if possible, so we can
try it out? I would be keen to get this into the codebase.
Thank you very much for getting back here.
Best
Lewis


On Wed, Jul 31, 2013 at 2:42 PM, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis,
>
> The SolrClean utility is working fine.
>
> The problem was on my side, i.e. I did the initial crawling with a
> crawl_id, but this id was not picked when running solr clean (I didn't have
> this id in nutch-site.xml).
>
> I found this out when I got to the StorageUtils.java and saw how the web
> store was created.
>
>     String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
>
>     if (!crawlId.isEmpty()) {
>       conf.set("schema.prefix", crawlId + "_");
>     } else {
>       conf.set("schema.prefix", "");
>     }
>
> This was the reason the map method in the CleanMapper was not called, as
> the web store was empty (was using the one without prefix which didn't
> exist).
>
> I have now added the crawl_id to nutch-site.xml, so the correct web store
> is used when doing a solr clean.
>
> Claudiu.



-- 
*Lewis*

Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis,

The SolrClean utility is working fine.

The problem was on my side, i.e. I did the initial crawl with a crawl_id,
but this id was not picked up when running solr clean (I didn't have this id
in nutch-site.xml).

I found this out when I got to the StorageUtils.java and saw how the web
store was created.

    String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
    
    if (!crawlId.isEmpty()) {
      conf.set("schema.prefix", crawlId + "_");
    } else {
      conf.set("schema.prefix", "");
    }

This was the reason the map method in the CleanMapper was not called: the
web store was empty (it was using the one without a prefix, which didn't exist).

I have now added the crawl_id to nutch-site.xml, so the correct web store is
used when doing a solr clean.
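
As a rough illustration of the behaviour described above, here is a minimal,
self-contained sketch. It is not the actual StorageUtils code: a plain Map
stands in for the Hadoop Configuration, the key string "storage.crawl.id" is
assumed to be the value behind Nutch.CRAWL_ID_KEY, and tableName() is a
hypothetical helper that assumes the prefix is simply prepended to "webpage".

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the prefix logic quoted above. A plain Map replaces the Hadoop
// Configuration (an assumption) so the example runs without Nutch or Hadoop.
class SchemaPrefixSketch {

    // Assumed to be the value of Nutch.CRAWL_ID_KEY.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    // Mirrors the quoted snippet: an empty crawl id means no prefix.
    static String schemaPrefix(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    // Hypothetical helper: assumes the prefix is prepended to the base
    // table name "webpage" to select the web store that gets opened.
    static String tableName(Map<String, String> conf) {
        return schemaPrefix(conf) + "webpage";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(tableName(conf));   // no crawl id configured

        conf.put(CRAWL_ID_KEY, "mycrawl");     // "mycrawl" is a made-up id
        System.out.println(tableName(conf));
    }
}
```

Under these assumptions, a crawl run with an id writes to a prefixed store,
while a clean run without the id looks at the unprefixed one, which would
explain why the mapper saw no rows.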

Claudiu.




Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis,

Yep, I rebuilt nutch after applying the patch; I forgot to mention it.
I'll issue another patch that will update the 2 files.

Still debugging it though.

Thanks,
Claudiu.




Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Tue, Jul 30, 2013 at 3:29 PM, claudiuchis <cl...@gmail.com> wrote:
...
snip
...

> 4. I applied the patch
>
> cd /usr/local/nutch-2.2.1
> patch -p0 < NUTCH-1294-v2.patch
>
> The patch didn't update "src/bin/nutch" and "conf/log4j.properties" for
> some reason. I've updated these manually.
>

If you are able to add these to a new patch and attach it to the issue, it
would be greatly appreciated.

Did you generate a new job file (ant job)?



Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis,

It didn't work for me.

Here is what I did:

1. I set up a test web site on my local machine.

2. I crawled the site, removed one page, and crawled again.

3. Checked that the page I removed was indexed by Solr, and was flagged as
gone (status = 3) in the database (hbase)

hbase> scan 'webpage', {COLUMNS => ['f:st']}

localhost:http:3000/tests  column=f:st, timestamp=1375203394614, value=\x00\x00\x00\x03

4. I applied the patch

cd /usr/local/nutch-2.2.1
patch -p0 < NUTCH-1294-v2.patch

The patch didn't update "src/bin/nutch" and "conf/log4j.properties" for some
reason. I've updated these manually.

5. Ran the "solrclean" task in distributed mode:

$NUTCH_DEPLOY/bin/nutch solrclean http://localhost:8983/solr

*Expected result:* The "gone" document is removed from Solr.

*Actual result:* The document is still in Solr.
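
For reference, the f:st value from the scan in step 3 can be read as the
integer 3. The sketch below assumes the status is stored as a 4-byte
big-endian integer, which is what the \x00\x00\x00\x03 output suggests; the
class and method names are made up for illustration and are not part of Nutch.

```java
import java.nio.ByteBuffer;

// Sketch: interpret the four bytes the HBase shell prints for f:st as a
// big-endian 32-bit integer. The encoding is an assumption based on the
// \x00\x00\x00\x03 value in the scan output.
class StatusDecodeSketch {

    // Status value the thread uses for "gone" pages.
    static final int STATUS_GONE = 3;

    static int decodeStatus(byte[] raw) {
        if (raw.length != 4) {
            throw new IllegalArgumentException("expected 4 bytes, got " + raw.length);
        }
        return ByteBuffer.wrap(raw).getInt(); // ByteBuffer is big-endian by default
    }

    static boolean isGone(byte[] raw) {
        return decodeStatus(raw) == STATUS_GONE;
    }

    public static void main(String[] args) {
        byte[] fromScan = {0x00, 0x00, 0x00, 0x03};
        System.out.println(decodeStatus(fromScan) + " gone=" + isGone(fromScan));
    }
}
```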

*Additional information:* I enabled logging for SolrClean and its
dependencies:

log4j.logger.org.apache.nutch.indexer.solr.SolrClean=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleanerJob=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilters=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilter=INFO,cmdstdout

Then, I added a LOG.info("method-name"); line to each method in these 4
classes.
This way I found out that the map method in the IndexCleanerJob class was
never called, so no documents were processed.
I will look into why this is.

I am running:
 - hadoop 1.1.2 (one machine)
 - nutch 2.2.1 with patch NUTCH-1294-v2
 - hbase 0.90.4
 - Java 1.7.0_21

Thanks,
Claudiu.




Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I did not port it; I merely helped "a bit" :)
Dan Rosher was the driving force behind this one!
Thanks for any feedback.
Best


On Tue, Jul 30, 2013 at 11:23 AM, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis.
>
> Thank you for porting SolrClean to the 2.x branch.
> I'll apply the patch and let you know the outcome.
>
> Many thanks,
> Claudiu.



-- 
*Lewis*

Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis. 

Thank you for porting SolrClean to the 2.x branch.
I'll apply the patch and let you know the outcome.

Many thanks,
Claudiu.




Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
https://issues.apache.org/jira/browse/NUTCH-1294
It would be really great if you could try this out and comment on the
issue.
This is another tool we would then need to port to pluggable indexing.
hth
Lewis


On Tue, Jul 30, 2013 at 11:10 AM, claudiuchis <cl...@gmail.com> wrote:

> Hi folks,
>
> I have recently upgraded from nutch 1.6 to 2.2.1, and realized that
> SolrClean is not shipped with the newer version of nutch.
> I want to be able to delete links with status = 3 (404's) from Solr, and
> SolrClean did the job in nutch 1.x.
> Is there a way I can achieve the same result with nutch 2.x?
>
> Any help is much appreciated.
>
> Thanks,
> Claudiu.



-- 
*Lewis*