Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/08/01 00:40:28 UTC

Re: SolrClean not available in nutch 2.x

Hi Claudiu,
Can you please attach your new patch to the issue, if possible, so we can
try it out? I would be keen to get this into the codebase.
Thank you very much for getting back here.
Best
Lewis


On Wed, Jul 31, 2013 at 2:42 PM, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis,
>
> The SolrClean utility is working fine.
>
> The problem was on my side: I did the initial crawl with a crawl_id, but
> this id was not picked up when running solr clean (I didn't have it set in
> nutch-site.xml).
>
> I found this out when I looked at StorageUtils.java and saw how the web
> store was created:
>
>     String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
>
>     if (!crawlId.isEmpty()) {
>       conf.set("schema.prefix", crawlId + "_");
>     } else {
>       conf.set("schema.prefix", "");
>     }
>
> This is why the map method in CleanMapper was never called: the web store
> was empty (the job was using the one without a prefix, which didn't exist).
>
> I have now added the crawl_id to nutch-site.xml, so the correct web store
> is used when doing a solr clean.
>
> Claudiu.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrClean-not-available-in-nutch-2-x-tp4081385p4081757.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: SolrClean not available in nutch 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks.
Great job.


On Wed, Jul 31, 2013 at 5:45 PM, claudiuchis <cl...@gmail.com> wrote:

> Hi Lewis,
>
> I've created patch NUTCH-1294-v3.patch.
> Here are the steps I followed:
>
> $ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1
> $ cd release-2.2.1
> $ patch -p0 < NUTCH-1294-v2.patch
> # manually patched "src/bin/nutch" and "conf/log4j.properties"
> $ ant
> $ svn diff > NUTCH-1294-v3.patch
> # attached the new patch up on jira
>
> With this patch, all the files in the patch are deployed successfully. In
> the previous patch (v2), "src/bin/nutch" and "conf/log4j.properties" had to
> be patched manually.
>
> As I said, the task is working fine, i.e. documents with status = 3 are
> removed from Solr.
> The only caveat is that you need to set storage.crawl.id in nutch-site.xml
> if the crawling was done with a crawl_id, otherwise the solr clean task
> will not do anything.
>
> Thanks,
> Claudiu.
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrClean-not-available-in-nutch-2-x-tp4081385p4081790.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: SolrClean not available in nutch 2.x

Posted by Julien Nioche <li...@gmail.com>.
Hi Claudiu,

You definitely got the idea, but it is usually better to generate a patch
against the 'live' branch rather than a release tag, as the code could have
changed since the latest release. The live branch for 2.x is at

https://svn.apache.org/repos/asf/nutch/branches/2.x

Julien




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: SolrClean not available in nutch 2.x

Posted by claudiuchis <cl...@gmail.com>.
Hi Lewis,

I've created patch NUTCH-1294-v3.patch. 
Here are the steps I followed:

$ svn checkout http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1
$ cd release-2.2.1
$ patch -p0 < NUTCH-1294-v2.patch
# manually patched "src/bin/nutch" and "conf/log4j.properties"
$ ant
$ svn diff > NUTCH-1294-v3.patch
# attached the new patch up on jira

With this patch, all the files in the patch are deployed successfully. In
the previous patch (v2), "src/bin/nutch" and "conf/log4j.properties" had to
be patched manually.

As I said, the task is working fine, i.e. documents with status = 3 are
removed from Solr.
The only caveat is that you need to set storage.crawl.id in nutch-site.xml
if the crawling was done with a crawl_id, otherwise the solr clean task will
not do anything.
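
To illustrate the caveat above, here is a minimal sketch (not the actual
Nutch code) of how the store name ends up depending on storage.crawl.id. It
mirrors the prefix logic in StorageUtils.java, with java.util.Properties
standing in for the Hadoop Configuration. The class name, the "mycrawl" id
and the "webpage" base store name are assumptions made for the example.

```java
import java.util.Properties;

/**
 * Sketch of the crawl id prefix logic, using java.util.Properties in place
 * of the Hadoop Configuration that Nutch really uses.
 */
public class CrawlIdPrefixSketch {

    // Assumed to match Nutch.CRAWL_ID_KEY; it is the property that has to
    // be set in nutch-site.xml.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    // Hypothetical base store name, used only for illustration.
    static final String BASE_STORE = "webpage";

    /** Computes the schema prefix: "<crawlId>_" if a crawl id is set, else "". */
    public static String schemaPrefix(Properties conf) {
        String crawlId = conf.getProperty(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    /** Returns the store name a job would open with this configuration. */
    public static String storeName(Properties conf) {
        return schemaPrefix(conf) + BASE_STORE;
    }

    public static void main(String[] args) {
        // Configuration used for the crawl: crawl id given.
        Properties crawlConf = new Properties();
        crawlConf.setProperty(CRAWL_ID_KEY, "mycrawl");

        // Configuration used for solr clean: crawl id missing.
        Properties cleanConf = new Properties();

        System.out.println("crawl wrote to:   " + storeName(crawlConf));
        System.out.println("clean reads from: " + storeName(cleanConf));
    }
}
```

In this sketch the crawl configuration resolves to mycrawl_webpage while the
clean configuration resolves to webpage, so the two jobs look at different
stores unless the same crawl id is configured for both.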

Thanks,
Claudiu.






--
View this message in context: http://lucene.472066.n3.nabble.com/SolrClean-not-available-in-nutch-2-x-tp4081385p4081790.html
Sent from the Nutch - User mailing list archive at Nabble.com.