Posted to user@nutch.apache.org by Ulysses Rangel Ribeiro <ul...@gmail.com> on 2010/01/08 19:07:01 UTC

Purging from Nutch after indexing with Solr

I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some
questions regarding data redundancy with this setup.

Considering the following sample segment:

2.0G    content
196K    crawl_fetch
152K    crawl_generate
376K    crawl_parse
392K    parse_data
441M    parse_text

1. From what I have found through searches, "content" holds the raw fetched
content. Is there any problem if I remove it, i.e. does Nutch need it to
apply any sort of logic when re-crawling that content/URL?

2. The previous question also applies to parse_data and parse_text after I've
called nutch solrindex on that segment.

3. In sample scripts and tutorials I always see invertlinks being
called over all segments, but its output mentions merging. When I
fetch/parse new segments, can I call invertlinks only over them?

Thanks,

-- 
Ulysses Rangel Ribeiro

Re: Purging from Nutch after indexing with Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-09 10:18, MilleBii wrote:
> @Andrzej,
>
> To be more specific, if one uses cached content (which I do), what is the
> "minimal" stuff to keep? I guess:
> + crawl_fetch
> + parse_data
> + parse_text
>
> The rest is not used ... I guess. Before I start testing, could you confirm?

crawl_fetch you can ignore - it's just the status of fetching, which
should by that time already be integrated into crawldb (if you ran
updatedb).

It's content/ that you need in order to display a cached view.

>
> @Ulysses,
>
> Another reason to keep all the data is that you may need to reindex all
> segments, which does happen in development & test phases, though less so in
> production.

Right. Also, a common practice is to keep the raw data for a while just 
to make sure that the parsing and indexing went smoothly (in case you 
need to re-parse the raw content).
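
As a rough illustration of "keeping the raw data for a while", the Python
sketch below picks out segments that are older than a grace period. It is
not part of Nutch. The function name and the 7-day default are made up for
this example, and it assumes segment directories carry the usual timestamp
names of the form yyyyMMddHHmmss. It only lists candidates; it deletes
nothing.

```python
from datetime import datetime, timedelta

def segments_past_grace(segment_names, now, grace_days=7):
    """Return segment names older than the grace period (hypothetical helper).

    Assumes each segment name is a timestamp like 20100108190701.
    """
    cutoff = now - timedelta(days=grace_days)
    old = []
    for name in segment_names:
        try:
            created = datetime.strptime(name, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # not a timestamp-named directory, leave it alone
        if created < cutoff:
            old.append(name)
    return sorted(old)

names = ["20100101120000", "20100108190701", "notes"]
print(segments_past_grace(names, now=datetime(2010, 1, 10)))
```

Only the segments returned here would be candidates for dropping content/,
and only once you are satisfied that parsing and indexing went smoothly.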


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Purging from Nutch after indexing with Solr

Posted by MilleBii <mi...@gmail.com>.
@Andrzej,

To be more specific, if one uses cached content (which I do), what is the
"minimal" stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text

The rest is not used ... I guess. Before I start testing, could you confirm?

@Ulysses,

Another reason to keep all the data is that you may need to reindex all
segments, which does happen in development & test phases, though less so in
production.




-- 
-MilleBii-

Re: Purging from Nutch after indexing with Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:
> I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some
> questions regarding data redundancy with this setup.
>
> Considering the following sample segment:
>
> 2.0G    content
> 196K    crawl_fetch
> 152K    crawl_generate
> 376K    crawl_parse
> 392K    parse_data
> 441M    parse_text
>
> 1. From what I have found through searches, "content" holds the raw fetched
> content. Is there any problem if I remove it, i.e. does Nutch need it to
> apply any sort of logic when re-crawling that content/URL?

No, it is no longer needed, unless you want to provide a "cached"
view of the content.

>
> 2. The previous question also applies to parse_data and parse_text after I've
> called nutch solrindex on that segment.

Depends how you set up your search. If you search using NutchBean (i.e. 
the Nutch web application) then you need them. If you search using Solr, 
then you don't need them.
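
The answers so far can be summed up as a small decision rule. The Python
sketch below is only an illustration, not part of Nutch: the function name,
its parameters and the "solr"/"nutchbean" labels are made up here, and it
encodes only what is stated in this thread (content/ is needed for a cached
view, parse_data and parse_text are needed when searching through NutchBean,
crawl_fetch is redundant once updatedb has run).

```python
def removable_parts(search_backend, cached_view, updatedb_ran=False):
    """Return segment subdirectories that may be deleted (hypothetical helper).

    search_backend: "solr" or "nutchbean" (labels invented for this sketch)
    cached_view:    True if a "cached" copy of pages must be served
    updatedb_ran:   True if updatedb already merged fetch status into crawldb
    """
    if search_backend not in ("solr", "nutchbean"):
        raise ValueError("unknown search backend: %r" % (search_backend,))
    removable = []
    if not cached_view:
        removable.append("content")  # raw fetched pages
    if search_backend == "solr":
        removable += ["parse_data", "parse_text"]  # already indexed in Solr
    if updatedb_ran:
        removable.append("crawl_fetch")  # status is already in crawldb
    return removable

print(removable_parts("solr", cached_view=False))
```

This deliberately ignores re-parsing and re-indexing: if you may need to do
either, keep the corresponding data regardless of what the rule returns.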

>
> 3. In sample scripts and tutorials I always see invertlinks being
> called over all segments, but its output mentions merging. When I
> fetch/parse new segments, can I call invertlinks only over them?

Yes, invertlinks will incrementally merge the existing linkdb with new 
links from a new segment.
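
As a minimal sketch of what that looks like from a script, the Python helper
below builds an invertlinks command line that names only the new segments
instead of the whole segments directory. The helper name and the paths
(bin/nutch, crawl/linkdb, crawl/segments/...) are assumptions made for this
example; adjust them to your own layout. It only builds the argument list
and does not run anything.

```python
def invertlinks_command(linkdb, new_segments, nutch="bin/nutch"):
    """Build an invertlinks call covering only the given new segments."""
    if not new_segments:
        raise ValueError("no new segments to invert")
    # Form: nutch invertlinks <linkdb> <seg1> <seg2> ...
    return [nutch, "invertlinks", linkdb] + list(new_segments)

cmd = invertlinks_command("crawl/linkdb", ["crawl/segments/20100108190701"])
print(" ".join(cmd))
# prints: bin/nutch invertlinks crawl/linkdb crawl/segments/20100108190701
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the
paths match your installation.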

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com