Posted to user@nutch.apache.org by Ulysses Rangel Ribeiro <ul...@gmail.com> on 2010/01/08 19:07:01 UTC
Purging from Nutch after indexing with Solr
I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with
some questions about data redundancy in this setup.
Considering the following sample segment:
2.0G content
196K crawl_fetch
152K crawl_generate
376K crawl_parse
392K parse_data
441M parse_text
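(For reference, a per-subdirectory listing like the one above can be produced with du; the segment path below is made up for illustration, and a mock layout stands in for a real crawl:)

```shell
#!/bin/sh
# Mock segment layout; on a live crawl you would point du at
# crawl/segments/<timestamp> instead.
SEG=crawl/segments/20100108190701
mkdir -p "$SEG/content" "$SEG/crawl_fetch" "$SEG/crawl_generate" \
         "$SEG/crawl_parse" "$SEG/parse_data" "$SEG/parse_text"

# One human-readable size line per segment subdirectory.
du -sh "$SEG"/*
```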
1. From what I have found through searches, "content" holds the raw fetched
content. Is there any problem if I remove it, i.e., does Nutch need it to
apply any sort of logic when re-crawling that content/URL?
2. Does the previous question also apply to parse_data and parse_text after
I've called nutch solrindex on that segment?
3. In sample scripts and tutorials I always see invertlinks being called
over all segments, but its output mentions merging. When I fetch/parse new
segments, can I call invertlinks only over them?
Thanks,
--
Ulysses Rangel Ribeiro
Re: Purging from Nutch after indexing with Solr
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-09 10:18, MilleBii wrote:
> @Andrzej,
>
> To be more specific: if one uses cached content (which I do), what is the
> "minimal" stuff to keep? I guess:
> + crawl_fetch
> + parse_data
> + parse_text
>
> the rest is not used, I guess. Before I start testing, could you confirm?
crawl_fetch you can ignore - it's just the status of fetching, which
should by that time already be integrated into the crawldb (if you ran
updatedb).
It's the content/ directory that you need to display a cached view.
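As a sketch, trimming a segment down to what a cached-view setup needs would look like this; the segment path is hypothetical, a mock layout stands in for a real crawl, and this assumes updatedb has already run:

```shell
#!/bin/sh
# Mock segment with all six subdirectories; a real one lives under
# crawl/segments/<timestamp>.
SEG=crawl/segments/20100109000000
mkdir -p "$SEG/content" "$SEG/crawl_fetch" "$SEG/crawl_generate" \
         "$SEG/crawl_parse" "$SEG/parse_data" "$SEG/parse_text"

# Once updatedb has folded the fetch status into the crawldb, the
# per-segment bookkeeping directories can go; content/ stays because
# the cached view is served from it.
rm -rf "$SEG/crawl_fetch" "$SEG/crawl_generate" "$SEG/crawl_parse"

ls "$SEG"   # content, parse_data, parse_text remain
```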
>
> @Ulysse,
>
> The other reason to keep all data is if you will need to reindex all
> segments, which does happen in development & test phases, less in
> production though.
Right. Also, a common practice is to keep the raw data for a while just
to make sure that the parsing and indexing went smoothly (in case you
need to re-parse the raw content).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Purging from Nutch after indexing with Solr
Posted by MilleBii <mi...@gmail.com>.
@Andrzej,
To be more specific: if one uses cached content (which I do), what is the
"minimal" stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text
the rest is not used, I guess. Before I start testing, could you confirm?
@Ulysse,
The other reason to keep all data is if you will need to reindex all
segments, which does happen in development & test phases, less in
production though.
2010/1/8 Andrzej Bialecki <ab...@getopt.org>
> On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:
>
>> I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with
>> some questions about data redundancy in this setup.
>>
>> Considering the following sample segment:
>>
>> 2.0G content
>> 196K crawl_fetch
>> 152K crawl_generate
>> 376K crawl_parse
>> 392K parse_data
>> 441M parse_text
>>
>> 1. From what I have found through searches, "content" holds the raw fetched
>> content. Is there any problem if I remove it, i.e., does Nutch need it to
>> apply any sort of logic when re-crawling that content/URL?
>>
>
> No, it is no longer needed, unless you want to provide a "cached" view
> of the content.
>
>
>
>> 2. Does the previous question also apply to parse_data and parse_text
>> after I've called nutch solrindex on that segment?
>>
>
> It depends on how you set up your search. If you search using NutchBean
> (i.e. the Nutch web application) then you need them. If you search using
> Solr, then you don't need them.
>
>
>
>> 3. In sample scripts and tutorials I always see invertlinks being called
>> over all segments, but its output mentions merging. When I fetch/parse new
>> segments, can I call invertlinks only over them?
>>
>
> Yes, invertlinks will incrementally merge the existing linkdb with new
> links from a new segment.
>
> --
> Best regards,
> Andrzej Bialecki <><
>
>
--
-MilleBii-
Re: Purging from Nutch after indexing with Solr
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:
> I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with
> some questions about data redundancy in this setup.
>
> Considering the following sample segment:
>
> 2.0G content
> 196K crawl_fetch
> 152K crawl_generate
> 376K crawl_parse
> 392K parse_data
> 441M parse_text
>
> 1. From what I have found through searches, "content" holds the raw fetched
> content. Is there any problem if I remove it, i.e., does Nutch need it to
> apply any sort of logic when re-crawling that content/URL?
No, it is no longer needed, unless you want to provide a "cached"
view of the content.
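(A minimal sketch of that cleanup, assuming the segments have already been indexed with solrindex and no cached view is needed; the paths and mock layout are made up for illustration:)

```shell
#!/bin/sh
# Mock layout; a real crawl has one segment directory per fetch cycle.
mkdir -p mycrawl/segments/20100110000000/content \
         mycrawl/segments/20100110000000/parse_data \
         mycrawl/segments/20100110000000/parse_text

# Drop only the raw fetched content; parse_data and parse_text stay
# so the segment can still be re-indexed with solrindex if needed.
for seg in mycrawl/segments/*; do
  rm -rf "$seg/content"
done

ls mycrawl/segments/20100110000000   # parse_data and parse_text remain
```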
>
> 2. Does the previous question also apply to parse_data and parse_text
> after I've called nutch solrindex on that segment?
It depends on how you set up your search. If you search using NutchBean
(i.e. the Nutch web application) then you need them. If you search using
Solr, then you don't need them.
>
> 3. In sample scripts and tutorials I always see invertlinks being called
> over all segments, but its output mentions merging. When I fetch/parse new
> segments, can I call invertlinks only over them?
Yes, invertlinks will incrementally merge the existing linkdb with new
links from a new segment.
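(So an incremental update can name just the new segment instead of using -dir over all of them. A sketch, where the linkdb and segment paths are hypothetical and the command is only echoed rather than executed:)

```shell
#!/bin/sh
# Full-rebuild style seen in tutorials: every segment at once.
#   bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# Incremental style: only the freshly fetched/parsed segment; the new
# inlinks are merged into the existing linkdb.
LINKDB=crawl/linkdb
NEWSEG=crawl/segments/20100111000000

CMD="bin/nutch invertlinks $LINKDB $NEWSEG"
echo "$CMD"   # run with: eval "$CMD" once Nutch is on the path
```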
--
Best regards,
Andrzej Bialecki <><