Posted to dev@nutch.apache.org by Prashant Shekar <sh...@gmail.com> on 2014/11/26 19:33:04 UTC

Query about indexing crawled data from Nutch to Solr

Hi,

I have a question about how raw crawled data from Nutch is indexed into
Solr. We crawled the Acadis dataset using Nutch and it retrieved 47,580
files. However, when indexing these files into Solr, only 2,929 of them
were actually indexed. I have two questions:

1) What could explain why only 2,929 of the 47,580 files were actually
indexed in Solr? Does Solr do some deduplication on its end that Nutch does
not?

2) When I counted the unique URLs, I found 12,201 of them. We used the URL
as the key for Solr indexing. So, if there were no errors while indexing to
Solr, can the number of indexed files still be less than 12,201?

Thanks,
Prasanth Iyer
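
On question 2: Solr replaces an existing document when a new one arrives
with the same uniqueKey value, so with the URL as the key the index cannot
hold more documents than there are unique URLs, and it can hold fewer if
some URLs never reach the indexer. Below is a minimal sketch of that
overwrite behaviour in plain Python. It models the index as a dict and does
not talk to Solr; `simulate_index` and the example.org URLs are made up for
illustration.

```python
def simulate_index(docs, key="url"):
    """Model a key-based index: a later document with the same key
    replaces the earlier one (hypothetical helper, not a Solr API)."""
    index = {}
    for doc in docs:
        index[doc[key]] = doc
    return index


# Three documents sent for indexing, but only two distinct URLs.
docs = [
    {"url": "http://example.org/a", "content": "first fetch of a"},
    {"url": "http://example.org/a", "content": "second fetch of a"},
    {"url": "http://example.org/b", "content": "only fetch of b"},
]

index = simulate_index(docs)
print(len(docs), "documents sent,", len(index), "documents in the index")
```

The same reasoning applies to the real numbers: 12,201 unique URLs is an
upper bound on the indexed count, not a guarantee.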

Re: Query about indexing crawled data from Nutch to Solr

Posted by Alfonso Nishikawa <al...@gmail.com>.

Hi, Prasanth.

Thanks for the information. Sadly, I only know 2.2.1 / 2.3-SNAPSHOT, and
although I don't know Solr, I was looking for any other kind of bug,
especially in 2.3-SNAPSHOT.

I hope someone here on the list can help you more than I can :)

Thanks!

Alfonso

Re: Query about indexing crawled data from Nutch to Solr

Posted by Prashant Shekar <sh...@gmail.com>.

Hi Alfonso,

I was using Nutch 1.9. I did check the crawldb stats and this is what it
says:

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:    20387
retry 0:    19483
retry 1:    845
retry 2:    59
min score:    0.0
avg score:    1.12277434E-4
max score:    1.012
status 1 (db_unfetched):    1785
*status 2 (db_fetched):    3542*
*status 3 (db_gone):    10705*
status 4 (db_redir_temp):    3674
status 5 (db_redir_perm):    681
CrawlDb statistics: done

So I believe only 3,542 files were actually fetched. Let me check why the
site is not allowing us to crawl the rest of the files.

Thanks for all your help,
Prasanth Iyer
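
The arithmetic behind the stats above can be checked with a short script.
This is only a sketch: the `STATS` string is copied from the status lines in
this message, and `parse_status_counts` and `parse_total` are hypothetical
helpers, not part of Nutch. It confirms that the per-status counts add up
to TOTAL urls and shows what share of the CrawlDb is db_fetched. Assuming
only fetched pages can reach the indexer, that share is a ceiling on what
Solr can receive.

```python
import re

# Status lines copied from the CrawlDb statistics above.
STATS = """\
TOTAL urls:    20387
status 1 (db_unfetched):    1785
status 2 (db_fetched):    3542
status 3 (db_gone):    10705
status 4 (db_redir_temp):    3674
status 5 (db_redir_perm):    681
"""


def parse_status_counts(text):
    """Return {status_name: count} for every 'status N (name): count' line."""
    counts = {}
    for line in text.splitlines():
        # Optional '*' allows for the emphasis markers used in the email.
        m = re.match(r"\*?status \d+ \((\w+)\):\s+(\d+)\*?$", line.strip())
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts


def parse_total(text):
    """Return the integer on the 'TOTAL urls:' line."""
    return int(re.search(r"TOTAL urls:\s+(\d+)", text).group(1))


counts = parse_status_counts(STATS)
total = parse_total(STATS)
fetched_share = counts["db_fetched"] / total

print("statuses sum to total:", sum(counts.values()) == total)
print("db_fetched share: {:.1%}".format(fetched_share))
print("db_gone share: {:.1%}".format(counts["db_gone"] / total))
```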

Re: Query about indexing crawled data from Nutch to Solr

Posted by Alfonso Nishikawa <al...@gmail.com>.

Hi, Prashant,

What version of Nutch are you using?

Regards,

Alfonso Nishikawa
