Posted to solr-user@lucene.apache.org by Daniel Holmes <no...@gmail.com> on 2015/09/28 11:19:33 UTC

What kind of nutch documents does Solr index?

Hi,
I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing. In
my tests there is a gap between the number of results fetched by Nutch and
the number of documents indexed in Solr. For example, one crawl fetched
23343 pages and 1146 images successfully, while Solr indexed 19250 docs,
500 of which are image URLs.

My question is: what kind of pages are indexed in Solr, and why?
Does Solr index pages with other statuses or not?
What kind of images does Solr index?

Thanks.

Re: What kind of nutch documents does Solr index?

Posted by NutchDev <nu...@gmail.com>.
After Nutch fetches documents from the server, it passes them to the
parser, which detects each document's type and parses it accordingly.

One possibility is that the parser failed to parse some documents, and
that is why you are getting a count mismatch.




Re: What kind of nutch documents does Solr index?

Posted by Daniel Holmes <no...@gmail.com>.
Thank you Upayavira for your answer. In the case I described, maxDoc is 19263.
As far as I can tell, the default indexing filter in Nutch is the basic indexing
filter, and it also has a property to delete gone and permanently
redirected pages, whose value was false for me.
I think the problem still remains on the Solr side.
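
To make the numbers concrete, here is a rough sketch of the arithmetic using only the figures quoted in this thread. It assumes the 23343 pages and 1146 images are separate counts and that 19250 is the numDocs value; neither is confirmed above. Note also that maxDoc minus numDocs only counts deleted documents still held in the index, so after segment merges it can understate how many documents were overwritten.

```python
# Figures quoted in this thread (their interpretation is an assumption).
fetched_pages = 23343
fetched_images = 1146
num_docs = 19250   # documents indexed in Solr, taken here as numDocs
max_doc = 19263    # maxDoc reported for the same index

# Total fetched, assuming pages and images are counted separately.
fetched_total = fetched_pages + fetched_images

# Documents fetched by Nutch that are not searchable in Solr.
missing = fetched_total - num_docs

# Documents still marked as deleted/overwritten in the index.
overwritten_or_deleted = max_doc - num_docs

# Part of the gap that overwrites alone do not account for.
unexplained = missing - overwritten_or_deleted

print(fetched_total, missing, overwritten_or_deleted, unexplained)
```

Under those assumptions, overwritten documents account for only a small part of the gap.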


On Mon, Sep 28, 2015 at 3:03 PM, Upayavira <uv...@odoko.co.uk> wrote:

> I suspect you may be better off asking this on the Nutch user list. The
> decisions you are describing will be within the Nutch codebase, not
> Solr. Someone here may know (hopefully) but you may get more support
> over on the Nutch list.
>
> One suggestion: start with a clean, empty index. Run a crawl. Look at
> the maxDocs vs numDocs (visible via the admin UI for your
> core/collection). If maxDocs > numDocs, it means that some docs have been
> overwritten, i.e. the ID field that Nutch is using is not unique.
>
> Upayavira
>

Re: What kind of nutch documents does Solr index?

Posted by Upayavira <uv...@odoko.co.uk>.
I suspect you may be better off asking this on the Nutch user list. The
decisions you are describing will be within the Nutch codebase, not
Solr. Someone here may know (hopefully) but you may get more support
over on the Nutch list.

One suggestion: start with a clean, empty index. Run a crawl. Look at
the maxDocs vs numDocs (visible via the admin UI for your
core/collection). If maxDocs > numDocs, it means that some docs have been
overwritten, i.e. the ID field that Nutch is using is not unique.
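
If you would rather script this check than read it off the admin UI, the following is a minimal sketch. It assumes you have already fetched the JSON output of Solr's Luke request handler (for example from /solr/<core>/admin/luke?numTerms=0&wt=json; the host and core name depend on your setup) and that the response carries numDocs and maxDoc under an "index" key. The helper name index_counts and the sample response body are made up for illustration.

```python
import json


def index_counts(luke_response_text):
    """Return (num_docs, max_doc, gap) from a Luke handler JSON response.

    gap = maxDoc - numDocs, i.e. documents that are deleted or were
    overwritten but are still physically present in the index.
    """
    index = json.loads(luke_response_text)["index"]
    num_docs = index["numDocs"]
    max_doc = index["maxDoc"]
    return num_docs, max_doc, max_doc - num_docs


# Hypothetical response body, trimmed to the two fields used here.
sample = '{"index": {"numDocs": 19250, "maxDoc": 19263}}'

num_docs, max_doc, gap = index_counts(sample)
print(num_docs, max_doc, gap)
```

A gap above zero on a freshly built index would point to documents sharing the same ID.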

Upayavira
