Posted to user@nutch.apache.org by Matt Timion <ad...@honda-search.com> on 2006/07/21 19:10:35 UTC

Why would a record be in the database but not show up in the results?

Does anyone have an idea why a record would be in the database but not show up in the results?

I have 400+ pages from a certain domain in my database (checked using bin/nutch admin), yet when I search for the domain, for titles of certain pages from the domain, or for unique URLs from the domain, no results come up.

I was thinking it might be the regex-urlfilter, but if the pages are already in the database, wouldn't that rule out regex-urlfilter as the culprit?
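
For reference, here is a rough sketch of how a regex-urlfilter style rule list decides on a URL, assuming the usual convention that each rule is a '+' or '-' followed by a regular expression, that the first matching rule wins, and that a URL matching no rule is rejected. The rules and the example.com domain below are made up for illustration, and Python's regex dialect is only an approximation of the Java one Nutch uses.

```python
import re

def load_rules(text):
    """Parse filter rules: one per line, '+' (accept) or '-' (reject) then a regex."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is rejected

# Hypothetical rule set, not taken from the poster's configuration.
EXAMPLE_RULES = load_rules(r"""
# skip image suffixes
-\.(gif|jpg|png)$
# accept anything under the hypothetical domain example.com
+^http://([a-z0-9]*\.)*example\.com/
""")

if __name__ == "__main__":
    for url in ("http://www.example.com/page.html",
                "http://www.example.com/logo.gif",
                "http://other.org/"):
        print(url, accepts(EXAMPLE_RULES, url))
```

Running a few of the missing URLs through a check like this would at least show whether the filter rules themselves would reject them.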

BTW, all of my URLs are fetched by creating fetchlists using FreeFetchlistTool.

Re: Why would a record be in the database but not show up in the results?

Posted by Thomas Delnoij <di...@gmail.com>.

Matt,

it's the index that is used for searching, not the webdb.

What is the status of these pages in the webdb? Most likely they have not
been fetched yet (DB_UNFETCHED), and thus cannot be in your index.
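
To make the distinction concrete, here is a minimal sketch that lists which URLs are known to the database but absent from the index. It assumes you have already collected both sets of URLs as plain lists; how you extract them (for example from the text dump and from the index) is not shown, and the function name and example.com URLs are hypothetical.

```python
from urllib.parse import urlparse

def missing_from_index(db_urls, indexed_urls, domain):
    """Return the URLs under `domain` that are in the database but not in the index."""
    indexed = set(indexed_urls)
    missing = []
    for url in set(db_urls):
        host = urlparse(url).hostname or ""
        # match the domain itself or any of its subdomains
        in_domain = host == domain or host.endswith("." + domain)
        if in_domain and url not in indexed:
            missing.append(url)
    return sorted(missing)

if __name__ == "__main__":
    # Hypothetical data: three pages known to the db, only one indexed.
    db = ["http://www.example.com/a.html",
          "http://www.example.com/b.html",
          "http://other.org/c.html"]
    index = ["http://www.example.com/a.html"]
    print(missing_from_index(db, index, "example.com"))
```

Any URL reported here is known to the database but was never fetched and indexed, so a search cannot return it.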

These articles give a very nice basic explanation of the different concepts:

http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

HTH Thomas

On 7/22/06, Matt Timion <ad...@honda-search.com> wrote:
> Asking again hoping that someone can help me out.
>
> I have a number of pages from a certain domain in my database.  I can verify
> this when I use the command:
>
> bin/nutch admin crawl/db -textdump text
>
> I then look at the text.pages file and it has nearly 800 pages from that
> domain in my database.
>
> yet when I search for content from that domain nothing comes up.  Can anyone
> tell me why this would happen?
>
>

Why would a record be in the database but not show up in the results?

Posted by Matt Timion <ad...@honda-search.com>.

Asking again hoping that someone can help me out.

I have a number of pages from a certain domain in my database.  I can verify 
this when I use the command:

bin/nutch admin crawl/db -textdump text

I then look at the text.pages file and it has nearly 800 pages from that 
domain in my database.

yet when I search for content from that domain nothing comes up.  Can anyone 
tell me why this would happen?