Posted to user@nutch.apache.org by Matt Timion <ad...@honda-search.com> on 2006/07/21 19:10:35 UTC
Why would a record be in the database but not show up in the results?
Does anyone have an idea why a record would be in the database but not show up in the results?
I have 400+ pages from a certain domain in my database (checked using bin/nutch admin), yet when I search for the domain, titles of certain pages from the domain, or unique URLs from the domain, no results come up.
I was thinking it might be the regex-urlfilter, but if the pages are already in the database, wouldn't that rule out regex-urlfilter as the culprit?
BTW, all of my URLs are fetched by creating fetchlists using FreeFetchlistTool.
Re: Why would a record be in the database but not show up in the results?
Posted by Thomas Delnoij <di...@gmail.com>.
Matt,
It's the index that is used for searching, not the webdb.
What is the status of these pages in the webdb? Likely they have not
been fetched yet (DB_UNFETCHED), and thus can never be in your index.
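One way to check this is to compare the URLs the webdb knows about with the URLs that actually made it into the index. The following is only a rough Python sketch, not part of Nutch. It assumes that each page record in the text dump contains a line of the form "URL: <url>" (adjust the prefix if your dump looks different), and that you already have a list of indexed URLs from somewhere. The function names and the sample data are made up for illustration.

```python
from urllib.parse import urlparse


def urls_for_domain(dump_lines, domain):
    """Collect URLs under `domain` from dump lines shaped like 'URL: <url>'.

    The 'URL:' prefix is an assumption about the dump layout, not a
    documented format.
    """
    found = set()
    for line in dump_lines:
        line = line.strip()
        if not line.startswith("URL:"):
            continue
        url = line[len("URL:"):].strip()
        host = urlparse(url).hostname or ""
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            found.add(url)
    return found


def missing_from_index(db_urls, indexed_urls):
    """Return the URLs that are in the database but not in the index."""
    return sorted(set(db_urls) - set(indexed_urls))


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real dump and a real index.
    sample_dump = [
        "URL: http://www.example.com/a.html",
        "URL: http://www.example.com/b.html",
        "URL: http://other.example.org/c.html",
    ]
    sample_indexed = ["http://www.example.com/a.html"]
    in_db = urls_for_domain(sample_dump, "example.com")
    for url in missing_from_index(in_db, sample_indexed):
        print("in db but not indexed:", url)
```

Any URL this reports was added to the database but never made it into the index, which would be consistent with the page not having been fetched yet.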
These articles give a very nice basic explanation of the different concepts:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
HTH Thomas
On 7/22/06, Matt Timion <ad...@honda-search.com> wrote:
> Asking again hoping that someone can help me out.
>
> I have a number of pages from a certain domain in my database. I can verify
> this when I use the command:
>
> bin/nutch admin crawl/db -textdump text
>
> I then look at the text.pages file and it has nearly 800 pages from that
> domain in my database.
>
> Yet when I search for content from that domain, nothing comes up. Can anyone
> tell me why this would happen?
>
>
Why would a record be in the database but not show up in the results?
Posted by Matt Timion <ad...@honda-search.com>.
Asking again hoping that someone can help me out.
I have a number of pages from a certain domain in my database. I can verify
this when I use the command:
bin/nutch admin crawl/db -textdump text
I then look at the text.pages file and it has nearly 800 pages from that
domain in my database.
Yet when I search for content from that domain, nothing comes up. Can anyone
tell me why this would happen?