Posted to dev@nutch.apache.org by Shay Lawless <se...@gmail.com> on 2006/12/06 16:31:39 UTC

Full List of Metadata Fields

Hi all,

I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
0.5.0-200611082313) to index a collection of ARC files generated by a web
crawl using the Heritrix web crawler (Version 1.4.0).

When I check the metadata tag on the Wera front-end, the following list of
tags is displayed:

ARC Identifier
URL
Time of Archival
Last Modified Time
Mime-Type
File Status
Content Checksum
HTTP Header

When I click on the explain link in the NutchWax front-end, the following
list of tags is displayed:

Segment
Digest
Date
ARCDate
Encoding
Collection
ARCName
ARCOffset
ContentLength
PrimaryType
subType
URL
Title
Boost

Is there a full list of the metadata fields that NutchWax/Nutch creates when
indexing? I'm particularly interested in tags relating to the actual content
of each page, e.g. content type, description, etc.

When searching, does NutchWax/Nutch search across such tags, or just across
the parsed text of each page for occurrences of keywords?

Any help you can provide would be greatly appreciated!

Shay

Nutch Re-crawl same file over and over again

Posted by "Armel T. Nene" <ar...@idna-solutions.com>.
Hi,

I have set up Nutch to crawl my local filesystem, with topN set to 20 and
depth set to 2. But when Nutch re-crawls, it re-crawls the same files over
and over again. The directory doesn't contain any sub-directories. Can
someone let me know what might be the cause? There are more than 20 files in
the directory, so why is Nutch only getting the same twenty files?

Thanks,

Armel


-----Original Message-----
From: Michael Stack [mailto:stack@archive.org] 
Sent: 06 December 2006 16:04
To: Shay Lawless
Cc: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org;
archive-access-discuss@lists.sourceforge.net
Subject: Re: [Archive-access-discuss] Full List of Metadata Fields

Hey Shay.

Some friendly advice.  Cross-posting a question will make you unpopular 
fast.  It's best to start on the most appropriate-seeming list and only 
move on from there if you are getting no satisfaction.  The below 
question looks best at home over on the archive-access list.  Let me 
have a go at answering it there.

Yours,
St.Ack 






Re: [Archive-access-discuss] Full List of Metadata Fields

Posted by Michael Stack <st...@archive.org>.
Hey Shay.

Some friendly advice.  Cross-posting a question will make you unpopular 
fast.  It's best to start on the most appropriate-seeming list and only 
move on from there if you are getting no satisfaction.  The below 
question looks best at home over on the archive-access list.  Let me 
have a go at answering it there.

Yours,
St.Ack 


Shay Lawless wrote:
> Hi all,
>
> I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 
> 0.5.0-200611082313) to Index a collection of ARC files generated by a 
> web crawl using the Heritrix web crawler (Version 1.4.0).
>
> When I check the metadata tag on the Wera front-end, the following list 
> of tags is displayed:
>
> ARC Identifier
> URL
> Time of Archival
> Last Modified Time
> Mime-Type
> File Status
> Content Checksum
> HTTP Header
>
> When I click on the explain link in the NutchWax front-end, the 
> following list of tags is displayed:
>
> Segment
> Digest
> Date
> ARCDate
> Encoding
> Collection
> ARCName
> ARCOffset
> ContentLength
> PrimaryType
> subType
> URL
> Title
> Boost
>
> Is there a full list of the metadata fields that NutchWax/Nutch 
> creates when indexing? I'm particularly interested in tags relating to 
> the actual content of each page, e.g. content type, description, etc.
>
> When searching, does NutchWax/Nutch search across such tags, or just 
> across the parsed text of each page for occurrences of keywords?
>
> Any help you can provide would be greatly appreciated!
>
> Shay