Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/07/21 11:38:25 UTC

tutorial work thru (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

Working through the tutorial for v1 of Nutch.
urls/seed.txt contains
https://the.website.mil/inside/

regex-urlfilter.txt contains edits...

# accept anything else
#+.

# limit to the.website.mil
+^https://([a-z0-9]*\.)the.website.mil/inside

Yet nothing gets populated in the crawl db...

bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-07-21 07:32:02
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 1
Injector: Total number of urls after normalization: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 0
Injector: Total new urls injected: 0
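
One way to see why the injector counts the seed as rejected is to test the posted rule against the seed URL outside Nutch. The sketch below does that in Python. Nutch's regex URL filter is built on Java regular expressions, but for a pattern this simple the two engines should behave the same way. The helper name `accepts` and the second rule (sub-domain group made optional with a trailing `*`, literal dots escaped) are illustrative assumptions, not something taken from the message.

```python
import re

# Seed URL and accept rule as quoted in the message (leading '+' dropped,
# since that is the accept marker in regex-urlfilter.txt, not part of the regex).
SEED = "https://the.website.mil/inside/"
RULE_AS_POSTED = r"^https://([a-z0-9]*\.)the.website.mil/inside"

# Hypothetical variant for comparison: the sub-domain group may repeat zero
# or more times, and the dots in the host name are escaped.
RULE_OPTIONAL_GROUP = r"^https://([a-z0-9]*\.)*the\.website\.mil/inside"


def accepts(rule: str, url: str) -> bool:
    """Return True if the rule finds a match anywhere in the URL."""
    return re.search(rule, url) is not None


if __name__ == "__main__":
    # The posted rule demands exactly one "something." before "the.website.mil".
    print(accepts(RULE_AS_POSTED, SEED))                                  # False
    print(accepts(RULE_AS_POSTED, "https://www.the.website.mil/inside/"))  # True
    # With the group optional, the bare host matches as well.
    print(accepts(RULE_OPTIONAL_GROUP, SEED))                             # True
```

If the posted rule is the only accept rule left in regex-urlfilter.txt, a seed with no sub-domain in front of the.website.mil has nothing to match, which would be consistent with "rejected by filters: 1" in the log above.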

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~



RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)

Posted by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil>.

Damn email policy.
You get it.

Thanks,
Kris



-----Original Message-----
From: Musshorn, Kris T CTR USARMY RDECOM ARL (US) [mailto:kris.t.musshorn.ctr@mail.mil] 
Sent: Thursday, July 21, 2016 8:02 AM
To: user@nutch.apache.org
Subject: RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)

All active links contained in this email were disabled.  Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser.  




----


Clarification...
Seed.txt contains
Caution-https://the.website.mil/inside/
not 
Caution-Caution-https://the.website.mil/inside/


Thanks,
Kris


RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)

Posted by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil>.

Clarification...
urls/seed.txt contains
https://the.website.mil/inside/
not 
Caution-https://the.website.mil/inside/


Thanks,
Kris
