Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/07/21 11:38:25 UTC
tutorial work thru (UNCLASSIFIED)
CLASSIFICATION: UNCLASSIFIED
Working through the tutorial for Nutch v1.
urls/seed.txt contains
https://the.website.mil/inside/
regex-urlfilter.txt contains edits...
# accept anything else
#+.
# limit to the.website.mil
+^https://([a-z0-9]*\.)the.website.mil/inside
Yet nothing gets populated in the crawl db...
bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-07-21 07:32:02
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 1
Injector: Total number of urls after normalization: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 0
Injector: Total new urls injected: 0
Thanks,
Kris
~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.
US Army Research Lab
Aberdeen Proving Ground
Application Management & Development Branch
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~
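A possible reason for the rejection, not confirmed anywhere in this thread: the group ([a-z0-9]*\.) in the posted rule is not optional, so the rule only matches hosts that have something ending in a dot in front of the.website.mil. With the catch-all "+." commented out, a URL that matches no rule is rejected. Nutch's regex-urlfilter.txt rules are Java regular expressions; the sketch below uses Python's re module instead, which should behave the same for a pattern this simple. The RELAXED pattern, the www host, and the accepts helper are illustrative assumptions, not taken from the thread or from Nutch.

```python
import re

# Rule as posted, minus the leading "+" (which only marks it as an accept rule).
POSTED = r"^https://([a-z0-9]*\.)the.website.mil/inside"

# Hypothetical variant: the subdomain group may repeat zero or more times,
# and the literal dots in the host name are escaped.
RELAXED = r"^https://([a-z0-9-]+\.)*the\.website\.mil/inside"

SEED = "https://the.website.mil/inside/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL (unanchored search)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for name, pattern in (("posted", POSTED), ("relaxed", RELAXED)):
        for url in (SEED, "https://www.the.website.mil/inside/"):
            print(name, url, accepts(pattern, url))
```

If this is indeed the cause, the posted rule would accept a seed on a subdomain such as the hypothetical www host above, but not the bare host in seed.txt.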
RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)
Posted by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil>.
Damn email policy: the mail gateway keeps prepending "Caution-" to the URLs.
You get it.
Thanks,
Kris
-----Original Message-----
From: Musshorn, Kris T CTR USARMY RDECOM ARL (US) [mailto:kris.t.musshorn.ctr@mail.mil]
Sent: Thursday, July 21, 2016 8:02 AM
To: user@nutch.apache.org
Subject: RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)
All active links contained in this email were disabled. Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser.
----
Clarification...
Seed.txt contains
Caution-https://the.website.mil/inside/
not
Caution-Caution-https://the.website.mil/inside/
Thanks,
Kris
RE: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)
Posted by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil>.
Clarification...
Seed.txt contains
https://the.website.mil/inside/
not
Caution-https://the.website.mil/inside/
Thanks,
Kris
-----Original Message-----
From: Musshorn, Kris T CTR USARMY RDECOM ARL (US) [mailto:kris.t.musshorn.ctr@mail.mil]
Sent: Thursday, July 21, 2016 7:38 AM
To: user@nutch.apache.org
Subject: [Non-DoD Source] tutorial work thru (UNCLASSIFIED)
Working thru the tutorial for v1 of nutch.
urls/seed.txt contains
Caution-https://the.website.mil/inside/
regex-urlfilter.txt contains edits...
# accept anything else
#+.
# limit to the.website.mil
+^Caution-https://([a-z0-9]*\.)the.website.mil/inside
Yet nothing gets populated in the crawl db...