Posted to user@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/03/31 14:38:08 UTC

Re: [MASSMAIL]Re: [MASSMAIL]Re: website structure discovery?

Inline response,

----- Original Message -----
From: "Scott Lundgren" <sl...@qsfllc.com>
To: user@nutch.apache.org
Sent: Monday, March 30, 2015 5:14:31 PM
Subject: [MASSMAIL]Re: [MASSMAIL]Re: website structure discovery?

I’m using url-regexfilter.txt not only to keep Nutch from leaving a site that’s in seed.txt, but also to keep Nutch tightly focused on the URLs within the seed that I want it to crawl. For example, with http://bizjournals.com/triangle in seed.txt, I want to crawl http://www.bizjournals.com/triangle/news but not http://www.bizjournals.com/triangle/jobs/, http://www.bizjournals.com/triangle/calendar/ or http://www.bizjournals.com/triangle/people/.
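For reference, restrictions like that are typically expressed in Nutch's regex URL filter (conf/regex-urlfilter.txt in a stock 1.x install), where the first matching rule wins. A minimal sketch using the example paths above; the exact patterns would need tuning:

    # accept the news section of the triangle site
    +^https?://(www\.)?bizjournals\.com/triangle/news
    # reject the sections we don't care about
    -^https?://(www\.)?bizjournals\.com/triangle/(jobs|calendar|people)/
    # reject everything else
    -.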

I understand your use case, but the question here is whether you have some heuristic or deterministic way of defining what you do and don't want to crawl. If you do, perhaps you can implement that logic in a custom plugin that decides whether a URL is valid for crawling or not. For the examples that you've provided I don't see any way other than manual inspection, which is indeed time consuming.
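If such a rule can be written down, the usual extension point is a custom URL filter plugin. A minimal sketch against Nutch 1.x's URLFilter interface; the class name, the hard-coded paths, and the surrounding plugin wiring (plugin.xml, build files) are hypothetical and only meant to show the shape:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical filter: keeps the sections we care about, drops the rest.
    public class SectionURLFilter implements URLFilter {

      private Configuration conf;

      @Override
      public String filter(String urlString) {
        // Returning the URL accepts it, returning null rejects it;
        // anything we don't recognize is passed on to the next filter.
        if (urlString.contains("/triangle/news")) {
          return urlString;
        }
        if (urlString.contains("/triangle/jobs/")
            || urlString.contains("/triangle/calendar/")
            || urlString.contains("/triangle/people/")) {
          return null;
        }
        return urlString;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }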

Figuring out these regexes involves me mousing over links on a site in the Chrome browser and in a text-only browser. It’s a little time consuming and I have 200+ sites to set up. I’ll try standing up a separate instance of Nutch plus the link-extractor and D3.js solution.

This plugin should give you a simpler view of your sites' structure (with the help of d3.js), but then you'll need two passes over your seed URLs: one to populate Solr/ES with the link structure of each page and another to crawl only the actual URLs of interest.
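In practice the two passes could look roughly like this, assuming the Nutch 1.x bin/crawl script and a local Solr (directory and core names are placeholders):

    # pass 1: shallow crawl of the raw seeds, just to index the link structure
    bin/crawl urls/ crawl-discovery/ http://localhost:8983/solr/discovery 1

    # ...inspect the indexed inlinks/outlinks, derive the URL filters...

    # pass 2: the real, filtered crawl of the URLs you actually want
    bin/crawl urls/ crawl-content/ http://localhost:8983/solr/content 3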

Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@qsfllc.com

QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226


On Mar 30, 2015, at 3:32 PM, Jorge Luis Betancourt González <jl...@uci.cu> wrote:

What are you using url-regexfilter.txt for? What is your goal? Crawling only the websites of your interest, meaning not "leaving" your seed URLs? If the website design changes, as long as the URLs stay the same this shouldn't be such a big deal.

By default Nutch doesn't index the link structure (inlinks & outlinks) of each page. You can use [1], which will allow you to store this information in Solr/ES, although it only works for Nutch 1.x. After this you can write a small application that generates what you want; for instance, I've used [1] and d3.js to create some simple graphs of the link structure of the crawled sites. This is not exactly what you want, but it can be a starting point. I think a sitemap generator shouldn't be too hard to create from the indexed inlinks & outlinks, or by pulling the data directly out of Nutch's stored info.

[1] https://github.com/jorgelbg/links-extractor
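For the last option, Nutch 1.x can also dump the link structure it already stores. Roughly (a sketch; paths are placeholders), after the segments have been crawled:

    # build/refresh the linkdb from the crawled segments
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    # dump it as text: one record per URL, listing its inlinks
    bin/nutch readlinkdb crawl/linkdb -dump linkdump/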

----- Original Message -----
From: "Scott Lundgren" <sl...@qsfllc.com>>
To: user@nutch.apache.org<ma...@nutch.apache.org>
Sent: Monday, March 30, 2015 12:48:40 PM
Subject: [MASSMAIL]Re: website structure discovery?

Sorta. I’m using Nutch to crawl and index very specific areas of content on a test website, resulting in a highly crafted url-regexfilter.txt file. The downside is that the process is brittle: a website redesign breaks the setup. It’s also a slow process that I have to repeat for each site, and eventually I want to be crawling & indexing several hundred specific sites. So I need a way to index and “onboard” a new site in an automated way.

So I’m wondering if Nutch is the best spider/tool to run through an entire site and produce, as output, a visual graph or text representation of the site’s directory/URL structure when a sitemap file is not available.


On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which
is used to discover the web graph of a domain or of the entire
internet - from pages, to documents, to images, to other web
resources.

Nutch crawls: it identifies URLs, fetches them, parses them, and
indexes them for search. It can do this in a scalable fashion that
grows with the size of what you are trying to discover.
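For the archives, that cycle maps onto the individual Nutch 1.x commands roughly as follows (a sketch; directory names are placeholders):

    bin/nutch inject crawl/crawldb urls/                # seed the crawl db
    bin/nutch generate crawl/crawldb crawl/segments     # pick URLs to fetch
    bin/nutch fetch crawl/segments/<segment>            # fetch the pages
    bin/nutch parse crawl/segments/<segment>            # parse content & outlinks
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>   # fold results back in
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments      # build the link graph
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment>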

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Scott Lundgren <sl...@qsfllc.com>
Reply-To: user@nutch.apache.org
Date: Monday, March 30, 2015 at 5:56 AM
To: user@nutch.apache.org
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a
website, is Nutch a good tool for this problem?
Would you recommend a different tool?
