You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by bruce <be...@earthlink.net> on 2009/03/04 22:30:01 UTC

crawler questions..

Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information.

My issue is now that I have a "basic" app that works, i need to figure out
how I guarantee that I'm correctly crawling the site. How do I know when
I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree..

How do I know when I have a complete "tree"!

I'm looking for someone, or some group/prof that I can talk to about these
issues. My goal is to eventually look at using nutch/lucene if at all
applicable.

Any pointers, or people, or papers, etc... would be helpful.

Thanks





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: crawler questions..

Posted by adasal <ad...@gmail.com>.

That's interesting.
I've been working in python recently, not crawling though.
But, as ever, the more you get into it the more curious you get.
Did you come up with a solution to a node error?
Are you really talking about a broken link, or are you just saying the
bottom of the tree has been reached?
Presumably the last one would be when every link on every page has been
followed, which means you have to track what pages have been crawled and
find a way of uniquely and correctly identifying them internally? I think
the problem is that while a URL might be unique, there can be more than one
URL pointing to the same content - for instance in struts where action a and
action b are appended to a URL but produce the same result. I believe I am
right about this.
In the site that I am working on google have told us they are unable to
crawl the whole site because some URLs result in a loop - another problem.
It would be cool if you have solved these sorts of problems, or rather can
identify where they are on a site in a quick and easy way.

Best,
Adam

2009/3/4 bruce <be...@earthlink.net>

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl
> the
> site. I crawl at the top level, and work my way down to getting the
> required
> course/class schedule. The app works. I can consistently run it and extract
> the information.
>
> My issue is now that I have a "basic" app that works, i need to figure out
> how I guarantee that I'm correctly crawling the site. How do I know when
> I've got an error at a given node/branch, so that the app knows that it's
> not going to fetch the underlying branch/nodes of the tree..
>
> How do I know when I have a complete "tree"!
>
> I'm looking for someone, or some group/prof that I can talk to about these
> issues. My goal is to eventually look at using nutch/lucene if at all
> applicable.
>
> Any pointers, or people, or papers, etc... would be helpful.
>
> Thanks
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: crawler questions..

Posted by Tim Williams <wi...@gmail.com>.

On Wed, Mar 4, 2009 at 4:41 PM, Grant Ingersoll <gs...@apache.org> wrote:
> You might have a look at Droids (http://incubator.apache.org/droids/) or
> Nutch (http://lucene.apache.org/nutch) and their communities.  They are much
> more focused on crawling (not to say there aren't people here who crawl,
> just saying those projects are (mostly) about crawling)
>
>
> On Mar 4, 2009, at 4:30 PM, bruce wrote:
>
>> Hi...
>>
>> Sorry that this is a bit off track. Ok, maybe way off track!
>>
>> But I don't have anyone to bounce this off of..
>>
>> I'm working on a crawling project, crawling a college website, to extract
>> course/class information. I've built a quick test app in python to crawl
>> the
>> site. I crawl at the top level, and work my way down to getting the
>> required
>> course/class schedule. The app works. I can consistently run it and
>> extract
>> the information.
>>
>> My issue is now that I have a "basic" app that works, i need to figure out
>> how I guarantee that I'm correctly crawling the site. How do I know when
>> I've got an error at a given node/branch, so that the app knows that it's
>> not going to fetch the underlying branch/nodes of the tree..
>>
>> How do I know when I have a complete "tree"!
>>
>> I'm looking for someone, or some group/prof that I can talk to about these
>> issues. My goal is to eventually look at using nutch/lucene if at all
>> applicable.
>>
>> Any pointers, or people, or papers, etc... would be helpful.

The ir-book[1] also has a section on the subject if you hadn't already
seen it.

--tim

[1] - http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: crawler questions..

Posted by Grant Ingersoll <gs...@apache.org>.

You might have a look at Droids (http://incubator.apache.org/droids/)  
or Nutch (http://lucene.apache.org/nutch) and their communities.  They  
are much more focused on crawling (not to say there aren't people here  
who crawl, just saying those projects are (mostly) about crawling)


On Mar 4, 2009, at 4:30 PM, bruce wrote:

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to  
> extract
> course/class information. I've built a quick test app in python to  
> crawl the
> site. I crawl at the top level, and work my way down to getting the  
> required
> course/class schedule. The app works. I can consistently run it and  
> extract
> the information.
>
> My issue is now that I have a "basic" app that works, i need to  
> figure out
> how I guarantee that I'm correctly crawling the site. How do I know  
> when
> I've got an error at a given node/branch, so that the app knows that  
> it's
> not going to fetch the underlying branch/nodes of the tree..
>
> How do I know when I have a complete "tree"!
>
> I'm looking for someone, or some group/prof that I can talk to about  
> these
> issues. My goal is to eventually look at using nutch/lucene if at all
> applicable.
>
> Any pointers, or people, or papers, etc... would be helpful.
>
> Thanks
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org