Posted to java-user@lucene.apache.org by bruce <be...@earthlink.net> on 2008/08/14 23:51:57 UTC
lucene/nutch question...
Hi.
Got a very basic lucene/nutch question.
Assume I have a page that has a form. Within the form are a number of
select/drop-down boxes, etc. Each of these elements supplies a
variable that becomes part of the query string sent to the form
action. Is there a way for lucene/nutch to build up the action URLs from
the query string vars, so that lucene/nutch can actually search through
each possible combination of URLs?
Also, is nutch/lucene the right app to use in this scenario? Is
there a better app to handle this kind of process?
Thanks
-bruce
RE: lucene/nutch question...
Posted by bruce <be...@earthlink.net>.
Hi Roman...
umm no. assume you have a web page, and the page has a form on it. within the form, there might be multiple elements (lists/select statements, etc...). each item would have a varname, which would in turn be used as part of the form action, to create the entire query...
sort of like:
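<form action="test.php" method="get">
  <select name="foo">
    <option value="1">1</option>
    <option value="2">2</option>
    <option value="3">3</option>
    <option value="4">4</option>
  </select>
  <select name="cat">
    <option value="1">1</option>
    <option value="2">2</option>
    <option value="3">3</option>
  </select>
</form>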
so you'd get the following urls in this pseudo example:
test.php?foo=1&cat=1
test.php?foo=1&cat=2
test.php?foo=1&cat=3
test.php?foo=2&cat=1
test.php?foo=2&cat=2
test.php?foo=2&cat=3
test.php?foo=3&cat=1
test.php?foo=3&cat=2
test.php?foo=3&cat=3
test.php?foo=4&cat=1
test.php?foo=4&cat=2
test.php?foo=4&cat=3
with this, the app can then continue to crawl the pages. so, i'm looking for some sort of crawler that already does this kind of analysis within the page.
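A minimal sketch of the URL expansion described above, in Python, using only the standard library. It is not how nutch or lucene work; it only illustrates the combination step. The names FormSelectParser, form_urls and SAMPLE and the example.com address are made up for illustration. It assumes one GET form per page, select elements whose options carry explicit value attributes, and a form action with no query string of its own.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormSelectParser(HTMLParser):
    """Collects the form action and the option values of each named <select>.

    Simplification (assumption): the page holds a single form, so all
    <select> elements found are treated as belonging to it.
    """

    def __init__(self):
        super().__init__()
        self.action = None    # action attribute of the first <form> seen
        self.fields = {}      # select name -> option values, in page order
        self._current = None  # name of the <select> being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
        elif tag == "select" and attrs.get("name"):
            self._current = attrs["name"]
            self.fields.setdefault(self._current, [])
        elif tag == "option" and self._current is not None:
            value = attrs.get("value")
            if value is not None:  # options without a value are skipped
                self.fields[self._current].append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def form_urls(page_url, html):
    """Return one URL per combination of select values (cartesian product)."""
    parser = FormSelectParser()
    parser.feed(html)
    names = list(parser.fields)
    # Resolve a relative action such as "test.php" against the page URL.
    base = urljoin(page_url, parser.action or "")
    return [
        base + "?" + urlencode(list(zip(names, combo)))
        for combo in product(*(parser.fields[name] for name in names))
    ]


# Hypothetical page content mirroring the foo/cat example.
SAMPLE = """
<form action="test.php" method="get">
  <select name="foo">
    <option value="1">1</option><option value="2">2</option>
    <option value="3">3</option><option value="4">4</option>
  </select>
  <select name="cat">
    <option value="1">1</option><option value="2">2</option>
    <option value="3">3</option>
  </select>
</form>
"""

if __name__ == "__main__":
    for url in form_urls("http://example.com/", SAMPLE):
        print(url)
```

The number of URLs is the product of the option counts (4 x 3 = 12 here), so a form with many large drop-downs can explode quickly; a crawler doing this across 100s of sites would likely need a cap on combinations per form.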
i know i can create a python/perl script for a single site/page.. but since i'm looking at 100s of sites...
this is why i'm asking about nutch/lucene/solr...
thanks
-----Original Message-----
From: brainstorm [mailto:braincode@gmail.com]
Sent: Thursday, August 14, 2008 3:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: lucene/nutch question...
If I understand correctly, you are looking for a way to test/fill
forms... if that's the case, I recommend the following tools:
http://wtr.rubyforge.org/
http://search.cpan.org/~petdance/WWW-Mechanize-1.34/lib/WWW/Mechanize.pm
But I guess that with coding effort, nutch can also achieve what you want.
Regards,
Roman
Re: lucene/nutch question...
Posted by brainstorm <br...@gmail.com>.
If I understand correctly, you are looking for a way to test/fill
forms... if that's the case, I recommend the following tools:
http://wtr.rubyforge.org/
http://search.cpan.org/~petdance/WWW-Mechanize-1.34/lib/WWW/Mechanize.pm
But I guess that with coding effort, nutch can also achieve what you want.
Regards,
Roman