Posted to java-user@lucene.apache.org by bruce <be...@earthlink.net> on 2008/08/14 23:51:57 UTC

lucene/nutch question...

Hi.

Got a very basic lucene/nutch question.

Assume I have a page that has a form. Within the form are a number of
select/drop-down boxes, etc. Each element supplies a variable that becomes
part of the query string appended to the form action. Is there a way for
lucene/nutch to build up those action URLs from the query string variables,
so that it can actually crawl each possible combination of URLs?

Also, is nutch/lucene the right app to use in this scenario, or is there a
better app to handle this kind of process?

Thanks

-bruce









RE: lucene/nutch question...

Posted by bruce <be...@earthlink.net>.
Hi Roman...

umm no. assume you have a web page, and the page has a form on it. within the form, there might be multiple elements (select lists, etc...). each element would have a name, and each name plus its chosen value would be appended to the form action to create the full query url...

sort of like:
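<form action="test.php" method="get">
 <select name="foo">
  <option value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
  <option value="4">4</option>
 </select>

 <select name="cat">
  <option value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
 </select>
</form>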

so you'd get the following urls in this pseudo example:
 test.php?foo=1&cat=1
 test.php?foo=1&cat=2
 test.php?foo=1&cat=3
 test.php?foo=2&cat=1
 test.php?foo=2&cat=2
 test.php?foo=2&cat=3
 test.php?foo=3&cat=1
 test.php?foo=3&cat=2
 test.php?foo=3&cat=3
 test.php?foo=4&cat=1
 test.php?foo=4&cat=2
 test.php?foo=4&cat=3
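
The list above is just the cartesian product of the values of each form variable. A minimal Python sketch of that enumeration, assuming the field names and values have already been pulled out of the form (the helper name form_urls and the fields dict are made up for illustration, and this is not something nutch provides out of the box):

```python
from itertools import product
from urllib.parse import urlencode


def form_urls(action, fields):
    """Yield one GET url per combination of field values.

    `fields` maps each form variable name to its list of possible values.
    """
    names = list(fields)
    # product() walks every combination, varying the last field fastest.
    for combo in product(*(fields[name] for name in names)):
        yield action + "?" + urlencode(list(zip(names, combo)))


# Hypothetical input matching the example form in this message.
fields = {"foo": ["1", "2", "3", "4"], "cat": ["1", "2", "3"]}
urls = list(form_urls("test.php", fields))

for url in urls:
    print(url)
```

The number of urls is the product of the option counts (4 x 3 = 12 here), so forms with many large select lists grow quickly and would probably need some kind of cap.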

with these urls, the app can then continue to crawl the resulting pages. so, i'm looking for some sort of crawler that already does this kind of form analysis within the page.

i know i can create a python/perl script for a single site/page.. but since i'm looking at 100s of sites...
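
For reference, the single-page version of such a script might look roughly like the sketch below. It uses only Python's standard html.parser module; the class name SelectFormParser and the sample PAGE markup are assumptions made up for illustration, and it only handles one form per page with select elements whose options carry an explicit value attribute.

```python
from html.parser import HTMLParser


class SelectFormParser(HTMLParser):
    """Collect the form action and the option values of each <select>."""

    def __init__(self):
        super().__init__()
        self.action = None     # action attribute of the last <form> seen
        self.fields = {}       # select name -> list of option values
        self._current = None   # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "select" and attrs.get("name"):
            self._current = attrs["name"]
            self.fields[self._current] = []
        elif tag == "option" and self._current is not None:
            # Options without a value attribute are skipped in this sketch.
            if attrs.get("value") is not None:
                self.fields[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


# Hypothetical page content, modelled on the example form in this message.
PAGE = """
<form action="test.php" method="get">
 <select name="foo">
  <option value="1">1</option><option value="2">2</option>
  <option value="3">3</option><option value="4">4</option>
 </select>
 <select name="cat">
  <option value="1">1</option><option value="2">2</option>
  <option value="3">3</option>
 </select>
</form>
"""

parser = SelectFormParser()
parser.feed(PAGE)
print(parser.action, parser.fields)
```

Real pages would also need handling for multiple forms, POST forms, text inputs, and relative action urls, which is where the per-site effort comes from.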

this is why i'm asking about nutch/lucene/solr...

thanks


-----Original Message-----
From: brainstorm [mailto:braincode@gmail.com]
Sent: Thursday, August 14, 2008 3:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: lucene/nutch question...


If I understand correctly, you are looking for a way to test/fill
forms... if that's the case, I recommend the following tools:

http://wtr.rubyforge.org/
http://search.cpan.org/~petdance/WWW-Mechanize-1.34/lib/WWW/Mechanize.pm

But I guess that with some coding effort, nutch can also achieve what you want.

Regards,
Roman



Re: lucene/nutch question...

Posted by brainstorm <br...@gmail.com>.
If I understand correctly, you are looking for a way to test/fill
forms... if that's the case, I recommend the following tools:

http://wtr.rubyforge.org/
http://search.cpan.org/~petdance/WWW-Mechanize-1.34/lib/WWW/Mechanize.pm

But I guess that with some coding effort, nutch can also achieve what you want.

Regards,
Roman
