Posted to agent@nutch.apache.org by Pablo Ovelleiro <bi...@gmail.com> on 2014/10/20 13:34:46 UTC

Solr + Nutch: save the seeds in Solr

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

I'm trying to set up Nutch + Solr to crawl a list of domains.
I want to fetch 50 pages per seed in the list (without following
external links) and store, with each page, the seed it came from.

The goal is to be able to query for a word and get all the seeds from
my list that lead to a page containing it.

Example:
I have a seed list with:

http://domainone.com
http://domaintwo.com
http://domainthree.com
http://domainfour.com
http://domainfive.com

I save 50 subpages from each of them to Solr (a total of 5*50=250 pages
indexed in Solr).

Now I query for "foobar" and want to get back the seeds from the
seed list that contain the word "foobar" themselves or have subpages
(e.g. http://domainthree.com/somepage.html) that contain it.
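
To make the desired result concrete, here is a minimal Python sketch of
the lookup I have in mind. It runs on an in-memory list rather than on
Solr. The "seed" field, the sample pages and the function name
seeds_matching are made up for illustration only; none of this is an
existing Nutch or Solr API.

```python
# Hypothetical sample of indexed pages. "seed" is the extra field I
# would like every document in Solr to carry; it does not exist yet.
PAGES = [
    {"url": "http://domainone.com/",
     "seed": "http://domainone.com",
     "content": "welcome to domain one"},
    {"url": "http://domainthree.com/somepage.html",
     "seed": "http://domainthree.com",
     "content": "this page mentions foobar"},
    {"url": "http://domainfive.com/",
     "seed": "http://domainfive.com",
     "content": "foobar on the front page"},
    {"url": "http://domainfive.com/about.html",
     "seed": "http://domainfive.com",
     "content": "more foobar here"},
]


def seeds_matching(pages, word):
    """Return the sorted, de-duplicated seeds whose pages contain `word`."""
    word = word.lower()
    return sorted(
        {page["seed"] for page in pages
         if word in page["content"].lower().split()}
    )


print(seeds_matching(PAGES, "foobar"))
# ['http://domainfive.com', 'http://domainthree.com']
```

Each seed is listed once, no matter how many of its pages match.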



How can I store in Solr the seed that each page originally came from?

Thanks,

binaryplease
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJURPNWAAoJEB9MwmVSfxihkwoH/RbAT5Qlhy2ZqAF5IlbesXR8
seDIKUsk019iWc8L2s7Pe2NcMaMc7tGwXR2ukbLLIO6Ltuygt0W3Odx9O+2YRtlh
XOG45Z3jvODZbYWRdQQ5uX6FdMkGMCz8xBxKKKfO35fsSVSiXSb2P4+taqvlFjSh
7ubpZCONCu124D5r5VhgtIlpWvolTWQLOXG2YDUqQrreFw2aSA7huUP8iyds5so2
Wp8IrBRWpFIHXA1AMeB5imgH+fmRxopg/lenUiyqHLTrxqIt3dIdipKk55+8qeks
yBtFkwZhktNfi2tmmZpVxckmyMO/Ru+S3Dhwcrvg2Yt5NR0znPBnAE62bAH9z9I=
=+H4B
-----END PGP SIGNATURE-----