Posted to user@nutch.apache.org by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in> on 2011/07/15 14:04:57 UTC

Re: Is it possible to crawl yahoo answer?

I don't think that should be a problem, though you would have to try it to
know for certain, because I am not sure whether Nutch will crawl an
encrypted URL. (Experts, please help here.)

Just make sure the following line is commented out in crawl-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

And add the following line:

+^http://answers.yahoo.com/([a-zA-Z0-9-_/]*)

Hopefully it should work. Good luck.
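
To see what the two lines above do to a URL, here is a minimal Python sketch of
how the filter file is usually described as working. It is an approximation,
not Nutch's own code, and it assumes that rules are tried top to bottom, that
the first rule whose pattern is found anywhere in the URL decides, and that a
URL matching no rule is dropped. The helper names (parse_rules, accepts) are
made up for illustration, and the http:// prefix on the sample question URL is
an assumption.

```python
import re

def parse_rules(text):
    """Turn filter-file text into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        # Blank lines and lines starting with '#' carry no rule.
        if not line.strip() or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped

# Sample question URL; the scheme is assumed.
QUESTION_URL = ("http://answers.yahoo.com/question/index;"
                "_ylt=AtKz1xss1AS6RGeAQTFz1kyf5HNG;_ylv=3"
                "?qid=20110715030336AAzXnNs")

SKIP_ACTIVE = """\
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+^http://answers.yahoo.com/([a-zA-Z0-9-_/]*)
"""

SKIP_COMMENTED_OUT = """\
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+^http://answers.yahoo.com/([a-zA-Z0-9-_/]*)
"""

print(accepts(parse_rules(SKIP_ACTIVE), QUESTION_URL))         # False
print(accepts(parse_rules(SKIP_COMMENTED_OUT), QUESTION_URL))  # True
```

Under these assumptions, an active -[?*!@=] rule rejects any URL containing
'?' or '=' before the + rule is ever consulted, while the + rule on its own
accepts anything that starts with http://answers.yahoo.com/.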



--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-crawl-yahoo-answer-tp3171559p3171764.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Is it possible to crawl yahoo answer?

Posted by bbiglari <de...@gmail.com>.
Hi, 
I wonder whether you were able to crawl Yahoo Answers. I am trying to do it
now, and I would appreciate it if you could give me some advice beforehand.

Thanks for your help, 

Best,

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-crawl-yahoo-answer-tp3171559p3411176.html

Re: Is it possible to crawl yahoo answer?

Posted by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in>.
I am not sure I managed to convey what I meant. Re-reading my previous
comment, I can see it was a bit confusing.

You are supposed to un-comment the line 

-[?*!@=] 

This will help Nutch crawl through URLs with special characters.

Please report back once you have tried this.

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-crawl-yahoo-answer-tp3171559p3178185.html

Re: Is it possible to crawl yahoo answer?

Posted by Kelvin <ks...@yahoo.com.sg>.
Hi Tamanjit,

Thank you for your help. I tried your suggestion, but it crawls every normal URL except URLs of this type:

answers.yahoo.com/question/index;_ylt=AtKz1xss1AS6RGeAQTFz1kyf5HNG;_ylv=3?qid=20110715030336AAzXnNs

I also tried the suggestion from

lucene.472066.n3.nabble.com/How-to-make-nutch-crawl-within-a-sub-category-of-an-URL-td619381.html

Use http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&sid=396545660  as the url to crawl.

Add this to crawl-urlfilter.txt:

+^http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&sid=396545660
+^http://answers.yahoo.com/question
-.

but it couldn't crawl anything.

Does this mean that Nutch can only crawl normal hyperlinks?
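
One possible explanation for the empty crawl, sketched in Python rather than
taken from Nutch itself: the first rule pastes the seed URL in verbatim, and in
a regular expression an unescaped '?' makes the preceding '3' optional instead
of matching a literal '?'. The sketch assumes the filter looks for the pattern
inside the URL and that the first matching rule decides. ESCAPED is a
hypothetical corrected rule body, and found is a made-up helper name.

```python
import re

# Seed URL quoted in this post.
SEED = ("http://answers.yahoo.com/dir/index;"
        "_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3"
        "?link=list&sid=396545660")

# Rule body as posted (leading '+' dropped): the URL pasted in verbatim.
AS_POSTED = "^" + SEED

# Hypothetical corrected rule body: '?' and '.' escaped so they match literally.
ESCAPED = (r"^http://answers\.yahoo\.com/dir/index;"
           r"_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3"
           r"\?link=list&sid=396545660")

def found(pattern, url):
    """True if the pattern occurs somewhere in the URL."""
    return re.search(pattern, url) is not None

print(found(AS_POSTED, SEED))  # False: '3?' consumes the '3', then 'l' meets '?'
print(found(ESCAPED, SEED))    # True
print(found(r".", SEED))       # True: the pattern of the final '-.' rule
```

If the filter behaves as assumed, the seed URL misses both + rules (it is under
/dir/, not /question) and falls through to -., so nothing is fetched at all. A
-[?*!@=] rule left active earlier in the file would reject the seed in the same
way.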



