Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2012/03/01 10:51:59 UTC

Re: nutch crawling

Hi,
A similar question was posted yesterday ("Query in nutch"). As Lewis
suggested, NUTCH-585 [1] might be what you need.

Best,
Elisabeth

  [1] https://issues.apache.org/jira/browse/NUTCH-585

On 29.02.2012 12:15, sanjay87 wrote:
> Hi Techies,
>
> I have some questions about Nutch, the web crawler. I have finished
> crawling the website and indexing it in Solr, but the problem is that the
> Nutch crawler crawls at the domain level, i.e. it picks up the menu items,
> anchor text and everything else that is not actually needed.
>
> I only need to crawl the legitimate content present in the site.
>
> I tried to crawl the localhost:8080/solr/admin page and the response is not
> legitimate.
>
> The content field contains all the data that is not actually needed.
>
> We have tried a lot of options and are still unable to find a solution.
> Please share your input.
>
> Thanks
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-tp3786913p3786913.html