Posted to dev@nutch.apache.org by Muhammad UMER <mu...@hotmail.com> on 2017/09/05 10:01:01 UTC

Nutch: crawl by specific word rather than by specific URL, then get structured data and store it in HBase

Hi All,

I am new to Apache Nutch and am using it to crawl some sites, then filter and extract content based on words in the page rather than on the URL. For example:


  1.  I have to crawl only those sites that contain words like 'shop' or 'product' in their content (text). If such a word is not present, Nutch should not crawl any further links.
  2.  I want to get structured data (JSON fields, e.g. text, url, metadata, etc.) instead of unstructured data (the whole page source).

Any help would be appreciated.

Regards
Muhammad umer

How can Nutch crawl by specific word rather than by specific URL, then get structured data and store it in HBase?

Posted by Muhammad UMER <mu...@hotmail.com>.

Hi All,

I am new to Apache Nutch and am using it to crawl some sites, then filter and extract content based on words in the page rather than on the URL. For example:


  1.  I have to crawl only those sites that contain words like 'shop' or 'product' in their content (text). If these words are not present, Nutch should not crawl any further links on that page and should leave the page out of further parsing.
  2.  Apache Nutch interacts directly with HBase and dumps the whole HTML source of each web page, but I want structured data (JSON format, e.g. text, url, metadata, etc.) instead of unstructured data (the whole page source).
  3.  Apache Nutch then sends this data to Solr, where it is indexed and structured. However, I want to show this data on my own web page instead of the Solr web page. How can I get this data in a structured format, categorized by the words I provide to Nutch?
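
To make points 1 and 2 concrete, here is a rough sketch in plain Java of the behaviour I am after. It does not use any Nutch, HBase or Solr API; the class KeywordGate, its methods and the field name matchedKeywords are made up purely for illustration, and the keyword list is assumed to be something I would configure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: illustrates the desired logic only.
public class KeywordGate {

    // Words that make a page worth keeping (assumed to be user-provided).
    public static final List<String> KEYWORDS = Arrays.asList("shop", "product");

    // Returns the keywords found in the page text (case-insensitive substring match).
    public static List<String> matchedKeywords(String pageText) {
        String lower = pageText.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }

    // True if at least one keyword occurs in the page text.
    public static boolean matches(String pageText) {
        return !matchedKeywords(pageText).isEmpty();
    }

    // Point 1: follow the outlinks only when the page matches; otherwise stop here.
    public static List<String> linksToFollow(String pageText, List<String> outlinks) {
        return matches(pageText) ? outlinks : Collections.<String>emptyList();
    }

    // Point 2: a structured record (url, text, metadata) instead of raw page source,
    // tagged with the matched keywords so it can be categorized later.
    public static Map<String, Object> toRecord(String url, String text,
                                               Map<String, String> metadata) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("url", url);
        record.put("text", text);
        record.put("metadata", metadata);
        record.put("matchedKeywords", matchedKeywords(text));
        return record;
    }

    public static void main(String[] args) {
        String text = "Welcome to our online shop";
        List<String> outlinks = Arrays.asList("http://example.com/a", "http://example.com/b");
        System.out.println(linksToFollow(text, outlinks));
        System.out.println(toRecord("http://example.com/", text,
                Collections.singletonMap("title", "Home")));
    }
}
```

In other words, a page without any of the words yields no links to follow, and a matching page yields one record with separate fields that could be serialized to JSON and stored.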

That is what I want to achieve. Any help would be appreciated.

Regards
Muhammad umer
