Posted to user@nutch.apache.org by Semyon Semyonov <se...@mail.com> on 2017/11/03 15:59:17 UTC

Nutch(plugins) and R

Hello,

I'm looking for a way to use R in Nutch, particularly in the HTML parser, although use in other parts could be interesting as well. For each parsed document I would like to run a script and feed the results back into the system, e.g. topic detection for the document.
 
NB: I'm not looking for a way of scaling R onto Hadoop or HDFS, as Microsoft R Server does. That approach uses Hadoop as an execution engine after the crawling process. In other words, a computationally intensive full crawl comes first, followed by another computationally intensive R/Hadoop process.
 
Instead, I'm looking for a way to call R scripts directly from the Java code of map or reduce jobs. Any ideas on how to do this? One option is "Rserve - Binary R server", but I'm looking for alternatives so that I can compare efficiency.
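
To make the question concrete, one obvious alternative to Rserve would be to start an external process per document from the Java side and exchange text over stdin/stdout. Below is a minimal sketch of that idea. It is not taken from Nutch: the class ExternalScriptRunner and its run method are hypothetical names, and it relies only on the standard java.lang.ProcessBuilder API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper (not part of Nutch): runs an external command,
// sends it a document on stdin and returns whatever it prints on stdout.
class ExternalScriptRunner {

    static String run(List<String> command, String input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the child's error output show up in the task's own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        final Process process = builder.start();
        final byte[] bytes = input.getBytes(StandardCharsets.UTF_8);

        // Feed stdin from a separate thread so that a child which starts
        // writing before it has read everything cannot block both sides.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(bytes);
            } catch (IOException e) {
                // The child closed stdin early; the exit code check
                // below reports real failures.
            }
        });
        writer.start();

        // Collect the child's stdout until it closes the stream.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stdout.read(buffer)) != -1) {
                collected.write(buffer, 0, read);
            }
        }

        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

From a parse plugin this might be called as ExternalScriptRunner.run(Arrays.asList("Rscript", "detect_topic.R"), parsedText), where detect_topic.R is a hypothetical script that reads the text from stdin (for example with readLines(file("stdin"))) and prints the detected topic. Starting a new R process for every document is probably expensive, which is exactly why I would like to compare it with other approaches.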

Semyon.