Posted to user@manifoldcf.apache.org by Dishanker Raj <di...@adm.uib.no> on 2013/07/04 12:46:17 UTC

How to crawl a Windows network share?

Hi all!

I have set up ManifoldCF v1.2 on a RHEL6 box running Tomcat v6.0.24 to crawl just one Windows network share path. The fetched documents are to be fed into a Solr core.

The binary release of ManifoldCF is set up as a "simplified multi-process" instance, similar to what is described in the documentation here: http://goo.gl/qUkxf .

Both connections are reported as working by http://<reverse-proxy-to-tomcat-server>/mcf-crawler-ui/ .

What I am uncertain of is whether I should only click "Start" in the crawler UI (list jobs), only run the './start-agents.sh' shell script, or do both.

In any case, the crawling of the remote Windows share and the indexing into Solr never completes. The Tomcat/ManifoldCF log files show no obvious errors.

Previously I had "permission denied"-type error messages in the log file because user 'tomcat' could not write to files in the 'syncharea' folder after my user ran './initialize.sh' and './start-agents.sh'. I have worked around this by running 'chmod -R 777' on the entire 'syncharea' directory, for testing purposes only.
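
As a less drastic alternative to 777, I am also considering group ownership; a rough sketch, assuming a 'tomcat' group exists alongside the 'tomcat' user (the directory name is from my setup):

    sudo chgrp -R tomcat syncharea                     # hand the synch area to the tomcat group
    sudo chmod -R g+rwX syncharea                      # group read/write; execute only on directories and already-executable files
    sudo find syncharea -type d -exec chmod g+s {} +   # files created later inherit the tomcat group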

I would greatly appreciate it if someone has pointers on what I could do next to troubleshoot the above.

Thanks!

Sincerely,
Dishanker Raj

Re: How to crawl a Windows network share?

Posted by Karl Wright <da...@gmail.com>.
Simplified multi-process means that the agents process is not the same process as
the web applications.  So yes, you have to run the agents process before
any crawling will happen.

The documentation you refer to seems quite clear on this matter:

"
Initializing the database and running

If you run the multiprocess model, after you first start the database
(using *start-database[.sh|.bat]*), you will need to initialize the
database before you start the agents process or use the crawler UI. To do
this, all you need to do is run the *initialize[.sh|.bat]* script. Then,
you will need to start the web applications (using *start-webapps[.sh|.bat]*)
and the agents process (using *start-agents[.sh|.bat]*)."
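
In shell terms that sequence looks roughly like this (the path below is a placeholder; cd into whichever directory of the binary distribution actually contains these scripts):

    cd /path/to/manifoldcf/multiprocess-example   # placeholder path; adjust to your install
    ./start-database.sh    # only when using the bundled database rather than an external one
    ./initialize.sh        # one-time database initialization
    ./start-webapps.sh     # skip if the webapps are deployed under Tomcat instead
    ./start-agents.sh      # the agents process; nothing is crawled until this is running

(Each of the start scripts launches a long-running process, so run each one in its own terminal or background it.)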


I strongly suggest that you first try to run the example out of the box, without
involving Tomcat or any other complication, just to see how everything works
up front.
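
For comparison, the out-of-the-box quick start is just something like the following (a sketch from memory of the 1.x binary layout; the 'example' directory, the start.jar launcher, and port 8345 should be double-checked against the included documentation):

    cd example
    java -jar start.jar    # single-process example with embedded Jetty and database
    # then browse to http://localhost:8345/mcf-crawler-ui/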



Karl


