Posted to solr-user@lucene.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2012/04/20 21:26:16 UTC

Crawling an SCM to update a Solr index

Hello everyone,

I'm pulling together requirements for an SCM (source code management) crawling mechanism for our Solr index. I probably don't need to argue the need for a crawler, but to be specific: our index receives its updates from a custom-built application. I would, however, like to crawl the SCM periodically to ensure the index is up to date. In addition, if changes are made that require a complete reindex (such as schema.xml modifications), I could use this crawler to update everything or only specific areas.

I'm wondering whether there are any initiatives, tools (like Nutch), or whitepapers out there for crawling an SCM. More specifically, I'm looking for a Perforce solution. I'm guessing there is nothing specific, and I'm prepared to design to our own requirements, but I wanted to check with the Solr community before getting too far in.

I'm most likely going to build the solution to interact with the SCM directly (via its API) rather than syncing the SCM repository to the filesystem and crawling it there, both because filesystem problems could occur while syncing the data and because the SCM may hold relevant metadata that can be retrieved directly.
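To make the "talk to the SCM directly" idea concrete, here is a minimal sketch of how per-file metadata could be turned into a Solr update payload. It assumes the records look like the output of Perforce's `p4 fstat` (fields such as `depotFile`, `headRev`, `headChange`, `headTime`, `headAction`), fetched however you like, for example through P4Python. The Solr field names (`revision_i`, `changelist_i`, `modified_dt`) and both function names are hypothetical and would need to match your own schema.xml.

```python
import json
from datetime import datetime, timezone


def fstat_to_solr_doc(record):
    """Map one fstat-style record (a dict of strings) to a Solr document.

    The target field names are assumptions, not part of any real schema.
    """
    # headTime is a Unix timestamp; Solr date fields expect ISO 8601 in UTC.
    head_time = datetime.fromtimestamp(int(record["headTime"]), tz=timezone.utc)
    return {
        "id": record["depotFile"],
        "revision_i": int(record["headRev"]),
        "changelist_i": int(record["headChange"]),
        "modified_dt": head_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def build_update_payload(records):
    """Build a JSON array of documents, skipping files deleted at head."""
    docs = [
        fstat_to_solr_doc(record)
        for record in records
        # Covers both "delete" and "move/delete" head actions.
        if "delete" not in record.get("headAction", "")
    ]
    return json.dumps(docs)
```

The resulting string is the kind of JSON array that could be posted to a Solr JSON update handler. Files deleted at head are only skipped here; removing their documents from the index would need a separate delete request.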

Thanks in advance for any information you may have,
Kristian

RE: Crawling an SCM to update a Solr index

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
Otis,

Thanks for the input! Were it not for the metadata I need to extract, and the slight possibility that a sync error, filesystem error, or inconsistency could occur, I would take that same route.

-Kristian

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Friday, April 20, 2012 10:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Crawling an SCM to update a Solr index

Kristian,

For what it's worth, for http://search-lucene.com and http://search-hadoop.com we simply check out the source code from the SCM and index from the file system. It works reasonably well. The only issues I can recall us having are with the source code organization under SCM: modules get moved around, and sometimes this requires us to update things on our end to match those changes.

Otis
----
Performance Monitoring for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html




Re: Crawling an SCM to update a Solr index

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Kristian,

For what it's worth, for http://search-lucene.com and http://search-hadoop.com we simply check out the source code from the SCM and index from the file system. It works reasonably well. The only issues I can recall us having are with the source code organization under SCM: modules get moved around, and sometimes this requires us to update things on our end to match those changes.
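As a rough illustration of the check-out-and-index approach (a sketch under assumptions, not the code behind those sites), the following walks an already checked-out working copy and builds one document per source file. The function name, the default extension list, and the `content_t` field are hypothetical.

```python
import os


def docs_from_working_copy(root, extensions=(".java", ".xml")):
    """Collect one Solr-style document per matching file under root."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            full_path = os.path.join(dirpath, name)
            # Replace undecodable bytes instead of failing on odd encodings.
            with open(full_path, encoding="utf-8", errors="replace") as handle:
                content = handle.read()
            # Use the path relative to the checkout as a stable-looking id.
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            docs.append({"id": relative, "content_t": content})
    return docs
```

Because the ids here are derived from paths, moving a module changes the ids of its files, so the old documents would linger in the index until they are deleted. That is one plausible form of the reorganization problem described above.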

Otis
----
Performance Monitoring for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html


