Posted to solr-user@lucene.apache.org by Teague James <te...@insystechinc.com> on 2014/01/07 20:27:21 UTC

Indexing URLs from websites

I am trying to index a website that contains links to documents such as PDF
and Word files. The intent is to store the URLs of the links to those
documents.

For example, when indexing www.example.com, which has a link on the page
labeled "Example Document" that points to www.example.com/docs/example.pdf,
I want Solr to store the text of the link, "Example Document", and the URL
of the link, "www.example.com/docs/example.pdf", in separate fields. I've
tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
content, but I am not getting the URLs from the links. There are no document
type restrictions in Nutch for PDF or Word. Any suggestions on how I can
accomplish this? Should I use a different method than Nutch for crawling the
site?

I appreciate any help on this!
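
To make the goal above concrete, here is a minimal sketch (not from the
thread) of pulling the link text and the link URL out of a page into
separate fields, using only the Python standard library. The field names
"link_text" and "link_url", the use of the URL as "id", and the list of
document extensions are assumptions for illustration; they would have to
match whatever the Solr schema actually defines.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: which file extensions count as "documents"; adjust as needed.
DOC_EXTENSIONS = (".pdf", ".doc", ".docx")


class LinkExtractor(HTMLParser):
    """Collect (anchor text, absolute URL) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # absolute href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def document_links(html, base_url):
    """Return one dict per link whose URL path ends in a document extension.

    The keys are hypothetical Solr field names, not fields Solr or Nutch
    define by default.
    """
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return [
        {"id": url, "link_text": text, "link_url": url}
        for text, url in parser.links
        if urlparse(url).path.lower().endswith(DOC_EXTENSIONS)
    ]
```

Calling document_links(page_html, "http://www.example.com/") on a page with
the link described above would yield one dict holding "Example Document" and
the absolute URL of example.pdf. Those dicts could then be sent to Solr as a
JSON array through its update handler, provided the schema has matching
fields.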

Re: Indexing URLs from websites

Posted by Otis Gospodnetic <ot...@gmail.com>.

You could use something like Apache Droids -
http://incubator.apache.org/droids/

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

