Posted to user@nutch.apache.org by Tolga <to...@ozses.net> on 2012/05/16 09:43:49 UTC

curl or nutch

Hi,

I have been trying for a week. I really want to get started, so which
should I use: curl or Nutch? I want to be able to index PDF, XML, etc.,
and search within them as well.

Regards,

Re: curl or nutch

Posted by Otis Gospodnetic <ot...@yahoo.com>.
It can, as can ManifoldCF. But you should ask on the nutch-user list (this may also be documented on the wiki).

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 



>________________________________
> From: Tolga <to...@ozses.net>
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, May 16, 2012 8:11 AM
>Subject: Re: curl or nutch
> 
>Can nutch crawl/index files as well?
>
>On 5/16/12 12:29 PM, findbestopensource wrote:
>> You could very well use Solr. It has support to index the PDF and XML
>> files. If you want to index websites and search using page rank then choose
>> Nutch.
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>> On Wed, May 16, 2012 at 1:13 PM, Tolga<to...@ozses.net>  wrote:
>>
>>> Hi,
>>>
>>> I have been trying for a week. I really want to get a start, so what
>>> should I use? curl or nutch? I want to be able to index pdf, xml etc. and
>>> search within them as well.
>>>
>>> Regards,
>>>
>
>
>

Re: curl or nutch

Posted by Tolga <to...@ozses.net>.
Can nutch crawl/index files as well?

On 5/16/12 12:29 PM, findbestopensource wrote:
> You could very well use Solr. It has support to index the PDF and XML
> files. If you want to index websites and search using page rank then choose
> Nutch.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
> On Wed, May 16, 2012 at 1:13 PM, Tolga<to...@ozses.net>  wrote:
>
>> Hi,
>>
>> I have been trying for a week. I really want to get a start, so what
>> should I use? curl or nutch? I want to be able to index pdf, xml etc. and
>> search within them as well.
>>
>> Regards,
>>

Re: curl or nutch

Posted by findbestopensource <fi...@gmail.com>.
You could very well use Solr. It supports indexing PDF and XML
files. If you want to index websites and search using page rank, then choose
Nutch.
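
As a rough illustration of the "just use Solr" route, here is a minimal Python sketch that sends one file to Solr's extracting request handler (Solr Cell), which is what a curl one-liner would do. The host, port, and handler path in `SOLR_EXTRACT`, and the helper names `build_extract_url` and `post_file`, are assumptions for this example and not something stated in the thread; adjust them to your own install.

```python
import mimetypes
import urllib.parse
import urllib.request

# Assumed location of the extract handler; change to match your Solr setup.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, base=SOLR_EXTRACT, commit=True):
    """Build the handler URL; literal.id sets the document's unique key."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base + "?" + urllib.parse.urlencode(params)


def post_file(path, doc_id):
    """POST the raw file bytes to Solr and return the HTTP status code."""
    # Guess the MIME type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `post_file("manual.pdf", "manual-1")` against a running Solr would index that one file (the file name and id are made up). Anything beyond a single file, such as walking folders or skipping unchanged documents, is left to the caller.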

Regards
Aditya
www.findbestopensource.com


On Wed, May 16, 2012 at 1:13 PM, Tolga <to...@ozses.net> wrote:

> Hi,
>
> I have been trying for a week. I really want to get a start, so what
> should I use? curl or nutch? I want to be able to index pdf, xml etc. and
> search within them as well.
>
> Regards,
>

Re: curl or nutch

Posted by Tirthankar Chatterjee <tc...@commvault.com>.
If you use curl, you will need to track every document and recurse into folders yourself.
If you use Nutch, it takes care of incremental crawling in the configured locations and submits only the documents that changed since its previous run.
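
To make the bookkeeping concrete, here is a small Python sketch of what a curl-style script would have to do on its own: recurse through a folder and report only files that are new or modified since the last pass. This is an illustration under simple assumptions (modification time as the change signal, a hypothetical `changed_files` helper, state kept in a plain dict), not a description of how Nutch implements incremental crawling.

```python
import os


def changed_files(root, seen):
    """Return files under root that are new or modified since the last call.

    `seen` maps path -> last known modification time and is updated in place.
    Deleted files are not detected; a real crawler would need to handle them.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            # A missing entry (new file) or a different mtime counts as changed.
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return changed
```

Each path returned would then be submitted to the indexer, and `seen` would have to be saved between runs (for example to a file) for the next pass to be incremental.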

The lack of a simple file system crawler around Solr is a big disadvantage. You can look at the Aperture and ManifoldCF frameworks to compare with Nutch.
Thanks,
Tirthankar
Sent from handheld

----- Original Message -----
From: Tolga [mailto:tolga@ozses.net]
Sent: Wednesday, May 16, 2012 03:43 AM
To: solr-user@lucene.apache.org <so...@lucene.apache.org>; user@nutch.apache.org <us...@nutch.apache.org>
Subject: curl or nutch

Hi,

I have been trying for a week. I really want to get a start, so what 
should I use? curl or nutch? I want to be able to index pdf, xml etc. 
and search within them as well.

Regards,
******************Legal Disclaimer***************************
"This communication may contain confidential and privileged
material for the sole use of the intended recipient. Any
unauthorized review, use or distribution by others is strictly
prohibited. If you have received the message in error, please
advise the sender by reply email and delete the message. Thank
you."
*********************************************************
