Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/07/30 14:22:19 UTC

Nutch and Solr

I'm trying to follow the example in the Wiki, but it's corrupt.  It
has a bunch of garbage in the part you're supposed to paste into
solrconfig.xml - I don't know if something got interpreted as wiki
markup when it shouldn't, or what, but I doubt superscripts are a
normal part of the configuration.

Can somebody please tell me what I'm supposed to do there?

-- 
http://www.linkedin.com/in/paultomblin

Re: how to exclude some external links

Posted by al...@aim.com.
Hi,

The plugin is enabled in the nutch-default.xml file, but changes in it did not affect the search. Changes in crawl-urlfilter.txt, however, do change which links are fetched.

Thanks.
Alex.


 


Re: how to exclude some external links

Posted by Paul Tomblin <pt...@xcski.com>.
On Thu, Jul 30, 2009 at 9:15 PM, <al...@aim.com> wrote:

> I would like to know how I can modify the Nutch code to exclude external links with certain extensions. For example, if I have mydomain.com in urls, and mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want Nutch not to fetch (crawl) these kinds of URLs at all.

Can't you do this with the existing RegexURLFilter plugin?  Make sure
urlfilter-regex is listed in plugin.includes, and that the property
urlfilter.regex.file is set to a file (probably regex-urlfilter.txt).
Then you can list the extensions you want to skip in that file.

-- 
http://www.linkedin.com/in/paultomblin
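
To make the rule file idea above concrete, here is a minimal sketch in Python (not Nutch code) of how rules in the style of regex-urlfilter.txt are evaluated: each rule is a regular expression prefixed with '+' (accept) or '-' (reject), the first rule that matches decides, and a URL that matches nothing is rejected. The rule text, the extension list, and the function names parse_rules and accepts are made up for illustration; check them against the regex-urlfilter.txt shipped with your Nutch version.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# '-' rejects, '+' accepts, '#' starts a comment line.
RULES_TEXT = r"""
# skip URLs ending in these extensions (example list, adjust as needed)
-\.(shtml|gif|jpg)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(RULES_TEXT)
for url in ("http://mydomain.com/mylink.shtml", "http://mydomain.com/index.html"):
    print(url, "->", "fetch" if accepts(RULES, url) else "skip")
```

Because the first match wins, the reject rule for the extensions has to come before the catch-all accept rule.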

how to exclude some external links

Posted by al...@aim.com.
 

Hi,

I would like to know how I can modify the Nutch code to exclude external links with certain extensions. For example, if I have mydomain.com in urls, and mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want Nutch not to fetch (crawl) these kinds of URLs at all.




Thanks
Alex.





 


Re: Nutch in C++

Posted by "pepone.onrez" <pe...@gmail.com>.
Hi

I think KDE would be a good choice for C++ developers.

http://kde.org/getinvolved/development/

On Tue, Aug 4, 2009 at 6:43 PM, Otis Gospodnetic<og...@yahoo.com> wrote:
> Possibly, yes.
> See http://code.google.com/
> See http://www.sf.net/
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: "alxsss@aim.com" <al...@aim.com>
>> To: nutch-user@lucene.apache.org
>> Sent: Tuesday, August 4, 2009 12:36:19 PM
>> Subject: Re: Nutch in C++
>>
>>
>> Thanks for your comments. Is there anything I could code in C++ that the open source
>> community could benefit from?
>>
>> Alex.
>>
>> -----Original Message-----
>> From: Otis Gospodnetic
>> To: nutch-user@lucene.apache.org
>> Sent: Tue, Aug 4, 2009 6:54 am
>> Subject: Re: Nutch in C++
>>
>> That's exactly right. :)
>>
>> Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>> > From: Iain Downs
>> > To: nutch-user@lucene.apache.org
>> > Sent: Tuesday, August 4, 2009 4:08:18 AM
>> > Subject: RE: Nutch in C++
>> >
>> > I think there is probably a sub text here (I'm putting words in Otis' mouth,
>> > for which my apologies).
>> >
>> > ' Yes, you could rewrite Nutch in C++ and have that use CLucene.'  But you'd
>> > be mad to do so!
>> >
>> > I'm a bit out of date with Nutch, but it's large.  And Java to C++ is not an
>> > easy conversion because of the different memory management systems.
>> >
>> > And why?  I guess you may see some performance improvement, but it would be
>> > a LOT cheaper to throw hardware at the problem (and you may not see much if
>> > any).
>> >
>> > So if you have a few months to spare ....
>> >
>> >
>> > Iain
>> >
>> > -----Original Message-----
>> > From: Otis Gospodnetic [mailto:ogjunk-nutch@yahoo.com]
>> > Sent: 04 August 2009 04:49
>> > To: nutch-user@lucene.apache.org
>> > Subject: Re: Nutch in C++
>> >
>> > CLucene is just like Lucene (except a few versions behind), but written in
>> > C++.
>> >
>> > Yes, you could rewrite Nutch in C++ and have that use CLucene.
>> >
>> > Otis
>> > --
>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: "alxsss@aim.com"
>> > > To: nutch-user@lucene.apache.org
>> > > Sent: Monday, August 3, 2009 2:29:40 PM
>> > > Subject: Re: Nutch in C++
>> > >
>> > > Hi,
>> > >
>> > > I know Nutch uses Lucene. But what is CLucene for, then? Only for indexing
>> > > files on a hard drive?
>> > >
>> > > I have knowledge of C++ and some experience. I wanted to code the crawler of
>> > > Nutch in C++ to get more experience and make it open source, but only if it
>> > > will be useful for the open source community.
>> > > My goal is to get more experience in C++ and make a contribution to open
>> > > source. If you know other projects that may be more useful, please let me know.
>> > >
>> > > thanks.
>> > > Alex.
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Otis Gospodnetic
>> > > To: nutch-user@lucene.apache.org
>> > > Sent: Sun, Aug 2, 2009 8:15 pm
>> > > Subject: Re: Nutch in C++
>> > >
>> > > Nutch uses Lucene (Java), not CLucene (C++).
>> > >
>> > > Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.
>> > >
>> > > Otis
>> > > --
>> > > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> > >
>> > >
>> > >
>> > > ----- Original Message ----
>> > > > From: "alxsss@aim.com"
>> > > > To: nutch-user@lucene.apache.org
>> > > > Sent: Thursday, July 30, 2009 3:13:16 PM
>> > > > Subject: Nutch in C++
>> > > >
>> > > > Hi,
>> > > >
>> > > > As I understand it, only the indexing part of Nutch is in C++, as CLucene. I
>> > > > want to code Nutch in C++, but only if it is worth doing. I wondered if it is
>> > > > worth coding the remaining parts of Nutch in C++, let's say the crawler. Can
>> > > > someone give me directions on where to start?
>> > > >
>> > > > Thanks
>> > > > Alex.
>
>

Re: Nutch in C++

Posted by Otis Gospodnetic <og...@yahoo.com>.
Possibly, yes.
See http://code.google.com/
See http://www.sf.net/

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "alxsss@aim.com" <al...@aim.com>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, August 4, 2009 12:36:19 PM
> Subject: Re: Nutch in C++
> 
> 
> Thanks for your comments. Is there anything I could code in C++ that the open source
> community could benefit from?
> 
> Alex.


Re: Nutch in C++

Posted by al...@aim.com.
Thanks for your comments. Is there anything I could code in C++ that the open source community could benefit from?

Alex.


 


 


Re: Nutch in C++

Posted by Otis Gospodnetic <og...@yahoo.com>.
That's exactly right. :)

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Iain Downs <ia...@idcl.co.uk>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, August 4, 2009 4:08:18 AM
> Subject: RE: Nutch in C++
> 
> I think there is probably a sub text here (I'm putting words in Otis' mouth,
> for which my apologies).
> 
> ' Yes, you could rewrite Nutch in C++ and have that use CLucene.'  But you'd
> be mad to do so!
> 
> I'm a bit out of date with Nutch, but it's large.  And Java to C++ is not an
> easy conversion because of the different memory management systems.
> 
> And why?  I guess you may see some performance improvement, but it would be
> a LOT cheaper to throw hardware at the problem (and you may not see much if
> any).
> 
> So if you have a few months to spare ....
> 
> 
> Iain


Re: Nutch in C++

Posted by "pepone.onrez" <pe...@gmail.com>.
Most probably memory usage would be reduced by using C++, and this would
allow it to run on cheaper hardware. Whether Nutch is network IO bound will
depend on your network connection and storage hardware. You could move
the bottleneck elsewhere by getting a faster network and faster
storage hardware.

In any case, I don't think this is worth doing.

On Tue, Aug 4, 2009 at 7:33 PM, Paul Tomblin<pt...@xcski.com> wrote:
> On Tue, Aug 4, 2009 at 1:35 PM, reinhard schwab<re...@aon.at> wrote:
>>> And why?  I guess you may see some performance improvement, but it would be
>>> a LOT cheaper to throw hardware at the problem (and you may not see much if
>>> any).
>>>
>> performance improvement?
>> can you proove that c++ will be faster?
>
> Considering that Nutch is mostly network IO bound, rewriting it in a
> different language isn't going to make the Internet serve up your
> pages faster.
>
> --
> http://www.linkedin.com/in/paultomblin
>

Re: Nutch in C++

Posted by Paul Tomblin <pt...@xcski.com>.
On Tue, Aug 4, 2009 at 1:35 PM, reinhard schwab<re...@aon.at> wrote:
>> And why?  I guess you may see some performance improvement, but it would be
>> a LOT cheaper to throw hardware at the problem (and you may not see much if
>> any).
>>
> performance improvement?
> can you proove that c++ will be faster?

Considering that Nutch is mostly network IO bound, rewriting it in a
different language isn't going to make the Internet serve up your
pages faster.

-- 
http://www.linkedin.com/in/paultomblin
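
The IO-bound argument above can be put into rough numbers with Amdahl's law: only the CPU-bound share of a crawl gets faster when the code does. The sketch below is a back-of-the-envelope estimate, not a measurement of Nutch. The CPU fractions are assumed values for illustration, the 2x figure is the upper end of the guess quoted earlier in this thread, and the function name overall_speedup is made up.

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Amdahl's law: only the CPU-bound fraction of the run time shrinks.

    cpu_fraction: share of wall-clock time spent on CPU work (0.0 to 1.0)
    cpu_speedup:  how many times faster that CPU work becomes
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


# Assumed CPU shares of a crawl; the rest is time spent waiting on the network.
for cpu_fraction in (0.1, 0.3, 0.5):
    gain = overall_speedup(cpu_fraction, 2.0)
    print(f"CPU share {cpu_fraction:.0%}: overall speedup {gain:.2f}x")
```

If the crawl really does spend most of its time waiting on remote servers, even doubling the speed of the code moves the total by only a few percent, which is the point being made above.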

pagination of rss results

Posted by al...@aim.com.
Hello,

I am trying to paginate results obtained using OpenSearch RSS. To do this I need totalResults, which comes in the RSS feed as

<opensearch:totalResults>100</opensearch:totalResults>

However, I do not see this part of the feed in the results of PHP's simplexml_load_file. Does anyone know how to get totalResults from this feed in a PHP application?

Thanks in advance.
Alex.
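
A likely cause of the missing element is that totalResults lives in the opensearch XML namespace, and many XML APIs only return namespaced elements when the namespace is given explicitly. The sketch below shows the idea in Python rather than PHP. The namespace URI and the sample feed are assumptions for illustration (check the xmlns:opensearch attribute in your own feed), and the function names total_results and page_count are made up. In PHP's SimpleXML the equivalent step is, as far as I know, to request the namespaced children through the children() method.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; use the value of xmlns:opensearch from the real feed.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearchrss/1.0/"

# Made-up sample feed, trimmed to the elements needed for pagination.
SAMPLE_FEED = f"""<rss version="2.0" xmlns:opensearch="{OPENSEARCH_NS}">
  <channel>
    <title>Sample search results</title>
    <opensearch:totalResults>100</opensearch:totalResults>
    <opensearch:startIndex>0</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
  </channel>
</rss>"""


def total_results(feed_xml):
    """Read opensearch:totalResults by its fully qualified (namespaced) name."""
    root = ET.fromstring(feed_xml)
    node = root.find("channel/{%s}totalResults" % OPENSEARCH_NS)
    if node is None or node.text is None:
        raise ValueError("feed has no opensearch:totalResults element")
    return int(node.text)


def page_count(total, per_page):
    """Number of pages needed to show `total` hits, rounding up."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return -(-total // per_page)


total = total_results(SAMPLE_FEED)
print("total results:", total, "pages:", page_count(total, 10))
```

Looking the element up without the namespace (a plain "channel/totalResults" path) finds nothing, which matches the symptom described in the question.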




Re: Nutch in C++

Posted by Lukáš Vlček <lu...@gmail.com>.
Hi,
Look at HBase vs Hypertable. Both are implementations of the same concept
(BigTable). HBase is in Java, Hypertable is in C++. Search the web and you
can find tons of flame discussions. I am not sure one can really say that
one implementation is superior to the other, mainly because both projects
are still very young and each community focuses on different implementation
priorities. Could Nutch benefit from similar flame wars? Wouldn't it be more
of a waste of energy? A pragmatic approach would be to identify bottlenecks
in the current Nutch code and try to improve its Java implementation. If
this is not possible, and a C++ implementation of critical functionality
would provide a significant overall performance boost, then this can be a
valid justification...

Regards,
Lukas

http://blog.lukas-vlcek.com/


On Wed, Aug 5, 2009 at 12:45 AM, Iain Downs <ia...@idcl.co.uk> wrote:

> I wasn't advocating this.
>
> ' (and you may not see much if any)'.
>
> Comparisons of managed languages vs C++ seem to have widely varied results.
> Some claim the managed language is faster, some that it is slower.
>
> The simple tests I've done with C# (which is sort of like java but faster
> ... [no flames please.  I don't really care if this statement is true or
> not!]) make me think C++ is 1.5 to 2 times faster for array intensive work -
> mainly because it checks the bounds a lot.  And I would guess that some of
> Nutch falls into this category, but by no means all.
>
> Personally, I would guess that you could get some 10-20 percent higher
> throughput if Nutch and Lucene were all native C++.  But then you would have
> taken twice as long to write the code.
>
> And I find writing in managed languages (Java, .net) so much less
> frustrating and so much more productive, that any small performance gains
> are irrelevant!
>
> Iain

RE: Nutch in C++

Posted by Iain Downs <ia...@idcl.co.uk>.
I wasn't advocating this.

' (and you may not see much if any)'.

Comparisons of managed languages vs C++ seem to have widely varied results.
Some claim the managed language is faster, some that it is slower.

The simple tests I've done with C# (which is sort of like Java but faster
... [no flames please.  I don't really care if this statement is true or
not!]) make me think C++ is 1.5 to 2 times faster for array-intensive work,
mainly because the managed runtime checks array bounds a lot.  And I would
guess that some of Nutch falls into this category, but by no means all.

Personally, I would guess that you could get some 10-20 percent higher
throughput if Nutch and Lucene were all native C++.  But then you would have
taken twice as long to write the code.

And I find writing in managed languages (Java, .net) so much less
frustrating and so much more productive, that any small performance gains
are irrelevant!

Iain



Re: Nutch in C++

Posted by reinhard schwab <re...@aon.at>.
Iain Downs wrote:
> I think there is probably a sub text here (I'm putting words in Otis' mouth,
> for which my apologies).
>
> ' Yes, you could rewrite Nutch in C++ and have that use CLucene.'  But you'd
> be mad to do so!
>
> I'm a bit out of date with Nutch, but it's large.  And Java to C++ is not an
> easy conversion because of the different memory management systems.
>
> And why?  I guess you may see some performance improvement, but it would be
> a LOT cheaper to throw hardware at the problem (and you may not see much if
> any).
>   
performance improvement?
can you prove that c++ will be faster?
> So if you have a few months to spare ....
>
>
> Iain
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:ogjunk-nutch@yahoo.com] 
> Sent: 04 August 2009 04:49
> To: nutch-user@lucene.apache.org
> Subject: Re: Nutch in C++
>
> CLucene is just like Lucene (except a few versions behind), but written in
> C++.
>
> Yes, you could rewrite Nutch in C++ and have that use CLucene.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>   
>> From: "alxsss@aim.com" <al...@aim.com>
>> To: nutch-user@lucene.apache.org
>> Sent: Monday, August 3, 2009 2:29:40 PM
>> Subject: Re: Nutch in C++
>>
>>
>>
>>
>>
>> Hi,
>>
>> I know nutch uses Lucene. But for what is Clucene then? Only for indexing
>>     
> files 
>   
>> in a hard drive?
>>
>>
>> I have knowledge of C++ and some experience. I wanted to code crawler of
>>     
> Nutch 
>   
>> in C++ to get more experience and make it open source, only if it l be
>>     
> useful 
>   
>> for the open source community.
>> My goal is to get more experience in C++ and make? contribution to open
>>     
> source. 
>   
>> If you know other projects that may be more useful, please let me know.
>>
>> thanks.
>> Alex.
>>
>>
>> -----Original Message-----
>> From: Otis Gospodnetic 
>> To: nutch-user@lucene.apache.org
>> Sent: Sun, Aug 2, 2009 8:15 pm
>> Subject: Re: Nutch in C++
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Nutch uses Lucene (Java), not CLucene (C++).
>>
>> Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.
>>
>> Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>>     
>>> From: "alxsss@aim.com" 
>>> To: nutch-user@lucene.apache.org
>>> Sent: Thursday, July 30, 2009 3:13:16 PM
>>> Subject: Nutch in C++
>>>
>>> Hi,
>>>
>>> As I understood, only the indexing part of Nutch is in C++, as CLucene. I
>>> want to code Nutch in C++, only if it is worth doing that. I wondered if it
>>> is worth coding the remaining parts of Nutch in C++, let's say the crawler.
>>> Can someone give me directions on where to start?
>>>
>>> Thanks
>>> Alex.
>>>       
>
>
>   


RE: Nutch in C++

Posted by Iain Downs <ia...@idcl.co.uk>.
I think there is probably a subtext here (I'm putting words in Otis' mouth,
for which my apologies).

'Yes, you could rewrite Nutch in C++ and have that use CLucene.'  But you'd
be mad to do so!

I'm a bit out of date with Nutch, but it's large.  And Java to C++ is not an
easy conversion because of the different memory management systems.

And why?  I guess you may see some performance improvement, but it would be
a LOT cheaper to throw hardware at the problem (and you may not see much if
any).

So if you have a few months to spare ....


Iain

-----Original Message-----
From: Otis Gospodnetic [mailto:ogjunk-nutch@yahoo.com] 
Sent: 04 August 2009 04:49
To: nutch-user@lucene.apache.org
Subject: Re: Nutch in C++

CLucene is just like Lucene (except a few versions behind), but written in
C++.

Yes, you could rewrite Nutch in C++ and have that use CLucene.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "alxsss@aim.com" <al...@aim.com>
> To: nutch-user@lucene.apache.org
> Sent: Monday, August 3, 2009 2:29:40 PM
> Subject: Re: Nutch in C++
> 
> 
> 
> 
> 
> Hi,
> 
> I know Nutch uses Lucene. But what is CLucene for, then? Only for indexing
> files on a hard drive?
> 
> 
> I have knowledge of C++ and some experience. I wanted to code the crawler of
> Nutch in C++ to get more experience and make it open source, only if it'll be
> useful for the open source community.
> My goal is to get more experience in C++ and make a contribution to open
> source. If you know other projects that may be more useful, please let me know.
> 
> thanks.
> Alex.
> 
> 
> -----Original Message-----
> From: Otis Gospodnetic 
> To: nutch-user@lucene.apache.org
> Sent: Sun, Aug 2, 2009 8:15 pm
> Subject: Re: Nutch in C++
> 
> Nutch uses Lucene (Java), not CLucene (C++).
> 
> Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
> > From: "alxsss@aim.com" 
> > To: nutch-user@lucene.apache.org
> > Sent: Thursday, July 30, 2009 3:13:16 PM
> > Subject: Nutch in C++
> > 
> > Hi,
> > 
> > As I understood, only the indexing part of Nutch is in C++, as CLucene. I
> > want to code Nutch in C++, only if it is worth doing that. I wondered if it
> > is worth coding the remaining parts of Nutch in C++, let's say the crawler.
> > Can someone give me directions on where to start?
> > 
> > Thanks
> > Alex.


Re: Nutch in C++

Posted by Otis Gospodnetic <og...@yahoo.com>.
CLucene is just like Lucene (except a few versions behind), but written in C++.

Yes, you could rewrite Nutch in C++ and have that use CLucene.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "alxsss@aim.com" <al...@aim.com>
> To: nutch-user@lucene.apache.org
> Sent: Monday, August 3, 2009 2:29:40 PM
> Subject: Re: Nutch in C++
> 
> 
> 
> 
> 
> Hi,
> 
> I know Nutch uses Lucene. But what is CLucene for, then? Only for indexing files 
> on a hard drive?
> 
> 
> I have knowledge of C++ and some experience. I wanted to code the crawler of Nutch 
> in C++ to get more experience and make it open source, only if it'll be useful 
> for the open source community.
> My goal is to get more experience in C++ and make a contribution to open source. 
> If you know other projects that may be more useful, please let me know.
> 
> thanks.
> Alex.
> 
> 
> -----Original Message-----
> From: Otis Gospodnetic 
> To: nutch-user@lucene.apache.org
> Sent: Sun, Aug 2, 2009 8:15 pm
> Subject: Re: Nutch in C++
> 
> Nutch uses Lucene (Java), not CLucene (C++).
> 
> Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
> > From: "alxsss@aim.com" 
> > To: nutch-user@lucene.apache.org
> > Sent: Thursday, July 30, 2009 3:13:16 PM
> > Subject: Nutch in C++
> > 
> > Hi,
> > 
> > As I understood, only the indexing part of Nutch is in C++, as CLucene. I want to 
> > code Nutch in C++, only if it is worth doing that. I wondered if it is 
> > worth coding the remaining parts of Nutch in C++, let's say the crawler. Can 
> > someone give me directions on where to start?
> > 
> > Thanks
> > Alex.


Re: Nutch in C++

Posted by al...@aim.com.
 


Hi,

I know Nutch uses Lucene. But what is CLucene for, then? Only for indexing files on a hard drive?

I have knowledge of C++ and some experience. I wanted to code the crawler of Nutch in C++ to get more experience and make it open source, only if it'll be useful for the open source community.
My goal is to get more experience in C++ and make a contribution to open source. If you know other projects that may be more useful, please let me know.

thanks.
Alex.


-----Original Message-----
From: Otis Gospodnetic <og...@yahoo.com>
To: nutch-user@lucene.apache.org
Sent: Sun, Aug 2, 2009 8:15 pm
Subject: Re: Nutch in C++










Nutch uses Lucene (Java), not CLucene (C++).

Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "alxsss@aim.com" <al...@aim.com>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, July 30, 2009 3:13:16 PM
> Subject: Nutch in C++
> 
> Hi,
> 
> As I understood, only the indexing part of Nutch is in C++, as CLucene. I want to 
> code Nutch in C++, only if it is worth doing that. I wondered if it is 
> worth coding the remaining parts of Nutch in C++, let's say the crawler. Can 
> someone give me directions on where to start?
> 
> Thanks
> Alex.




 


Re: Nutch in C++

Posted by Otis Gospodnetic <og...@yahoo.com>.
Nutch uses Lucene (Java), not CLucene (C++).

Why are you looking to rewrite Nutch in C++ anyway?  Sounds scary.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "alxsss@aim.com" <al...@aim.com>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, July 30, 2009 3:13:16 PM
> Subject: Nutch in C++
> 
> Hi,
> 
> As I understood, only the indexing part of Nutch is in C++, as CLucene. I want to 
> code Nutch in C++, only if it is worth doing that. I wondered if it is 
> worth coding the remaining parts of Nutch in C++, let's say the crawler. Can 
> someone give me directions on where to start?
> 
> Thanks
> Alex.


Nutch in C++

Posted by al...@aim.com.
Hi,

As I understood, only the indexing part of Nutch is in C++, as CLucene. I want to code Nutch in C++, only if it is worth doing that. I wondered if it is worth coding the remaining parts of Nutch in C++, let's say the crawler. Can someone give me directions on where to start?

Thanks
Alex.