Posted to user@nutch.apache.org by Andrew Naylor <na...@gmail.com> on 2011/08/15 04:41:11 UTC

desktop search

Any suggestions for the best way to get desktop search in the
Lucene/Solr/Nutch/Tika ecosystem?  I want to be able to access (from my own
program) lists of terms that are indexed and weights for each file, for
example, but if a filesystem indexer and index updater already exists
somewhere I'd like to use it rather than write my own.

I'm planning on working in Clojure, btw, not that that should make any
difference---

Thanks,

Andrew

Re: desktop search

Posted by Andrew Naylor <na...@gmail.com>.
> If immediate reindexing of modified documents is strictly required, you may
> need to drop Nutch and go for a stand-alone Solr with a lot of scripting and
> a file alteration monitor that works cross-platform.

Thanks Markus, I'll see if I really need that.  One thing I might do is
simply use an existing desktop indexer and just use Tika to parse files
(mostly I want to get a list of indexed terms).


Re: desktop search

Posted by Markus Jelsma <ma...@openindex.io>.
> The KDE thing is very interesting, thanks for the link!  I was hoping for
> something cross-platform though.

KDE is almost pure Qt, so most of it is cross-platform. You might want to check
with their mailing lists for details and feasibility.

> 
> As regards using Nutch: how would it handle file updates?  It seems to me a
> Web crawler would only get new files and changes on each crawl, whereas a
> desktop search engine like Spotlight for instance indexes a file as soon as
> it gets made or modified.

Nutch will crawl a (local) URL and increment a timestamp, the fetch time, by a
constant (30 days by default) or based on some algorithm. Once that time
arrives, the URL becomes eligible for refetch and all the associated processing.
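That refetch interval is an ordinary Nutch setting; a nutch-site.xml sketch
(the value shown is the stock default):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- Seconds before a fetched URL becomes eligible for refetch;
       2592000 s = 30 days -->
  <value>2592000</value>
</property>
```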

You can also hook up a file alteration monitor daemon that runs a script to
reindex a specific file in Solr. This cannot be done through Nutch: it will not
recrawl and index a URL that is not yet eligible for fetch.
This is not a big problem, as both Nutch and Solr use the Tika libs for
document parsing, but it may become one if the two use different versions or
if you have custom Nutch plugins.
In short: forced reindexing of a given URL cannot go through Nutch.

> 
> There's also this document I found on the Web: it describes some problems
> with using Nutch on the personal scale owing to its specialization for web
> crawling----it says there is a limit on files crawled per directory, and
> size of files crawled.  This was all I was able to find under "Nutch
> desktop search" in Google.  However, now that I look at it more closely
> it's from 2004, so it seems to me Nutch might have gotten rid of these
> problems in the interim....

There are limits indeed, but they are configurable: the number of outlinks
(which applies to directory listings as well), the maximum content length, and
so on.
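The limits in question are plain nutch-site.xml properties; a sketch (the
values shown lift the limits entirely, which may or may not be what you want):

```xml
<!-- Cap on outlinks followed per page (and per directory listing);
     -1 means no limit -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

<!-- Max bytes fetched per document over the file:// protocol;
     -1 means no limit -->
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```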

If immediate reindexing of modified documents is strictly required, you may
need to drop Nutch and go for a stand-alone Solr with a lot of scripting and
a file alteration monitor that works cross-platform.

Good luck


Re: desktop search

Posted by Andrew Naylor <na...@gmail.com>.
The KDE thing is very interesting, thanks for the link!  I was hoping for
something cross-platform though.

As regards using Nutch: how would it handle file updates?  It seems to me a
Web crawler would only get new files and changes on each crawl, whereas a
desktop search engine like Spotlight for instance indexes a file as soon as
it gets made or modified.

There's also this document I found on the Web: it describes some problems
with using Nutch on the personal scale owing to its specialization for web
crawling---it says there is a limit on files crawled per directory, and on the
size of files crawled.  This was all I was able to find under "Nutch desktop
search" in Google.  However, now that I look at it more closely it's from
2004, so it seems to me Nutch might have gotten rid of these problems in the
interim....

http://docs.google.com/viewer?a=v&q=cache:bDjjs__eYPcJ:www.commercenet.com/images/0/06/CN-TR-04-04.pdf+nutch+desktop+search&hl=en&gl=us&pid=bl&srcid=ADGEESg12Bq0VDGk3FpevwOHIdbfr1bCkEZ3CH1yojEliyfeCJv_3JhGRe1gMPx66LiywsUYFWJhKKzsLBVoCtATNcghrW4DRLWlT5sd4YhIWMVaQjMKs5xN-8vqTOHFV2pw9bzCtoQY&sig=AHIEtbTpxSL0xmZJxa5CWm8MzDWD4vyAAg

Thanks,

Andrew


Re: desktop search

Posted by Markus Jelsma <ma...@openindex.io>.
With Nutch you can crawl your FS with ease and index to a Solr instance. It'll 
surely work. But you may also be interested in the cool KDE technologies that 
are specifically built for desktop search.
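On the per-file term lists and weights you asked about: Solr can expose those
through its TermVectorComponent. A minimal sketch (the handler name and the
idea of a "content" field are illustrations, not something fixed):

```xml
<!-- In schema.xml the field must store term vectors, e.g.:
     <field name="content" type="text" indexed="true" stored="true"
            termVectors="true" termPositions="true" termOffsets="true"/> -->

<!-- In solrconfig.xml, register the component and a handler that uses it: -->
<searchComponent name="tvComponent"
                 class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh"
                class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
```

A request such as /tvrh?q=id:somedoc&tv.tf=true&tv.df=true&tv.tf_idf=true then
returns, for each matching document, its indexed terms with tf, df and tf-idf
weights.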

http://thomasmcguire.wordpress.com/2009/10/03/akonadi-nepomuk-and-strigi-explained/


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350