Posted to solr-user@lucene.apache.org by "Sadler, Anthony" <An...@yrgrp.com> on 2013/10/04 07:29:21 UTC

Indexing file system contents

Hi all:

I've had a quick look through the archives but am struggling to find a decent search query (a bad start to my solr career), so apologies if this has been asked multiple times before, as I'm sure it has.

We've got several windows file servers across several locations and we'd like to index their contents using Solr. So far we've come up with this setup:

- 1 Solr server with several collections, collections segregated by file security needs or line of business.
- At each remote site a linux machine has mounted the relevant local fileserver's filesystem via SMB/CIFS.
- That server is running a perl script written by yours truly that creates an XML index of all the files and then submits them to Solr for indexing. Content of files with certain extensions is indexed using Tika. Happy to post this script.

The script is fairly mature and has a few smarts in it, like being able to do delta updates (not in the solr sense of the word: It'll do a full scan of a file system then write out a timestamp. Next time it runs it only grabs files modified since that timestamp). This works... to a point. There are these problems:
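
For the curious, the delta part boils down to something like the sketch below (made-up paths, and heavily stripped down; the real script does a lot more around Tika, error handling and the actual Solr submission):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Made-up paths -- the real script takes these from its config.
my $root       = '/mnt/fileserver';
my $stamp_file = '/var/tmp/solr_last_run';

# Timestamp written out by the previous run; 0 forces a full scan.
my $last_run = 0;
if (open my $fh, '<', $stamp_file) {
    my $line = <$fh>;
    close $fh;
    $last_run = $1 if defined $line && $line =~ /(\d+)/;
}
my $this_run = time;

# Walk the share and keep anything modified since the last run.
my @changed;
find(sub {
    return unless -f $_;
    push @changed, $File::Find::name if (stat(_))[9] > $last_run;
}, $root);

# ... feed @changed through Tika and post the resulting docs to Solr ...

# Only persist the new timestamp once the run has finished cleanly.
open my $out, '>', $stamp_file or die "cannot write $stamp_file: $!";
print {$out} "$this_run\n";
close $out;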

---------------------------------------------------------------------------------------------------------------------------------------

Time:
-----
On some servers we're dealing with something in the region of a million or more files. Indexing that many files takes upwards of 48 hours. While the script is now fairly stable and fault tolerant, that is still a pretty long time. Part of the reason for the slowness is the content indexing by Tika, but I've been unable to find a satisfactory alternative. We could drop the whole content thing, but then what's the point? Half the beauty of solr/tika is that we >can< do it.

Projecting from some averages, it'd take the better part of a week to index one of our file servers.

Deletes:
--------
As explained above, once the initial scan takes place all activity thereafter is limited to files that have changed since $last_run_time. However this presents a problem in that if a file gets deleted from the file server, we're still going to see it in the search results. There are a few ways that I can see to get rid of these stale files, but they either won't work or are evil:

- Re-index the server. Evil because it'll take half a week.
- Use some filesystem watcher to watch for deletes. Won't work because we're using a SMB/CIFS share mount.
- Periodically list all the files on the fileserver, diff that against all the files stored in Solr and delete the differences from Solr, thereby syncing the two. Evil because... well it just is. I'd be asking Solr for every record it has, which'll be a doozy of a return variable. Surely there has to be a more elegant way?
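
To put some flesh on that third option: listing what Solr has doesn't have to come back as one giant response, since you can page through just the ids. Something like this sketch (collection URL, page size and the assumption that the unique id is the file path are all made up for illustration):

use strict;
use warnings;
use LWP::UserAgent;
use JSON;

# Made-up collection URL and page size.
my $solr = 'http://solrserver:8983/solr/collection1';
my $rows = 10000;
my $ua   = LWP::UserAgent->new;

# Pull back every document id, one page at a time, into a hash.
my %in_solr;
my ($start, $total) = (0, 0);
do {
    my $resp = $ua->get("$solr/select?q=*:*&fl=id&wt=json&rows=$rows&start=$start");
    die $resp->status_line unless $resp->is_success;
    my $json = decode_json($resp->content);
    $total = $json->{response}{numFound};
    $in_solr{ $_->{id} } = 1 for @{ $json->{response}{docs} };
    $start += $rows;
} while ($start < $total);

# Walk the share, drop %in_solr entries for files that still exist,
# and whatever is left in %in_solr gets a <delete> posted to Solr.

Deep paging like this gets slower as start grows, but at least each individual response stays a manageable size.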

Security:
---------
We've worked around this by not indexing some files or separating out into various collections. As such it is not a huge problem, but has anyone figured out how to integrate Solr with LDAP?

---------------------------------------------------------------------------------------------------------------------------------------

DIH:
----
Someone will reasonably ask why we're not using the DIH. I tried using that but found the following:
- It would crash.
- When I stopped it crashing by using the on-error stuff, both in the Tika subsection and the main part of the DIH config, it still crashed with a java-out-of-memory error.
- I gave java more memory but it still crashed.

At that point I gave up for the following reasons:
- DIH and I were not getting along.
- Java and I were not getting along.
- Java and DIH were not getting along.
- All the doco I could find was either really basic or really advanced... there was no intermediate stuff as far as I could find.
- I realised that I could do what I wanted to do better using perl than I could with DIH, and this seemed a better solution.

The perl script has, by and large, been a success. However, we've run up against the above problems.

Which now leads me to my ultimate question. Surely other people have been in this same situation. How did they solve these issues? Is the slow indexing time simply a function of the large dataset we're wanting to index? Do we need to throw more oomph at the servers?

The more I play with Solr, the more I realise I need to learn and the more I realise I'm way out of my depth, hence this email.

Thanks

Anthony



Re: Indexing file system contents

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/3/2013 11:29 PM, Sadler, Anthony wrote:
> Time:
> -----
> On some servers we're dealing with something in the region of a million or more files. Indexing that many files takes upwards of 48 hours. While the script is now fairly stable and fault tolerant, that is still a pretty long time. Part of the reason for the slowness is the content indexing by Tika, but I've been unable to find a satisfactory alternative. We could drop the whole content thing, but then what's the point? Half the beauty of solr/tika is that we >can< do it.
> 
> Projecting from some averages, it'd take the better part of a week to index one of our file servers.

You might have already thought of this, and even done testing that
proved it wasn't much of a problem, but some of it might be network
latency in dealing with the SMB filesystem.  Not necessarily data
transfer time, but the latency of finding files and navigating the
directory structure.  I'll be getting back to this in a moment.

> Deletes:
> --------
> As explained above, once the initial scan takes place all activity thereafter is limited to files that have changed since $last_run_time. However this presents a problem in that if a file gets deleted from the file server, we're still going to see it in the search results. There are a few ways that I can see to get rid of these stale files, but they either won't work or are evil:
> 
> - Re-index the server. Evil because it'll take half a week.
> - Use some filesystem watcher to watch for deletes. Won't work because we're using a SMB/CIFS share mount.
> - Periodically list all the files on the fileserver, diff that against all the files stored in Solr and delete the differences from Solr, thereby syncing the two. Evil because... well it just is. I'd be asking Solr for every record it has, which'll be a doozy of a return variable. Surely there has to be a more elegant way?

I would recommend removing all unix-isms from your perl script so that
it is pure perl, and then running it directly on each Windows
fileserver rather than on your Solr server.  What gets indexed for
adds, changes, and deletes could be driven by a locally running service
that can watch the filesystem.

This would pretty much eliminate SMB network latency for filesystem
navigation.  You also would not need the SMB mounts, the only traffic
would be relatively short-lived HTTP transactions on the Solr port.
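
Just to illustrate how small that traffic is (a generic sketch, not your
script; host, collection and field names are made up), an add or a delete
is nothing more than a short POST of XML to /update:

use strict;
use warnings;
use LWP::UserAgent;

my $ua     = LWP::UserAgent->new;
my $update = 'http://solrserver:8983/solr/collection1/update';  # made up

# Add (or replace) one document.
my $add = <<'XML';
<add>
  <doc>
    <field name="id">/mnt/fileserver/docs/report.docx</field>
    <field name="last_modified">2013-10-01T12:00:00Z</field>
    <field name="content">...text extracted by Tika...</field>
  </doc>
</add>
XML
my $resp = $ua->post($update, Content_Type => 'text/xml', Content => $add);
die $resp->status_line unless $resp->is_success;

# Remove a document whose file has disappeared, then commit.
$ua->post($update, Content_Type => 'text/xml',
          Content => '<delete><id>/mnt/fileserver/docs/old.docx</id></delete>');
$ua->post($update, Content_Type => 'text/xml', Content => '<commit/>');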

> Security:
> ---------
> We've worked around this by not indexing some files or separating out into various collections. As such it is not a huge problem, but has anyone figured out how to integrate Solr with LDAP?

Earlier today I responded to a post on this list asking about security.
 The context of the question is very different from yours, but the
spirit of the reply applies here too.  Basically, Solr doesn't do security.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3C524DB8E5.3090903%40elyograg.org%3E

If you were running the perl script on the Windows machines directly,
you would probably have more options to control what can be accessed.

Thanks,
Shawn


Re: Indexing file system contents

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
No direct help but a bunch of related random thoughts:

1) How are you running Tika? As a jar, loading from scratch every time? Tika
can also run in a server mode where it listens on a network socket: you send
it the file, it sends the extracted text back. Might be faster (see the first
sketch after these points).

2) Deleting old stuff. You can index into a new core and then swap the
cores out. Heavy on the server, but the client will not notice. Or just
reindex into the same core but stamp each document with an index-time
timestamp, then delete with a query for old timestamps (anything not
reindexed) - see the second sketch after these points.

3) DIH is ok, but getting long in the tooth, and you are kind of supposed to
grow out of it. Maybe look at Flume for a more modern take.

4) Security: Maybe ManifoldCF has something you can use:
http://projects.apache.org/projects/manifoldcf.html
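
To make point 1 concrete: tika-server (e.g. java -jar tika-server-1.4.jar,
which listens on port 9998 by default) takes a PUT of the raw file and hands
back plain text. Rough sketch, file path made up:

use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;   # needs an LWP recent enough to have ->put
my $file = '/mnt/fileserver/docs/report.pdf';   # made up

# Slurp the raw bytes of the document.
open my $fh, '<:raw', $file or die "cannot open $file: $!";
my $bytes = do { local $/; <$fh> };
close $fh;

# PUT the bytes at the running tika-server and ask for plain text back.
my $resp = $ua->put('http://localhost:9998/tika',
                    'Accept' => 'text/plain',
                    Content  => $bytes);
die $resp->status_line unless $resp->is_success;
my $text = $resp->decoded_content;   # becomes the content field of the Solr doc

And a rough sketch of point 2. The field name indexed_at is made up; the idea
is that every document gets stamped with the time it was (re)indexed, so after
a full pass anything still carrying an older stamp was not seen on the share
and one delete-by-query removes it:

use strict;
use warnings;
use LWP::UserAgent;
use POSIX qw(strftime);

my $ua     = LWP::UserAgent->new;
my $update = 'http://solrserver:8983/solr/collection1/update';   # made up

# Record when this full pass started, in Solr's ISO 8601 / UTC format.
my $pass_started = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($^T));

# ... index every file found on the share; each doc gets its
#     indexed_at field set to the time it was (re)indexed ...

# Anything stamped before the start of this pass was not seen, so it
# no longer exists on the share: delete it and commit.
my $del = "<delete><query>indexed_at:{* TO $pass_started}</query></delete>";
$ua->post($update, Content_Type => 'text/xml', Content => $del);
$ua->post($update, Content_Type => 'text/xml', Content => '<commit/>');

The catch is that this only cleans up after a full pass, not after a delta run.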

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

