Posted to general@lucene.apache.org by Veselin K <ve...@campbell-lange.net> on 2009/04/05 14:09:26 UTC

Re: Indexing local PDF/Doc/XLS files with Solr?

Hello, I think the latest tarball worked for me out of the box.

I'm trying to design my schema at present.
My goal is to index PDF/Doc/XLS files with the following fields:

0. ID number
1. Filename
2. File path
3. Modification date
4. File contents 
5. Number of pages
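
As a rough starting point, I was imagining schema.xml entries along these
lines (just a sketch based on the stock example schema types; the field
names are placeholders, not something I have working yet):

    <field name="id"            type="string" indexed="true" stored="true" required="true"/>
    <field name="filename"      type="string" indexed="true" stored="true"/>
    <field name="filepath"      type="string" indexed="true" stored="true"/>
    <field name="last_modified" type="date"   indexed="true" stored="true"/>
    <field name="content"       type="text"   indexed="true" stored="true"/>
    <field name="num_pages"     type="sint"   indexed="true" stored="true"/>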

- Any tips on what field types I should use to get this data indexed?

- Is there a way to have Solr increment the ID number automatically
  each time a document is added to the index?

- Would I be able to extract all of the information above using just the
  Solr/Tika features, or would I have to source every value except
  "file contents" myself and pass them to Solr when indexing?


Thank you very much.

Regards,
Veselin K


On Sat, Dec 27, 2008 at 09:29:05PM -0500, Grant Ingersoll wrote:
> Can you provide details about the part of the examples that weren't  
> clear?  Perhaps I can clean up the docs or help you figure it out.
>
> -Grant
>
> On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:
>
>> Hello,
>> I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.
>>
>> Could you please advise me of any other instructions/HowTos on
>> integrating Tika or maybe RichDocumentHandler with Solr that I can
>> find online, apart from the Solr Wiki? Following those examples did
>> not help in my case.
>>
>>
>> Thank you.
>>
>> Veselin K.
>>
>>
>> On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
>>> Thank you Erik, Hoss.
>>>
>>> - If using either Solr's "stream.file" or Nutch's crawler,
>>>  what is the procedure for adding new files?
>>>  That is to say, if I did not know which files in a specific folder
>>>  were new and thus passed all files to Solr/Nutch, would it
>>>  skip the ones that have already been indexed?
>>>
>>> - Also, if a file gets modified, would Solr/Nutch detect
>>>  the change and re-index just the modified file?
>>>  Or should some kind of cache be cleared and everything re-indexed?
>>>
>>> - In order to provide the user with an option to search the indexes
>>>  of two separate Solr/Nutch servers, do I need to link both servers
>>>  somehow and join their indexes into one, or is it just a question of
>>>  designing the web front-end so that it offers the choice to send
>>>  your search query to one or multiple different servers?
>>>
>>>
>>> Thank you,
>>> Veselin K
>>>
>>>
>>> On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
>>>>
>>>> : the easiest way to get rolling.  A simple script that recurses
>>>> : your folders and issues a simple request posting each file in
>>>> : turn to Solr will give you a full text searchable index in no
>>>> : time (well, ok, it'll take a little time, but it'll be as fast
>>>> : as anything else out there).
>>>>
>>>> if all the files are "local" on the machine that Solr is running on,
>>>> you don't even need to POST them; Solr can be configured to read
>>>> the files by local filename using the "stream.file" param...
>>>>
>>>> 	http://wiki.apache.org/solr/ContentStream
>>>>
>>>> that said: if your fileserver implementation already exposes all of
>>>> the files over HTTP, then using Nutch and its crawler might be an
>>>> easier way to get started on indexing all of them ... hard to say
>>>> without being in your shoes.  you may want to experiment with both.
>>>>
>>>>
>>>>
>>>> -Hoss
>>>>
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Indexing local PDF/Doc/XLS files with Solr?

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm trying to design my Schema at present.
: My goal is to index PDF/Doc/XLS files with the following fields:

I strongly suggest you ask these questions on the solr-user@lucene mailing 
list.  general@lucene is for general discussions about all Lucene 
projects, or for questions from people interested in "search"-related 
technologies who don't yet know which Lucene subproject(s) might be useful 
for them.


-Hoss