Posted to solr-user@lucene.apache.org by "K. Seshadri Iyer" <se...@gmail.com> on 2010/11/12 05:38:44 UTC

A Newbie Question

Hi,

Pardon me if this sounds very elementary, but I have a very basic question
regarding Solr search. I have about 10 storage devices running Solaris with
hundreds of thousands of text files (there are other files, as well, but my
target is these text files). The directories on the Solaris boxes are
exported and are available as NFS mounts.

I have installed Solr 1.4 on a Linux box and have tested the installation,
using curl to post documents. However, the manual says that curl is not the
recommended way of posting documents to Solr. Could someone please tell me
what is the preferred approach in such an environment? I am not a programmer
and would appreciate some hand-holding here :o)

Thanks in advance,

Sesh

Re: A Newbie Question

Posted by Lance Norskog <go...@gmail.com>.

"There is no current feature" is what I meant. Yes, it would be very 
handy to do this.

I handled this problem in the DIH by creating two documents, both with 
the same unique ID. The first doc just had the metadata. The second 
document parsed the input with Tika, but had 'skip doc on error' set. 
So, if the parsing worked, the parsed document overwrote the first 
document. If parsing failed, the metadata-only document went in.

Works quite well!
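
The overwrite trick described above can be sketched outside of the DIH. This is only a minimal illustration in Python, not DIH configuration: the dict stands in for a Solr index keyed by unique ID, and the names index_with_fallback and strict_utf8 are hypothetical, made up for this example.

```python
from typing import Callable, Dict

Index = Dict[str, dict]  # stand-in for a Solr index keyed by unique ID


def index_with_fallback(index: Index, doc_id: str, metadata: dict,
                        raw: bytes, parse: Callable[[bytes], str]) -> bool:
    """Add a metadata-only doc, then overwrite it if parsing succeeds.

    Returns True when the parsed document replaced the metadata-only one.
    """
    # Step 1: the metadata-only document always goes in.
    index[doc_id] = {"id": doc_id, **metadata, "parsed": False}
    try:
        text = parse(raw)
    except Exception:
        # Same spirit as 'skip doc on error': the first document stays.
        return False
    # Step 2: same unique ID, so this replaces the metadata-only document.
    index[doc_id] = {"id": doc_id, **metadata, "text": text, "parsed": True}
    return True


def strict_utf8(raw: bytes) -> str:
    """Hypothetical parser that fails on anything that is not valid UTF-8."""
    return raw.decode("utf-8")
```

Every file ends up with exactly one document under its ID: the parsed one when extraction works, the metadata-only one when it does not.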

Ken Krugler wrote:
>
> On Nov 14, 2010, at 3:02pm, Lance Norskog wrote:
>
>> Yes, the ExtractingRequestHandler uses Tika to parse many file formats.
>>
>> Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7).
>>
>> Here's the problem with Tika and extraction utilities in general: 
>> they are not perfect. They will fail on some files. In the 
>> ExtractingRequestHandler's case, there is no way to let it fail in 
>> parsing but save the document's metadata anyway with a notation: 
>> "sorry not parsed".
>
> By "there is no way" do you mean in configuring the current 
> ExtractingRequestHandler? Or is there some fundamental issue with how 
> Solr uses Tika that prevents ExtractingRequestHandler from being 
> modified to work this way (which seems like a useful configuration 
> setting)?
>
> Regards,
>
> -- Ken

Re: A Newbie Question

Posted by Ken Krugler <kk...@transpac.com>.

On Nov 14, 2010, at 3:02pm, Lance Norskog wrote:

> Yes, the ExtractingRequestHandler uses Tika to parse many file  
> formats.
>
> Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7).
>
> Here's the problem with Tika and extraction utilities in general:  
> they are not perfect. They will fail on some files. In the  
> ExtractingRequestHandler's case, there is no way to let it fail in  
> parsing but save the document's metadata anyway with a notation:  
> "sorry not parsed".

By "there is no way" do you mean in configuring the current  
ExtractingRequestHandler? Or is there some fundamental issue with how  
Solr uses Tika that prevents ExtractingRequestHandler from being  
modified to work this way (which seems like a useful configuration  
setting)?

Regards,

-- Ken

> I would rather have the unix 'strings' command parse my documents  
> (thanks to a co-worker for this).

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: A Newbie Question

Posted by Lance Norskog <go...@gmail.com>.

Yes, the ExtractingRequestHandler uses Tika to parse many file formats.

Solr 1.4.1 uses a previous version of Tika (0.6 or 0.7).

Here's the problem with Tika and extraction utilities in general: they 
are not perfect. They will fail on some files. In the 
ExtractingRequestHandler's case, there is no way to let it fail in 
parsing but save the document's metadata anyway with a notation: "sorry 
not parsed".  I would rather have the unix 'strings' command parse my 
documents (thanks to a co-worker for this).
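
For readers who have not used 'strings': it pulls runs of printable characters out of arbitrary bytes, so it never "fails to parse". A rough Python approximation is sketched below. The function name extract_strings is made up for this example, the minimum run length of 4 mirrors the usual default of the Unix tool, and only plain ASCII is handled.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII characters.

    Printable here means space through tilde, plus tab. This is an
    approximation of the Unix 'strings' tool, not a reimplementation.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

The output is noisy for binary formats, but it always yields something searchable, which is the point of the comparison above.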

K. Seshadri Iyer wrote:
> Thanks for all the responses.
>
> Govind: To answer your question, yes, all I want to search is plain text
> files. They are located in NFS directories across multiple Solaris/Linux
> storage boxes. The total storage is in hundreds of terabytes.
>
> I have just got started with Solr and my understanding is that I will
> somehow need Tika to help stream/upload files to Solr. I don't know anything
> about Java programming, being a system admin. So far, I have read that the
> autodetect parser in Tika will somehow detect the file type and I can use
> the stream to populate Solr. How, that is still a mystery to me - working on
> it. Any tips appreciated; thanks in advance.
>
> Sesh

Re: A Newbie Question

Posted by "K. Seshadri Iyer" <se...@gmail.com>.

Thanks for all the responses.

Govind: To answer your question, yes, all I want to search is plain text
files. They are located in NFS directories across multiple Solaris/Linux
storage boxes. The total storage is in hundreds of terabytes.

I have just got started with Solr and my understanding is that I will
somehow need Tika to help stream/upload files to Solr. I don't know anything
about Java programming, being a system admin. So far, I have read that the
autodetect parser in Tika will somehow detect the file type and I can use
the stream to populate Solr. How, that is still a mystery to me - working on
it. Any tips appreciated; thanks in advance.

Sesh



On 13 November 2010 15:24, Govind Kanshi <go...@gmail.com> wrote:

> Another point of view you might want to think about: what kind of search
> do you want? Just plain full-text search, or is there something more to
> those text files? Are they grouped in folders? Do the folders imply a
> certain kind of grouping/hierarchy/tagging?
>
> I recently was trying to help somebody who had files across a lot of places
> grouped by date/subject/author - he wanted to ensure these are "fields"
> which can also act as filters/navigators.
>
> Just an input - ignore it if you just want plain full text search.

Re: A Newbie Question

Posted by Govind Kanshi <go...@gmail.com>.

Another point of view you might want to think about: what kind of search
do you want? Just plain full-text search, or is there something more to
those text files? Are they grouped in folders? Do the folders imply a
certain kind of grouping/hierarchy/tagging?

I recently was trying to help somebody who had files across a lot of places
grouped by date/subject/author - he wanted to ensure these are "fields"
which can also act as filters/navigators.

Just an input - ignore it if you just want plain full text search.
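
To make the folders-as-fields idea concrete, here is a small Python sketch. It assumes a purely hypothetical layout of root/author/subject/date/file, and the field names are illustrative, not taken from any real schema; a different directory convention would need different unpacking.

```python
from pathlib import PurePosixPath


def fields_from_path(path: str, root: str) -> dict:
    """Turn a file path into field values, assuming root/author/subject/date/file."""
    # relative_to raises ValueError if path is not under root.
    parts = PurePosixPath(path).relative_to(root).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected layout: {path}")
    author, subject, date, filename = parts
    return {
        "id": path,  # the full path doubles as the unique ID
        "author": author,
        "subject": subject,
        "date": date,
        "filename": filename,
    }
```

Each value could then be sent as its own field alongside the file text, so the search side can filter or facet on it.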

On Sat, Nov 13, 2010 at 11:25 AM, Lance Norskog <go...@gmail.com> wrote:

> About web servers: Solr is a servlet war file and needs a Java web server
> "container" to run. The example/ folder in the Solr distribution uses
> 'Jetty', and this is fine for small production-quality projects.  You can
> just copy the example/ directory somewhere to set up your own running Solr;
> that's what I always do.
>
> About indexing programs: if you know Unix scripting, it may be easiest to
> walk the file system yourself with the 'find' program and create Solr input
> XML files.
>
> But yes, you definitely want the Solr 1.4 Enterprise manual. I spent months
> learning this stuff very slowly, and the book would have been great back
> then.
>
> Lance
>
>
> Erick Erickson wrote:
>
>> Think of the data import handler (DIH) as Solr pulling data to index
>> from some source based on configuration. So, once you set up
>> your DIH config to point to your file system, you issue a command
>> to solr like "OK, do your data import thing". See the
>> FileListEntityProcessor.
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> SolrJ is a client library you'd use to push data to Solr. Basically, you
>> write a Java program that uses SolrJ to walk the file system, find
>> documents, create a Solr document and send that to Solr. It's not
>> nearly as complex as it sounds <G>. See:
>> http://wiki.apache.org/solr/Solrj
>>
>> It's probably worth your while to get a copy of
>> "Solr 1.4 Enterprise Search Server" by Eric Pugh and David Smiley.
>>
>> Best
>> Erick
>>
>> On Fri, Nov 12, 2010 at 8:37 AM, K. Seshadri Iyer <seshadribpl@gmail.com> wrote:
>>
>>> Hi Lance,
>>>
>>> Thank you very much for responding (not sure how I reply to the group,
>>> so, writing to you).
>>>
>>> Can you please expand on your suggestion? I am not a web guy and so,
>>> don't know where to start.
>>>
>>> What is the difference between SolrJ and DataImportHandler? Do I need to
>>> set up web servers on all my storage boxes?
>>>
>>> Apologies for the basic level of questions, but hope I can get started
>>> and implement this before the year end (you know why :o)
>>>
>>> Thanks,
>>>
>>> Sesh
>>>
>>> On 12 November 2010 13:31, Lance Norskog<go...@gmail.com>  wrote:
>>>
>>>
>>>
>>>> Using 'curl' is fine. There is a library called SolrJ for Java and
>>>> other libraries for other scripting languages that let you upload with
>>>> more control. There is a thing in Solr called the DataImportHandler
>>>> that lets you script walking a file system.
>>>>
>>>> On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <seshadribpl@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Pardon me if this sounds very elementary, but I have a very basic question
>>>>> regarding Solr search. I have about 10 storage devices running Solaris with
>>>>> hundreds of thousands of text files (there are other files, as well, but my
>>>>> target is these text files). The directories on the Solaris boxes are
>>>>> exported and are available as NFS mounts.
>>>>>
>>>>> I have installed Solr 1.4 on a Linux box and have tested the installation,
>>>>> using curl to post documents. However, the manual says that curl is not the
>>>>> recommended way of posting documents to Solr. Could someone please tell me
>>>>> what is the preferred approach in such an environment? I am not a programmer
>>>>> and would appreciate some hand-holding here :o)
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Sesh
>>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

Re: A Newbie Question

Posted by Lance Norskog <go...@gmail.com>.

About web servers: Solr is a servlet war file and needs a Java web
server "container" to run. The example/ folder in the Solr distribution
uses 'Jetty', and this is fine for small production-quality projects.
You can just copy the example/ directory somewhere to set up your own
running Solr; that's what I always do.

About indexing programs: if you know Unix scripting, it may be easiest 
to walk the file system yourself with the 'find' program and create Solr 
input XML files.
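
For anyone who would rather not do this in shell, the same walk-and-write-XML
idea can be sketched in Python. This is only an illustration under
assumptions: the field names "id" and "text" must match whatever your
schema.xml defines, and build_add_xml is a made-up helper name, not
something that ships with Solr.

```python
import os
from xml.sax.saxutils import escape


def build_add_xml(root_dir, suffix=".txt"):
    """Walk root_dir and return a Solr <add> XML string, one <doc> per file.

    Assumption: the schema has an "id" field (unique key) and a "text"
    field. Control characters that are illegal in XML are not filtered here.
    """
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue  # skip the non-text files on the mount
            path = os.path.join(dirpath, name)
            # Undecodable bytes are replaced rather than aborting the walk.
            with open(path, encoding="utf-8", errors="replace") as fh:
                body = fh.read()
            docs.append(
                "<doc>"
                '<field name="id">%s</field>'
                '<field name="text">%s</field>'
                "</doc>" % (escape(path), escape(body))
            )
    return "<add>" + "".join(docs) + "</add>"
```

The returned string can be saved to a file and posted with curl exactly as
in the tests already done. With hundreds of thousands of files it would
probably make sense to write one XML file per batch of a few hundred
documents rather than a single huge one.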

But yes, you definitely want the Solr 1.4 Enterprise manual. I spent 
months learning this stuff very slowly, and the book would have been 
great back then.

Lance

Erick Erickson wrote:
> Think of the data import handler (DIH) as Solr pulling data to index
> from some source based on configuration. So, once you set up
> your DIH config to point to your file system, you issue a command
> to solr like "OK, do your data import thing". See the
> FileListEntityProcessor.
> http://wiki.apache.org/solr/DataImportHandler
>
> SolrJ is a client library you'd use to push data to Solr. Basically, you
> write a Java program that uses SolrJ to walk the file system, find
> documents, create a Solr document and send that to Solr. It's not
> nearly as complex as it sounds <G>. See:
> http://wiki.apache.org/solr/Solrj
>
> It's probably worth your while to get a copy of
> "Solr 1.4 Enterprise Search Server" by Eric Pugh and David Smiley.
>
> Best
> Erick
>
> On Fri, Nov 12, 2010 at 8:37 AM, K. Seshadri Iyer <se...@gmail.com> wrote:
>
>> Hi Lance,
>>
>> Thank you very much for responding (not sure how I reply to the group, so,
>> writing to you).
>>
>> Can you please expand on your suggestion? I am not a web guy and so, don't
>> know where to start.
>>
>> What is the difference between SolrJ and DataImportHandler? Do I need to set
>> up web servers on all my storage boxes?
>>
>> Apologies for the basic level of questions, but hope I can get started and
>> implement this before the year end (you know why :o)
>>
>> Thanks,
>>
>> Sesh
>>
>> On 12 November 2010 13:31, Lance Norskog <go...@gmail.com> wrote:
>>
>>> Using 'curl' is fine. There is a library called SolrJ for Java and
>>> other libraries for other scripting languages that let you upload with
>>> more control. There is a thing in Solr called the DataImportHandler
>>> that lets you script walking a file system.
>>>
>>> On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <seshadribpl@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Pardon me if this sounds very elementary, but I have a very basic question
>>>> regarding Solr search. I have about 10 storage devices running Solaris with
>>>> hundreds of thousands of text files (there are other files, as well, but my
>>>> target is these text files). The directories on the Solaris boxes are
>>>> exported and are available as NFS mounts.
>>>>
>>>> I have installed Solr 1.4 on a Linux box and have tested the installation,
>>>> using curl to post documents. However, the manual says that curl is not the
>>>> recommended way of posting documents to Solr. Could someone please tell me
>>>> what is the preferred approach in such an environment? I am not a programmer
>>>> and would appreciate some hand-holding here :o)
>>>>
>>>> Thanks in advance,
>>>>
>>>> Sesh
>>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>>        
>>      
>

Re: A Newbie Question

Posted by Erick Erickson <er...@gmail.com>.

Think of the data import handler (DIH) as Solr pulling data to index
from some source based on configuration. So, once you set up
your DIH config to point to your file system, you issue a command
to solr like "OK, do your data import thing". See the
FileListEntityProcessor.
http://wiki.apache.org/solr/DataImportHandler
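
To make the "OK, do your data import thing" step concrete: the command is
just an HTTP request to the handler. Here is a minimal Python sketch that
builds that URL. It assumes the handler is registered in solrconfig.xml
under the name /dataimport (a common convention, but your config decides)
and that Solr runs on the example Jetty port 8983; dih_command_url is a
hypothetical helper name.

```python
from urllib.parse import urlencode


def dih_command_url(solr_base, command="full-import", handler="/dataimport"):
    """Build the URL that asks a DataImportHandler to run a command.

    Assumption: `handler` matches the requestHandler name in solrconfig.xml.
    """
    return solr_base.rstrip("/") + handler + "?" + urlencode({"command": command})
```

Calling dih_command_url("http://localhost:8983/solr") gives
http://localhost:8983/solr/dataimport?command=full-import, which you can
open in a browser or fetch with curl. Passing command="status" builds the
URL for checking on a running import.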

SolrJ is a client library you'd use to push data to Solr. Basically, you
write a Java program that uses SolrJ to walk the file system, find
documents, create a Solr document and send that to Solr. It's not
nearly as complex as it sounds <G>. See:
http://wiki.apache.org/solr/Solrj
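
If Java is a stretch, the same "push" pattern can be sketched without
SolrJ. To be clear, the following is not SolrJ: it is plain Python talking
to Solr's XML update handler over HTTP, which is the same thing curl does.
The helper names are made up for illustration, and the /update path and
port 8983 assume the stock example setup.

```python
import urllib.request


def build_update_request(solr_base, xml_payload):
    """Prepare (but do not send) a POST to Solr's XML update handler."""
    url = solr_base.rstrip("/") + "/update"
    return urllib.request.Request(
        url,
        data=xml_payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_update(solr_base, xml_payload):
    """Send the payload to Solr and return the HTTP status code."""
    request = build_update_request(solr_base, xml_payload)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A client program would call post_update once per batch of <add> XML and
then once more with "<commit/>" so the new documents become searchable.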

It's probably worth your while to get a copy of
"Solr 1.4 Enterprise Search Server" by Eric Pugh and David Smiley.

Best
Erick

On Fri, Nov 12, 2010 at 8:37 AM, K. Seshadri Iyer <se...@gmail.com> wrote:

> Hi Lance,
>
> Thank you very much for responding (not sure how I reply to the group, so,
> writing to you).
>
> Can you please expand on your suggestion? I am not a web guy and so, don't
> know where to start.
>
> What is the difference between SolrJ and DataImportHandler? Do I need to set
> up web servers on all my storage boxes?
>
> Apologies for the basic level of questions, but hope I can get started and
> implement this before the year end (you know why :o)
>
> Thanks,
>
> Sesh
>
> On 12 November 2010 13:31, Lance Norskog <go...@gmail.com> wrote:
>
> > Using 'curl' is fine. There is a library called SolrJ for Java and
> > other libraries for other scripting languages that let you upload with
> > more control. There is a thing in Solr called the DataImportHandler
> > that lets you script walking a file system.
> >
> > On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <seshadribpl@gmail.com> wrote:
> > > Hi,
> > >
> > > Pardon me if this sounds very elementary, but I have a very basic question
> > > regarding Solr search. I have about 10 storage devices running Solaris with
> > > hundreds of thousands of text files (there are other files, as well, but my
> > > target is these text files). The directories on the Solaris boxes are
> > > exported and are available as NFS mounts.
> > >
> > > I have installed Solr 1.4 on a Linux box and have tested the installation,
> > > using curl to post documents. However, the manual says that curl is not the
> > > recommended way of posting documents to Solr. Could someone please tell me
> > > what is the preferred approach in such an environment? I am not a programmer
> > > and would appreciate some hand-holding here :o)
> > >
> > > Thanks in advance,
> > >
> > > Sesh
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>

Re: A Newbie Question

Posted by "K. Seshadri Iyer" <se...@gmail.com>.

Hi Lance,

Thank you very much for responding (not sure how I reply to the group, so,
writing to you).

Can you please expand on your suggestion? I am not a web guy and so, don't
know where to start.

What is the difference between SolrJ and DataImportHandler? Do I need to set
up web servers on all my storage boxes?

Apologies for the basic level of questions, but hope I can get started and
implement this before the year end (you know why :o)

Thanks,

Sesh

On 12 November 2010 13:31, Lance Norskog <go...@gmail.com> wrote:

> Using 'curl' is fine. There is a library called SolrJ for Java and
> other libraries for other scripting languages that let you upload with
> more control. There is a thing in Solr called the DataImportHandler
> that lets you script walking a file system.
>
> On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <se...@gmail.com> wrote:
> > Hi,
> >
> > Pardon me if this sounds very elementary, but I have a very basic question
> > regarding Solr search. I have about 10 storage devices running Solaris with
> > hundreds of thousands of text files (there are other files, as well, but my
> > target is these text files). The directories on the Solaris boxes are
> > exported and are available as NFS mounts.
> >
> > I have installed Solr 1.4 on a Linux box and have tested the installation,
> > using curl to post documents. However, the manual says that curl is not the
> > recommended way of posting documents to Solr. Could someone please tell me
> > what is the preferred approach in such an environment? I am not a programmer
> > and would appreciate some hand-holding here :o)
> >
> > Thanks in advance,
> >
> > Sesh
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: A Newbie Question

Posted by Lance Norskog <go...@gmail.com>.

Using 'curl' is fine. There is a library called SolrJ for Java and
other libraries for other scripting languages that let you upload with
more control. There is a thing in Solr called the DataImportHandler
that lets you script walking a file system.

On Thu, Nov 11, 2010 at 8:38 PM, K. Seshadri Iyer <se...@gmail.com> wrote:
> Hi,
>
> Pardon me if this sounds very elementary, but I have a very basic question
> regarding Solr search. I have about 10 storage devices running Solaris with
> hundreds of thousands of text files (there are other files, as well, but my
> target is these text files). The directories on the Solaris boxes are
> exported and are available as NFS mounts.
>
> I have installed Solr 1.4 on a Linux box and have tested the installation,
> using curl to post  documents. However, the manual says that curl is not the
> recommended way of posting documents to Solr. Could someone please tell me
> what is the preferred approach in such an environment? I am not a programmer
> and would appreciate some hand-holding here :o)
>
> Thanks in advance,
>
> Sesh
>



-- 
Lance Norskog
goksron@gmail.com