Posted to solr-user@lucene.apache.org by pankaj bhatt <pa...@gmail.com> on 2010/12/10 12:24:13 UTC

Indexing documents with SOLR

Hi All,
I am new to Solr and am trying to integrate Tika with Solr.
Can anyone please guide me on how to achieve this?

My requirement: I have a directory containing a lot of PDF and DOC files, and I
need to search within those documents. I am using the Solr web application.

I just need some sample XML for both solrconfig.xml and schema.xml.
Eagerly awaiting your response.

Regards,
Pankaj Bhatt.

Re: Indexing documents with SOLR

Posted by Adam Estrada <es...@gmail.com>.
Pankaj,

Check out this article on how to get going with Nutch: http://bit.ly/dbBdK4
It is a few months old, so note that there is now a new parameter, called
something like -SolrUrl, that lets you update your Solr index with the
crawled data.

For crawling your local file system, you will have to change the http:// to
file:// in your seed.txt file to point to the directory you want to crawl.
Another VERY important option is to increase your Java heap size. I do this
by using the JAVA_OPT environment variable.
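
As a rough sketch of the seed change described above (not taken from the linked article), the file:// URL for a local directory can be generated rather than typed by hand. The function names and the seed.txt default below are assumptions made for illustration; the only library behaviour relied on is Python's pathlib, which percent-encodes characters such as spaces.

```python
from pathlib import Path


def directory_seed_url(directory):
    """Return a file:// URL for `directory`, with a trailing slash.

    Path.as_uri() needs an absolute path, so the path is resolved first.
    """
    return Path(directory).resolve().as_uri() + "/"


def write_seed_file(directories, seed_path="seed.txt"):
    """Write one file:// URL per line to `seed_path` (file name assumed)."""
    urls = [directory_seed_url(d) for d in directories]
    Path(seed_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

Calling write_seed_file(["/data/docs"]) would write a single line starting with file:/// to seed.txt. Whether the crawler then follows that URL still depends on its own URL filter and protocol plugin settings, which this sketch does not touch.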

Adam

On Sat, Dec 11, 2010 at 8:27 AM, pankaj bhatt <pa...@gmail.com> wrote:

> Hi Adam,
>    Thanks a lot for pointing me to Nutch.
>    Can you please tell me whether Nutch can read a directory on the
> local system or on a shared file system?
>
>   I will wait for your response.
>
> / Pankaj Bhatt
>
>
> On Fri, Dec 10, 2010 at 9:35 PM, Adam Estrada <es...@gmail.com> wrote:
>
>> Nutch is also a great option if you want a crawler. I have found that you
>> will need to use the latest version of PDFBox and its dependencies for
>> better results. Also, make sure to set JAVA_OPT to something really large
>> so that you won't exceed your heap size.
>>
>> Adam
>>
>> On Fri, Dec 10, 2010 at 6:27 AM, Tommaso Teofili
>> <to...@gmail.com> wrote:
>>
>> > Hi Pankaj,
>> > you can find the needed documentation right here [1].
>> > Hope this helps,
>> > Tommaso
>> >
>> > [1] : http://wiki.apache.org/solr/ExtractingRequestHandler

Re: Indexing documents with SOLR

Posted by Adam Estrada <es...@gmail.com>.
Nutch is also a great option if you want a crawler. I have found that you
will need to use the latest version of PDFBox and its dependencies for
better results. Also, make sure to set JAVA_OPT to something really large so
that you won't exceed your heap size.

Adam

Re: Indexing documents with SOLR

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Pankaj,
you can find the needed documentation right here [1].
Hope this helps,
Tommaso

[1] : http://wiki.apache.org/solr/ExtractingRequestHandler
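
For readers who want a starting point before reading [1], here is a minimal sketch of how requests to the extracting handler might be planned for a directory of PDF and DOC files. It assumes the handler is registered at /update/extract, that Solr runs at http://localhost:8983/solr, and that the schema's unique key field is named id; all three are assumptions to check against your own solrconfig.xml and schema.xml. The function names are hypothetical. The sketch only builds the request URLs; each file's bytes would then be sent as the body of an HTTP POST to the matching URL.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumed Solr location; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr"


def extract_request_url(file_path, solr_base=SOLR_BASE, commit=False):
    """Build the URL for sending one file to the extracting handler.

    literal.id sets the document's id field (assumed to be the unique key)
    to the file name; commit=true asks Solr to commit after this request.
    """
    params = {"literal.id": Path(file_path).name}
    if commit:
        params["commit"] = "true"
    return f"{solr_base}/update/extract?{urlencode(params)}"


def plan_requests(directory, extensions=(".pdf", ".doc")):
    """Return (path, url) pairs for every matching file under `directory`."""
    files = sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    return [(p, extract_request_url(p)) for p in files]
```

Using the bare file name as the id is the simplest choice but collides when two subdirectories hold files with the same name; a relative path would be a safer id in that case.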
