Posted to solr-user@lucene.apache.org by Darx Oman <da...@gmail.com> on 2010/11/23 09:35:23 UTC

Basic Solr Configurations and best practice

Hi guys

I'm fairly new to Solr and I'm wondering how to configure it to best
fulfill my requirements.

The requirements are as follows:

I have two data sources: a database and documents on the file system. Every
document on the file system has related information stored in the database.
Both the file content and the related database fields must be indexed. The
DB data also includes per-user permissions for every document. I'm using DIH
for the DB and Tika for the file system. The document contents almost never
change, while the DB data, especially the permissions, changes very
frequently. There are roughly 2M documents in total, each about 500KB.

1-      How do I combine the data from DIH and the content extracted from a
file system document into one document in the index?

2-      Should I move the per-user permissions into a separate index? What
technique should I use to implement that?

Re: Basic Solr Configurations and best practice

Posted by Lance Norskog <go...@gmail.com>.
Solr 4? Do you mean the Solr 'trunk' source or the Solr 1.4.1 release?

The 1.4.1 release does not have the TikaEntityProcessor, only the /extract code.

The Solr 3.x branch and the trunk both have the TikaEP. I use the 3.x
branch and, well, the TikaEP has a few problems, but they can be hacked
around.

Each Solr release only works with the version of Tika that ships with
it.

Lance

On Sun, Nov 28, 2010 at 10:33 PM, Darx Oman <da...@gmail.com> wrote:
> Thanks, Alexey.
> I downloaded Solr 4 and set up the TikaEntityProcessor. It worked fine
> with Tika 0.6, but not with Tika 0.7 or Tika 0.8 SNAPSHOT.



-- 
Lance Norskog
goksron@gmail.com

Re: Basic Solr Configurations and best practice

Posted by Darx Oman <da...@gmail.com>.
Thanks, Alexey.
I downloaded Solr 4 and set up the TikaEntityProcessor. It worked fine
with Tika 0.6, but not with Tika 0.7 or Tika 0.8 SNAPSHOT.


On Sat, Nov 27, 2010 at 4:05 AM, Alexey Serba <as...@gmail.com> wrote:

> > 1-      How to combine data from DIH and content extracted from file
> > system document into one document in the index?
> http://wiki.apache.org/solr/TikaEntityProcessor
> You can have one SQL entity that retrieves metadata from the database and
> another nested entity that parses the binary file into additional fields
> in the document.
>
> > 2-      Should I move the per-user permissions into a separate index?
> > What technique to implement?
> I would start by keeping permissions in the same index as the actual
> content.

Re: Basic Solr Configurations and best practice

Posted by Alexey Serba <as...@gmail.com>.
> 1-      How to combine data from DIH and content extracted from file system
> document into one document in the index?
http://wiki.apache.org/solr/TikaEntityProcessor
You can have one SQL entity that retrieves metadata from the database and
another nested entity that parses the binary file into additional fields
in the document.
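
A minimal sketch of such a data-config.xml, assuming a hypothetical
"documents" table with id, title, and file_path columns. The JDBC driver,
connection URL, credentials, table, column, and Solr field names are all
placeholders that do not come from this thread, and the TikaEntityProcessor
is only available in Solr builds that include it:

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; substitute your own driver, URL and credentials -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/docs"
              user="solr" password="secret"/>
  <!-- Binary data source that Tika uses to read files from the file system -->
  <dataSource name="files" type="BinFileDataSource"/>

  <document>
    <!-- Outer entity: one row per document, metadata comes from the database -->
    <entity name="doc" dataSource="db"
            query="SELECT id, title, file_path FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <!-- Nested entity: parses the file referenced by the current row -->
      <entity name="file" processor="TikaEntityProcessor"
              dataSource="files" url="${doc.file_path}" format="text">
        <!-- "text" is the column that holds the extracted body -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested entity runs once per row of the outer entity, so the database
fields and the extracted file content end up in the same Solr document.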

> 2-      Should I move the per-user permissions into a separate index? What
> technique to implement?
I would start by keeping permissions in the same index as the actual content.
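
One way this could look, assuming a hypothetical multi-valued field named
allowed_users and a hypothetical permissions(doc_id, user_id) table (neither
name appears in this thread):

```xml
<!-- schema.xml: one value per user who is allowed to see the document -->
<field name="allowed_users" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- data-config.xml: nested inside the per-document entity,
     assumed here to be named "doc" and to expose an "id" column -->
<entity name="perm"
        query="SELECT user_id FROM permissions WHERE doc_id = '${doc.id}'">
  <field column="user_id" name="allowed_users"/>
</entity>
```

Searches can then be restricted per user with a filter query such as
fq=allowed_users:jsmith. Keep in mind that with this layout a permission
change means re-indexing the affected document.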

