Posted to java-user@lucene.apache.org by Nigel V Thomas <em...@gmail.com> on 2013/06/27 01:47:33 UTC

Re: Indexing file with security problem

Hello Lucas,

Firstly, I am moving this discussion from the dev list to the java-user
mailing list. Please refer to the guidelines on where to post such
discussions: http://lucene.apache.org/core/discussion.html

A couple of observations on your problem: I think the task can be broken
down into a few key problems:

1* Indexing content from multiple data sources and building a single
record. This should be possible.
2* Applying access control to the search index, perhaps by synchronising
security policies stored externally, for example fetching role information
from Active Directory and other properties from the file system. Again,
this should be possible.
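
As a rough illustration of the first point, here is a minimal sketch of
merging a database row and the text extracted from the stored file into one
flat record before indexing. The class, the field names ("id", "name",
"type", "content") and the sample values are all assumptions made for
illustration; this is not Lucene or ManifoldCF API, and text extraction from
pdf/word files is left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: combine database metadata and file text into one record. */
final class CombinedRecordSketch {

    /** A metadata row as it might come from the database (fields are assumed). */
    static final class FileMetadata {
        final long id;
        final String name;
        final String type;

        FileMetadata(long id, String name, String type) {
            this.id = id;
            this.name = name;
            this.type = type;
        }
    }

    /**
     * Builds a single flat record from both sources. In a real index each
     * entry would become one field of the indexed document, and only "id"
     * would need to be returned to the caller at search time.
     */
    static Map<String, String> buildRecord(FileMetadata meta, String extractedText) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("id", Long.toString(meta.id));
        record.put("name", meta.name);
        record.put("type", meta.type);
        record.put("content", extractedText);
        return record;
    }

    public static void main(String[] args) {
        FileMetadata meta = new FileMetadata(42L, "test.pdf", "pdf");
        // The text would come from the copy of the file on the WebDAV server.
        System.out.println(buildRecord(meta, "text extracted from the file"));
    }
}
```

The point of the design is that the join between the two sources happens
once, at indexing time, so a query only has to hand back the stored id.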

For these two tasks, I would recommend taking a look at Apache ManifoldCF
(MCF) at http://manifoldcf.apache.org/en_US/index.html. MCF provides a
connector framework for indexing content from external data sources, and it
also gives you a way to sync the data at a preset interval. Similarly, MCF
can handle certain security requirements: it supports document-level access
control and integrates with several existing user identity and security
policy sources, such as Active Directory. Note, however, that this
framework adopts a polling model to fetch both data and security properties
from the respective data sources. If you need dynamic, near-real-time
updates, you may need to consider coding it yourself. It should be possible
to extend the MCF framework to support these requirements; it is worth
asking the same question on the MCF mailing lists:
http://manifoldcf.apache.org/en_US/mail.html

If you find that MCF is not suitable, and are interested in exploring the
security and real time update requirements further, let me know, I may be
able to point you to some further references.
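
To give an idea of what document-level access control can look like, here
is a minimal in-memory sketch, assuming each document is indexed together
with a set of "allow" tokens and each user is resolved to a set of tokens at
query time. The token strings (for example "role:sales"), the class names
and the substring matching are assumptions for illustration only; in a real
index the token check would usually be an extra required clause of the query
rather than a loop, and none of this is actual Lucene or MCF API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: restrict search hits using per-document access tokens. */
final class AclSearchSketch {

    /** A document as stored in the index, with its allow tokens (assumed model). */
    static final class IndexedDoc {
        final String fileId;
        final String content;
        final Set<String> allowTokens;

        IndexedDoc(String fileId, String content, String... allowTokens) {
            this.fileId = fileId;
            this.content = content.toLowerCase(Locale.ROOT);
            this.allowTokens = new HashSet<>(Arrays.asList(allowTokens));
        }
    }

    /**
     * Returns the ids of documents that contain the term AND share at least
     * one token with the user. The access check is part of matching, so
     * results never need to be filtered one by one after the search.
     */
    static List<String> search(List<IndexedDoc> index, String term, Set<String> userTokens) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (IndexedDoc doc : index) {
            boolean allowed = !Collections.disjoint(doc.allowTokens, userTokens);
            if (allowed && doc.content.contains(needle)) {
                hits.add(doc.fileId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<IndexedDoc> index = Arrays.asList(
                new IndexedDoc("test.pdf", "Lucene and Solr notes", "role:sales"),
                new IndexedDoc("broken.pdf", "Lucene internals", "role:engineering"),
                new IndexedDoc("test2.doc", "Solr setup with Lucene", "role:sales", "role:engineering"));

        // Tokens the user resolves to at login time (values are made up).
        Set<String> maggie = new HashSet<>(Arrays.asList("role:sales"));
        System.out.println(search(index, "lucene", maggie));
    }
}
```

Because the user's tokens are resolved at query time, a change in a user's
roles takes effect on the next search. A change to a document's own security
properties still requires that document's tokens to be re-indexed, which is
where the polling versus near-real-time question comes in.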

3* Indexing encrypted files would transfer the additional responsibility
of maintaining the confidentiality of the documents to your index, so you
may need an encrypted index and must ensure privacy is preserved for both
the index and the queries sent to the search system. This is a complex
problem, and I am not aware of any suitable solutions yet.

Nigel V Thomas

On 26 June 2013 20:42, lukasw <lu...@gmail.com> wrote:

> Hello
>
> I'll try to briefly describe my problem and task.
> My name is Lukas and I am a Java developer. My task is to create a search
> engine for different types of files (text file types only): pdf, word,
> odf, xml, but not html.
> I have a little experience with Lucene: about a year ago I wrote a simple
> full-text search using Lucene and Hibernate Search. That was a simple
> project, but now I have a very difficult search task.
> We are using Java 1.7 and GlassFish 3, and I have to concentrate only on
> the server side, not the client UI. These are my three major problems:
>
> 1) All files are stored on a WebDAV server, but information about each
> file (name, id, file type, etc.) is stored in a database (PostgreSQL), so
> when I create the index I need to use both sources. As the result of a
> query I only need to return the file id from the database. In summary, the
> content of a file is stored on the server but information about the file
> is stored in the database, so we must retrieve both.
>
> 2) The second problem is that each file has a level of secrecy, and the
> major problem is that this level is calculated dynamically. When
> calculating the security level for a file we consider several properties.
> The static properties are the file's location and the folder the file is
> in, but there is also dynamic information: user profiles, user roles and
> departments. So when user "Maggie" is logged in she can search only the
> files "test.pdf", "test2.doc", etc., but when user "Stev" is logged in he
> has different profiles from Maggie, so he can only search for a phrase in
> the files "broken.pdf", "mybook.odt", "test2.doc", etc. I think that when,
> for example, a user searches for the phrase "lucene +solr", we search all
> indexed documents and then filter the results. But I think that solution
> is not very efficient. What if the result contains 100 files? Do we then
> filter the files one by one? But I do not see any other solution. Maybe
> you can help me, and Lucene or Solr has a mechanism to help.
>
> 3) The last problem is that some files are encrypted, so those files must
> be indexed only once, before encryption! But I think that if we index
> secure files we get a security issue, because every word from the file is
> tokenized. I have no idea how to secure the Lucene documents and the index
> datastore. Is it possible?
>
>
> I also have a question: do I need to use Solr for my search engine, or
> should I use only Lucene and write my own search engine? As you can see, I
> do not have a problem with indexing or searching, but with file security
> and file security levels.
>
> Thanks for any hints and for the time you spend on this.