Posted to solr-user@lucene.apache.org by Wilhelm Eger <wi...@gmail.com> on 2017/02/22 12:14:27 UTC

Additional information from external database

Hi!

I am using a Datafari setup (www.datafari.com), which more or less combines
a ManifoldCF file index with Solr as a search engine.

My setup consists of ~350000 files, mainly doc(x), xls(x), msg and pdf files.
The pdf files are OCR'd externally before they are added to the ManifoldCF
index. Only the remaining image files (png, jpg) are OCR'd on the fly while
being imported.

The files are actually part of an external file management system (files in the
literal sense of the word, not in the sense of entities saved on the hard
disk), which is not related to ManifoldCF/Solr at all. Unfortunately, this
system does not provide a proper full-text search, hence I implemented it as
outlined above.

However, the users are used to certain file numbers provided by this file
management system. These file numbers are stored in an MSSQL database, which is
accessible from the host my setup is running on. I can easily get the file
number by sending a corresponding SQL statement, based on the file name (of the
entity saved on the hard disk), to the SQL Server. Hence, for each file name,
there is a file number stored in the database. I would like these file numbers
to be stored in a specific field of the Solr index and shown in the (Tomcat)
output, e.g.:

File name: /data/1003234234.docx
Content: "This is the content. You searched for _text_."
File name belongs to file number: SUI-G-25-A

Is there any way to achieve that? Did I understand correctly that this could
happen either in ManifoldCF during indexing or in Solr during importing?

I know that there is a Tika plugin to talk to databases, which could be fed
with an SQL statement. But how do I connect it with the data retrieved from the
file crawler?

Alternatively, I could also call an external script (bash, python) to retrieve 
the respective data from the database using bsqldb.
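
The external-script route could be sketched in Python roughly as follows. This
is only a sketch under stated assumptions, not a tested Datafari integration:
the table `file_numbers`, its columns, and the Solr field `file_number` are
hypothetical names; the in-memory `sqlite3` database merely stands in for the
MSSQL server (a DB-API driver such as pyodbc would take its place); and the
Solr document id is assumed to equal the crawled path, which may not match
what ManifoldCF actually uses. The script looks up each file number by base
file name and builds a Solr atomic-update payload (`{"set": ...}`) that could
be POSTed as JSON to the core's /update handler.

```python
import json
import os
import sqlite3  # stand-in for the MSSQL connection (assumption for this sketch)


def lookup_file_number(conn, file_path):
    """Return the file number for a crawled path, or None if it is unknown."""
    base_name = os.path.basename(file_path)
    row = conn.execute(
        # Hypothetical table and column names; adapt to the real schema.
        "SELECT file_number FROM file_numbers WHERE file_name = ?",
        (base_name,),
    ).fetchone()
    return row[0] if row else None


def build_atomic_update(doc_id, file_number):
    """Build one Solr atomic-update document that sets a single field."""
    # "file_number" is a hypothetical Solr field name.
    return {"id": doc_id, "file_number": {"set": file_number}}


def build_update_payload(conn, file_paths):
    """Build the JSON body for Solr's /update handler, skipping unknown files."""
    docs = []
    for path in file_paths:
        number = lookup_file_number(conn, path)
        if number is not None:
            # Assumption: the Solr document id equals the crawled path.
            docs.append(build_atomic_update(path, number))
    return json.dumps(docs)


def demo_connection():
    """Create an in-memory database holding the example row from the post."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_numbers (file_name TEXT PRIMARY KEY, file_number TEXT)"
    )
    conn.execute(
        "INSERT INTO file_numbers VALUES (?, ?)",
        ("1003234234.docx", "SUI-G-25-A"),
    )
    return conn


if __name__ == "__main__":
    paths = ["/data/1003234234.docx", "/data/unknown.pdf"]
    print(build_update_payload(demo_connection(), paths))
```

Atomic updates only preserve a document's other fields if those fields are
stored (or have docValues), so the schema would need checking before relying
on this approach.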

Any hint in the right direction is very much appreciated.

Thanks in advance,

Wilhelm


Re: Additional information from external database

Posted by Erick Erickson <er...@gmail.com>.
There really isn't a _Tika_ database connector; Tika parses
structured files. A typical JDBC connector can connect to a DB. You
might be thinking of the Data Import Handler (DIH).

Here's a program that both uses Tika and connects to a DB that might
give you a hint. It uses an older version of Solr, but should be
fairly easily modifiable.

https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick
