Posted to user@manifoldcf.apache.org by Wilhelm Eger <wi...@gmail.com> on 2017/02/22 12:01:17 UTC

Additional information from external database

Hi!

I am using a setup of Datafari (www.datafari.com), which more or less combines
a ManifoldCF file index with Solr as a search engine.

My setup consists of ~350,000 files, mainly doc(x), xls(x), msg and pdf files.
PDF files are OCR'd externally before they are added to the ManifoldCF index.
Only the remaining image files (png, jpg) are OCR'd on the fly when being
imported.

The files are actually part of an external file management system (files in
the literal sense, not entities saved on the hard disk), which is not related
to ManifoldCF/Solr at all. This system unfortunately does not provide a proper
full-text search, hence I implemented one as outlined above.

However, the users are used to certain file numbers provided by this file
management system. These file numbers are stored in an MSSQL database, which
is accessible from the host my setup is running on. I can easily get the file
number by sending a suitable SQL statement, based on the file name (of the
entity saved on the hard disk), to the SQL Server. Hence, for each file name
there is a file number stored in the database. I would like these file numbers
to be stored in a specific field of the Solr index, to be shown in the
(Tomcat) output, e.g.:

File name: /data/1003234234.docx
Content: "This is the content. You searched for _text_."
File name belongs to file number: SUI-G-25-A

Is there any possibility to achieve that? Did I understand correctly that this
could happen either in ManifoldCF during indexing or in Solr during importing?

I know that there is a Tika plugin to talk to databases, which could be fed
a SQL statement. But how can it be connected with the data retrieved by the
file crawler?

Alternatively, I could also call an external script (bash, python) to retrieve 
the respective data from the database using bsqldb.
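The lookup described here, taking the file name of the crawled entity and asking the MSSQL server for the matching file number, is plain JDBC on the Java side. The sketch below is only an illustration, not code from this thread: the table and column names (`file_register`, `file_number`, `file_name`) are invented and would need to be replaced with the real schema, and a SQL Server JDBC driver would have to be on the classpath for the `lookupFileNumber` method to actually run.

```java
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class FileNumberLookup {

    // Hypothetical table/column names -- adjust to the real MSSQL schema.
    static final String LOOKUP_SQL =
        "SELECT file_number FROM file_register WHERE file_name = ?";

    // Derive the bare file name from the crawled path,
    // e.g. "/data/1003234234.docx" -> "1003234234.docx".
    static String fileNameFromPath(String path) {
        return Paths.get(path).getFileName().toString();
    }

    // Run the parameterized lookup against an open JDBC connection.
    // Returns null when the database has no entry for this file name.
    static String lookupFileNumber(Connection conn, String fileName)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(LOOKUP_SQL)) {
            ps.setString(1, fileName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

Using a PreparedStatement with a `?` placeholder (rather than concatenating the file name into the SQL text) also avoids quoting problems with unusual file names.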

Any hint in the right direction is very much appreciated.

Thanks in advance,

Wilhelm


Re: Additional information from external database

Posted by Karl Wright <da...@gmail.com>.
Hi Wilhelm,

Documents that come from the file system connector have a URL that includes
the file name, so you should have a way of finding the file name in your
connector.  There is also a RepositoryDocument file name field that you can
get, and I believe that too will be set.

Accessing databases using connectors requires use of JDBC.  There is a JDBC
connector whose code you can look at that might show you the way.

Adding fields to a RepositoryDocument object is trivial.  You can look at
how other transformation connectors do it, e.g. the Metadata Adjuster
transformation connector.

The book here is slightly old in that it doesn't cover transformation
connectors or notification connectors, but there is a chapter on general
connector concepts which you will likely find very valuable.

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

Chapters 5 and 6 are where you need to spend most of your time.  Chapter 9
describes output connectors, which have some similarities to transformation
connectors.  I would also look carefully at the Metadata Adjuster
transformation connector's code to see how it is done.
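To give a rough idea of how small the field-adding step is, here is a sketch. A minimal Map-backed `Document` class stands in for ManifoldCF's RepositoryDocument so the example compiles without the ManifoldCF jars on the classpath, and the field name "file_number" is an arbitrary choice, not anything mandated by ManifoldCF, Datafari, or Solr.

```java
import java.util.HashMap;
import java.util.Map;

class AddFieldSketch {

    // Minimal stand-in for ManifoldCF's RepositoryDocument, used here
    // only so this sketch is self-contained and compiles on its own.
    static class Document {
        final Map<String, String[]> fields = new HashMap<>();
        void addField(String name, String[] values) {
            fields.put(name, values);
        }
    }

    // The transformation step boils down to attaching the looked-up
    // file number under a field name of your choosing; the Solr schema
    // must have a matching field for it to be indexed.
    static void attachFileNumber(Document doc, String fileNumber) {
        if (fileNumber != null) {
            doc.addField("file_number", new String[] { fileNumber });
        }
    }
}
```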

Thanks,
Karl


On Wed, Feb 22, 2017 at 7:29 AM, Wilhelm Eger <wi...@gmail.com>
wrote:

> Hi Karl,
>
> Thanks a lot for your help.
>
> My Datafari setup uses a file system crawler to crawl files (repository
> connector -> job), from which text is extracted via the Tika plugin. This
> is then sent to Solr via the Solr output connector.
>
> I am already using a transformation connector to add a field based on the
> name
> of the job (using the file system repository connector) to distinguish the
> origin of the indexed file later.
>
> Actually, I ended up at the same solution you presented me (but I did not
> mention it beforehand to spoil the answers): writing my own transformation
> connector to retrieve the information from the database. The connector
> should:
>
> - know the file name
> - compile a SQL statement from the file name
> - send this SQL statement to the database
> - retrieve the file number
> - add it to a certain field
>
> I know little to nothing about Java, but I am able to teach myself if
> necessary. Is there any starting point for developing my own
> transformation connector?
>
> Thanks in advance,
>
> Wilhelm
>
> On Wednesday, 22 February 2017 at 13:15:23 CET, Karl Wright wrote:
> > Hi Wilhelm,
> >
> > I don't know anything about how Datafari uses ManifoldCF to crawl.  All I
> > can do is describe how ManifoldCF works, and then maybe you can see how
> > it integrates with Datafari.
> >
> > MCF gets documents from a repository using one of many kinds of
> > repository connector.  It then can transform the document in many
> > different ways, before sending the (transformed) document to one of many
> > output connectors.  I gather that Datafari injects documents primarily
> > into Solr.
> >
> > Each job in MCF has its own "pipeline", which describes the flow of a
> > document through the system for that job.
> >
> > The transformations that are available in MCF include:
> >
> > - ability to extract metadata from the document (using Tika)
> > - ability to modify or add metadata properties (you specify this in the
> >   job UI)
> > - OpenNLP metadata extraction
> > - ability to filter out documents based on characteristics of the
> >   document
> >
> > Writing connectors is relatively straightforward and there are online
> > materials available to help you do this.  I can provide a link if you
> > need it.  Without any more information as to what exactly you are using
> > for a repository connector, and what that connector provides as part of
> > the document information, I can't really give you the best approach
> > here, but it may be possible to write a transformation connector that
> > would look up the information you want to add as metadata from your
> > database and include that in the document that gets sent to Solr.
> >
> > Please let us know how we can help.
> >
> > Thanks,
> > Karl
> >
> >
> > On Wed, Feb 22, 2017 at 7:01 AM, Wilhelm Eger <wi...@gmail.com>
> > wrote:
> > > Hi!
> > >
> > > I am using a setup of Datafari (www.datafari.com), which more or less
> > > combines a ManifoldCF file index with Solr as a search engine.
> > >
> > > My setup consists of ~350,000 files, mainly doc(x), xls(x), msg and
> > > pdf files.  PDF files are OCR'd externally before they are added to
> > > the ManifoldCF index.  Only the remaining image files (png, jpg) are
> > > OCR'd on the fly when being imported.
> > >
> > > The files are actually part of an external file management system
> > > (files in the literal sense, not entities saved on the hard disk),
> > > which is not related to ManifoldCF/Solr at all.  This system
> > > unfortunately does not provide a proper full-text search, hence I
> > > implemented one as outlined above.
> > >
> > > However, the users are used to certain file numbers provided by this
> > > file management system.  These file numbers are stored in an MSSQL
> > > database, which is accessible from the host my setup is running on.
> > > I can easily get the file number by sending a suitable SQL statement,
> > > based on the file name (of the entity saved on the hard disk), to the
> > > SQL Server.  Hence, for each file name there is a file number stored
> > > in the database.  I would like these file numbers to be stored in a
> > > specific field of the Solr index, to be shown in the (Tomcat) output,
> > > e.g.:
> > >
> > > File name: /data/1003234234.docx
> > > Content: "This is the content. You searched for _text_."
> > > File name belongs to file number: SUI-G-25-A
> > >
> > > Is there any possibility to achieve that?  Did I understand correctly
> > > that this could happen either in ManifoldCF during indexing or in
> > > Solr during importing?
> > >
> > > I know that there is a Tika plugin to talk to databases, which could
> > > be fed a SQL statement.  But how can it be connected with the data
> > > retrieved by the file crawler?
> > >
> > > Alternatively, I could also call an external script (bash, python) to
> > > retrieve the respective data from the database using bsqldb.
> > >
> > > Any hint in the right direction is very much appreciated.
> > >
> > > Thanks in advance,
> > >
> > > Wilhelm
>
>
>

Re: Additional information from external database

Posted by Wilhelm Eger <wi...@gmail.com>.
Hi Karl,

Thanks a lot for your help.

My Datafari setup uses a file system crawler to crawl files (repository
connector -> job), from which text is extracted via the Tika plugin. This is
then sent to Solr via the Solr output connector.

I am already using a transformation connector to add a field based on the name 
of the job (using the file system repository connector) to distinguish the 
origin of the indexed file later.

Actually, I ended up at the same solution you presented to me (I did not
mention it beforehand, so as not to bias the answers): writing my own
transformation connector to retrieve the information from the database. The
connector should:

- know the file name
- compile a SQL statement from the file name
- send this SQL statement to the database
- retrieve the file number
- add it to a certain field
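The steps above could be sketched roughly as follows. This is only an illustrative outline, not ManifoldCF connector code: a plain in-memory map stands in for the MSSQL lookup (which in the real connector would be a JDBC query, steps 2-4), and the "file_number" field name is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

class TransformSketch {

    // Stand-in for the MSSQL database: in the real connector this would
    // be a JDBC PreparedStatement against the file management system's DB.
    static final Map<String, String> FILE_NUMBERS = new HashMap<>();
    static {
        FILE_NUMBERS.put("1003234234.docx", "SUI-G-25-A");
    }

    // The five steps from the list above, in order.
    static Map<String, String[]> transform(String documentUri) {
        // 1. know the file name (derive it from the document URI)
        String fileName =
            documentUri.substring(documentUri.lastIndexOf('/') + 1);
        // 2.-4. compile the SQL statement, send it, retrieve the file
        //       number (simulated here by the map lookup)
        String fileNumber = FILE_NUMBERS.get(fileName);
        // 5. add it to a certain field
        Map<String, String[]> fields = new HashMap<>();
        if (fileNumber != null) {
            fields.put("file_number", new String[] { fileNumber });
        }
        return fields;
    }
}
```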

I know little to nothing about Java, but I am able to teach myself if
necessary. Is there any starting point for developing my own transformation
connector?

Thanks in advance,

Wilhelm

On Wednesday, 22 February 2017 at 13:15:23 CET, Karl Wright wrote:
> Hi Wilhelm,
> 
> I don't know anything about how datafari uses ManifoldCF to crawl.  All I
> can do is describe how ManifoldCF works, and then maybe you can see how it
> integrates with datafari.
> 
> MCF gets documents from a repository using one of many kinds of repository
> connector.  It then can transform the document in many different ways,
> before sending the (transformed) document to one of many output
> connectors.  I gather that datafari injects documents primarily into Solr.
> 
> Each job in MCF has its own "pipeline", which describes the flow of a
> document through the system for that job.
> 
> The transformations that are available in MCF include:
> 
> - ability to extract metadata from the document (using Tika)
> - ability to modify or add metadata properties (you specify this in the job
> UI)
> - OpenNLP metadata extraction
> - Filter out documents based on characteristics of the document
> 
> Writing connectors is relatively straightforward and there are online
> materials available to help you do this. I can provide a link, if you need
> it.  Without any more information as to what exactly you are using for a
> repository connector, and what that connector provides as part of the
> document information, I can't really give you the best approach here, but
> it may be possible to write a transformation connector that would look up
> the information you want to add as metadata from your database and include
> that in the document that gets sent to Solr.
> 
> Please let us know how we can help.
> 
> Thanks,
> Karl
> 
> 
> On Wed, Feb 22, 2017 at 7:01 AM, Wilhelm Eger <wi...@gmail.com>
> 
> wrote:
> > Hi!
> > 
> > I am using a setup of datafari (www.datafari.com), which more or less
> > combines
> > a ManifoldCF file index with SolR as a search engine.
> > 
> > My setup consists of ~350000 files, which are composed mainly of doc(x),
> > xls(x), msg and pdf files. pdf files are ocr'd externally before they are
> > added
> > to the ManifoldCF index. Only remaining image files (png, jpg) are ocr'd
> > on-
> > the-fly, when being imported.
> > 
> > The files are actually part of an external file management system (files
> > in the
> > literal meaning of files, not files in the meaning of entities saved on
> > the hard
> > disk), which is not related to ManifoldCF/SolR at all. This system
> > unfortunately does not provide a proper full text search, hence I
> > implemented
> > it as outlined above.
> > 
> > However, the users are used to certain file numbers provided by this file
> > management system. These file numbers are stored in a MSSQL database,
> > which is
> > accessible from the host my setup is running on. I can easily get the file
> > number by sending a respective SQL statement based on the file name (of
> > the
> > entity saved on the hard disk) to the SQL Server. Hence, for each file
> > name,
> > there is a file number stored in the database. I would like to have these
> > file
> > numbers to be stored in a specific field of the solr index to be shown by
> > the
> > (tomcat) output, e.g:
> > 
> > File name: /data/1003234234.docx
> > Content: "This is the content. You searched for _text_."
> > File name belongs to file number: SUI-G-25-A
> > 
> > Is there any possibility to achieve that? Did I understand it correctly
> > that
> > this could happen either in ManifoldCF during indexing or in SolR during
> > importing?
> > 
> > I know that there is a tika plugin to talk to databases, which could be
> > fed
> > with a SQL statement. But how to connect it with the data retrieved from
> > the
> > files crawler?
> > 
> > Alternatively, I could also call an external script (bash, python) to
> > retrieve
> > the respective data from the database using bsqldb.
> > 
> > Any hint in the right direction is very much appreciated.
> > 
> > Thanks in advance,
> > 
> > Wilhelm



Re: Additional information from external database

Posted by Karl Wright <da...@gmail.com>.
Hi Wilhelm,

I don't know anything about how Datafari uses ManifoldCF to crawl.  All I
can do is describe how ManifoldCF works, and then maybe you can see how it
integrates with Datafari.

MCF gets documents from a repository using one of many kinds of repository
connector.  It then can transform the document in many different ways,
before sending the (transformed) document to one of many output
connectors.  I gather that Datafari injects documents primarily into Solr.

Each job in MCF has its own "pipeline", which describes the flow of a
document through the system for that job.

The transformations that are available in MCF include:

- ability to extract metadata from the document (using Tika)
- ability to modify or add metadata properties (you specify this in the
  job UI)
- OpenNLP metadata extraction
- ability to filter out documents based on characteristics of the document
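As a toy model of the pipeline idea described above (not the actual MCF interfaces, which live in the org.apache.manifoldcf packages), a job's pipeline can be pictured as a list of transformations applied to a document in order before the output connector receives it. A plain String stands in for the document here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class PipelineSketch {

    // Apply each configured transformation to the document in order,
    // mirroring how an MCF job pipeline hands a document from one
    // transformation connector to the next before output.
    static String runPipeline(String document,
                              List<UnaryOperator<String>> transformations) {
        for (UnaryOperator<String> t : transformations) {
            document = t.apply(document);
        }
        return document;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> pipeline = new ArrayList<>();
        pipeline.add(doc -> doc + " [metadata extracted]"); // e.g. Tika
        pipeline.add(doc -> doc + " [field added]");        // e.g. metadata adjuster
        // prints: raw document [metadata extracted] [field added]
        System.out.println(runPipeline("raw document", pipeline));
    }
}
```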

Writing connectors is relatively straightforward and there are online
materials available to help you do this. I can provide a link, if you need
it.  Without any more information as to what exactly you are using for a
repository connector, and what that connector provides as part of the
document information, I can't really give you the best approach here, but
it may be possible to write a transformation connector that would look up
the information you want to add as metadata from your database and include
that in the document that gets sent to Solr.

Please let us know how we can help.

Thanks,
Karl


On Wed, Feb 22, 2017 at 7:01 AM, Wilhelm Eger <wi...@gmail.com>
wrote:

> Hi!
>
> I am using a setup of datafari (www.datafari.com), which more or less
> combines
> a ManifoldCF file index with SolR as a search engine.
>
> My setup consists of ~350000 files, which are composed mainly of doc(x),
> xls(x), msg and pdf files. pdf files are ocr'd externally before they are
> added
> to the ManifoldCF index. Only remaining image files (png, jpg) are ocr'd
> on-
> the-fly, when being imported.
>
> The files are actually part of an external file management system (files
> in the
> literal meaning of files, not files in the meaning of entities saved on
> the hard
> disk), which is not related to ManifoldCF/SolR at all. This system
> unfortunately does not provide a proper full text search, hence I
> implemented
> it as outlined above.
>
> However, the users are used to certain file numbers provided by this file
> management system. These file numbers are stored in a MSSQL database,
> which is
> accessible from the host my setup is running on. I can easily get the file
> number by sending a respective SQL statement based on the file name (of the
> entity saved on the hard disk) to the SQL Server. Hence, for each file
> name,
> there is a file number stored in the database. I would like to have these
> file
> numbers to be stored in a specific field of the solr index to be shown by
> the
> (tomcat) output, e.g:
>
> File name: /data/1003234234.docx
> Content: "This is the content. You searched for _text_."
> File name belongs to file number: SUI-G-25-A
>
> Is there any possibility to achieve that? Did I understand it correctly
> that
> this could happen either in ManifoldCF during indexing or in SolR during
> importing?
>
> I know that there is a tika plugin to talk to databases, which could be fed
> with a SQL statement. But how to connect it with the data retrieved from
> the
> files crawler?
>
> Alternatively, I could also call an external script (bash, python) to
> retrieve
> the respective data from the database using bsqldb.
>
> Any hint in the right direction is very much appreciated.
>
> Thanks in advance,
>
> Wilhelm
>
>