Posted to user@manifoldcf.apache.org by Shinichiro Abe <sh...@gmail.com> on 2011/07/29 10:35:42 UTC

Data query of JDBC repo

Hello.

I am using a JDBC repository connection and created
the following view [1] in PostgreSQL.
I left the default settings on the job's Queries tab.
When I ran the job, only urlfield was indexed in Solr, as the id field.

1) I also want to index datafield. What do I need to set?
Can I use it like the Solr DataImportHandler?
For example, can it index datafield1, datafield2, datafield…?

2) Why does the source code need to check whether the URL is valid before ingesting datafield?
I want to index datafield without urlfield.

My usage may be wrong; I assumed that the string data in datafield would be indexed as the content.
I would like to know what kind of table the data query assumes.

[1] view: documenttable
| idfield      | versionfield | urlfield        | datafield    | modifydatefield |
| char varying | char varying | char varying    | char varying | bigint          |
----------------------------------------------------------------------------------
| 1            | 1            | file:///dummy/1 | test string  | 1               |
| 2            | 1            | file:///dummy/2 | test info    | 1               |

Thank you,
Shinichiro Abe

Re: Data query of JDBC repo

Posted by Shinichiro Abe <sh...@gmail.com>.

Thank you. Indexing VARCHAR data worked well. My solrconfig setting was incorrect.

Shinichiro

On 2011/07/29, at 19:06, Karl Wright wrote:

> Oh, FWIW, content data of type VARCHAR should also work.
> Karl

Re: Data query of JDBC repo

Posted by Karl Wright <da...@gmail.com>.

Oh, FWIW, content data of type VARCHAR should also work.
Karl

Re: Data query of JDBC repo

Posted by Karl Wright <da...@gmail.com>.

I believe the end-user documentation talks about this to some extent.
Nevertheless, the JDBC handler is designed to pull all the necessary
information for a document, including the content data, out of a
single database table.  So it presumes the content is stored as either
CLOB data or BLOB data in one column of the table.
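
To make the single-table assumption concrete, here is a minimal sketch in Python using an in-memory SQLite database instead of PostgreSQL. It recreates the documenttable view from the original question and pulls the id, URL, and content of each document from that one table. The function names are hypothetical and this is not the connector's own code. As far as I can tell, the end-user documentation shows the data query with placeholders such as $(IDCOLUMN), $(URLCOLUMN), $(DATACOLUMN) and $(IDLIST); treat those names as an assumption and check them against your version.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table shaped like the documenttable view in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documenttable ("
        " idfield TEXT PRIMARY KEY,"
        " versionfield TEXT,"
        " urlfield TEXT,"
        " datafield TEXT,"
        " modifydatefield INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documenttable VALUES (?, ?, ?, ?, ?)",
        [
            ("1", "1", "file:///dummy/1", "test string", 1),
            ("2", "1", "file:///dummy/2", "test info", 1),
        ],
    )
    return conn


def fetch_documents(conn, ids):
    """Return (id, url, content) for the given ids, all read from one table."""
    if not ids:
        return []
    marks = ", ".join("?" for _ in ids)
    sql = (
        "SELECT idfield, urlfield, datafield FROM documenttable "
        f"WHERE idfield IN ({marks}) ORDER BY idfield"
    )
    return conn.execute(sql, list(ids)).fetchall()


conn = build_demo_db()
docs = fetch_documents(conn, ["1", "2"])
for doc_id, url, content in docs:
    print(doc_id, url, content)
```

The point of the sketch is only that one row carries everything needed for one document: the id, the URL, and the content column.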

The url field is necessary because that is what ManifoldCF uses for
the "id" in the target search engine.  It needs this to be able to
remove or replace the document in the target on subsequent job runs.
It might as well be a URL because it presumes that the search user
will need some way to get to the content of the indexed document.
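
The role of the URL as the target-side id can be illustrated with a small Python sketch. It models the search index as a dictionary keyed by URL and shows how a stable key lets a later run replace changed documents and remove ones that no longer exist. This is an assumption-laden illustration of the idea only; sync_index and its arguments are hypothetical and do not reflect how ManifoldCF or Solr are actually implemented.

```python
def sync_index(index, versions, rows):
    """Bring a toy index (url -> content) in line with the current rows.

    index:    dict mapping url to indexed content
    versions: dict mapping url to the version string last indexed
    rows:     iterable of (url, version, content) from the current crawl
    """
    seen = set()
    for url, version, content in rows:
        seen.add(url)
        # Add new documents and replace ones whose version changed.
        if versions.get(url) != version:
            index[url] = content
            versions[url] = version
    # Remove documents that no longer appear in the source.
    for url in list(index):
        if url not in seen:
            del index[url]
            versions.pop(url, None)
    return index


index, versions = {}, {}

# First job run: both rows from the example view are indexed.
sync_index(index, versions, [
    ("file:///dummy/1", "1", "test string"),
    ("file:///dummy/2", "1", "test info"),
])
first_run = dict(index)

# Second job run: row 1 has a new version, row 2 has been deleted.
sync_index(index, versions, [
    ("file:///dummy/1", "2", "updated string"),
])
```

Without a stable key such as the URL, the second run would have no way to tell which indexed entry to overwrite or delete.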

Hope that answers your question.

Karl
