You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@manifoldcf.apache.org by Jens Jahnke <je...@wegtam.com> on 2014/09/19 09:26:42 UTC

Indexing JDBC data

Hi,

I'm new to manifold and I try to index some stuff from a mysql db via
jdbc into an elasticsearch index.

So far I've used the simple mapping example from the user docs for
elasticsearch and I've the following query for data collection:

SELECT id AS $(IDCOLUMN),
CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
CONCAT(name, " ", description, " ", what_ever) AS $(DATACOLUMN)
FROM accounts WHERE id IN $(IDLIST)

If I run the indexing job the data is fetched from the db and stored
into elasticsearch. But I've noticed 2 things:

1. The actual content field in the mapping is base64 encoded and
therefore not searchable?

2. If I do a base64decode on the content field then I see that it the
value from URLCOLUMN but not the one from DATACOLUMN.

Can anyone shed some light on this?

Regards,

Jens

-- 
19. Scheiding 2014, 09:15
Homepage : http://www.wegtam.com

Integrity has no need for rules.

Re: Indexing JDBC data

Posted by Karl Wright <da...@gmail.com>.
I've looked at the ElasticSearch connector code.  Bear in mind that this
connector is a contribution from folks who know more about ElasticSearch
than I do. The base64 encoding is apparently part of the design, because it
is presumed that transmission of binary documents to ES is necessary:

>>>>>>
        pw.print("{");
        Iterator<String> i = document.getFields();
        boolean needComma = false;
        while (i.hasNext()){
          String fieldName = i.next();
          String[] fieldValues = document.getFieldAsStrings(fieldName);
          needComma = writeField(pw, needComma, fieldName, fieldValues);
        }

        needComma = writeACLs(pw, needComma, "document", acls, denyAcls);
        needComma = writeACLs(pw, needComma, "share", shareAcls,
shareDenyAcls);
        needComma = writeACLs(pw, needComma, "parent", parentAcls,
parentDenyAcls);

        if(inputStream!=null){
          if(needComma){
            pw.print(",");
          }
          // I'm told this is not necessary: see CONNECTORS-690
          //pw.print("\"type\" : \"attachment\",");
          pw.print("\"file\" : {");
          String contentType = document.getMimeType();
          if (contentType != null)
            pw.print("\"_content_type\" :
"+jsonStringEscape(contentType)+",");
          String fileName = document.getFileName();
          if (fileName != null)
            pw.print("\"_name\" : "+jsonStringEscape(fileName)+",");
          pw.print(" \"content\" : \"");
          Base64 base64 = new Base64();
          base64.encodeStream(inputStream, pw);
          pw.print("\"}");
        }

        pw.print("}");
<<<<<<

If you think this is incorrect, please let me know.  We *could*, for
example, require that the ES connector only be handed text documents, and
thus all extraction of binary would have to be done with a Tika
transformation connection in the pipeline.

I don't see any way, though, that this connector can confuse one field with
another.  I'd go back to your original query to see if that's possible at
that level.

Thanks!
Karl


On Fri, Sep 19, 2014 at 4:43 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Jens,
>
> The queries look correct.
> If you try indexing some small amount of file content through the same
> output connection (using, say, the Filesystem connector), do you see the
> same thing?  I would bet so; if that's the case, something is clearly wrong
> with either how your elasticsearch connector is configured, or there's a
> bug.
>
> Karl
>
>
> On Fri, Sep 19, 2014 at 3:26 AM, Jens Jahnke <je...@wegtam.com> wrote:
>
>> Hi,
>>
>> I'm new to manifold and I try to index some stuff from a mysql db via
>> jdbc into an elasticsearch index.
>>
>> So far I've used the simple mapping example from the user docs for
>> elasticsearch and I've the following query for data collection:
>>
>> SELECT id AS $(IDCOLUMN),
>> CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
>> CONCAT(name, " ", description, " ", what_ever) AS $(DATACOLUMN)
>> FROM accounts WHERE id IN $(IDLIST)
>>
>> If I run the indexing job the data is fetched from the db and stored
>> into elasticsearch. But I've noticed 2 things:
>>
>> 1. The actual content field in the mapping is base64 encoded and
>> therefore not searchable?
>>
>> 2. If I do a base64decode on the content field then I see that it the
>> value from URLCOLUMN but not the one from DATACOLUMN.
>>
>> Can anyone shed some light on this?
>>
>> Regards,
>>
>> Jens
>>
>> --
>> 19. Scheiding 2014, 09:15
>> Homepage : http://www.wegtam.com
>>
>> Integrity has no need for rules.
>>
>
>

Re: Indexing JDBC data

Posted by Karl Wright <da...@gmail.com>.
Hi Jens,

Yes, this kind of thing is definitely worth a few sentences in the end-user
documentation.  If you'd like to create a ticket for that in Jira (
https://issues.apache.org/jira), and even better, attach a short patch, I'd
be very happy to include it.

Thanks, and I'm glad you figured it out!

Karl


On Fri, Sep 19, 2014 at 7:18 AM, Jens Jahnke <je...@wegtam.com> wrote:

> Hi Karl,
>
> On Fri, 19 Sep 2014 06:23:36 -0400
> Karl Wright <da...@gmail.com> wrote:
>
> KW> Just to be sure, can you run the same query in the mysql command-line
> KW> interface, and make sure it returns what is expected?  For the id list,
> KW> substitute a comma-separated list of id column values, and for the $
> data
> KW> columns, substitute names of your choice.
> KW>
> KW> After that, we can instrument the output of the JDBC calls in the
> KW> connector, if needed, to help put together a MySQL bug ticket.
>
> I have executed the query on the command line and the data column was
> NULL. After a bit of testing I saw that this is related to the nature
> of CONCAT. If one of the columns used for the concat contains a null
> value then the whole concat returns NULL.
>
> mysql> SELECT CONCAT('foo', 'bar');
> +----------------------+
> | CONCAT('foo', 'bar') |
> +----------------------+
> | foobar               |
> +----------------------+
> 1 row in set (0.02 sec)
>
> mysql> SELECT CONCAT('foo', NULL);
> +---------------------+
> | CONCAT('foo', NULL) |
> +---------------------+
> | NULL                |
> +---------------------+
> 1 row in set (0.02 sec)
>
> Currently I'm working around that issue using IFNULL for columns that
> may be null:
>
> SELECT id AS $(IDCOLUMN),
> CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
> CONCAT(name, " ", IFNULL(description, ""), " ", IFNULL(what_ever, ""))
> AS $(DATACOLUMN)
> FROM accounts WHERE id IN $(IDLIST)
>
> Maybe this is worth an entry in the user documentation? :-)
>
> Regards,
>
> Jens
>
> --
> 19. Scheiding 2014, 13:11
> Homepage : http://www.wegtam.com
>
> Play Rogue, visit exotic locations, meet strange creatures and kill
> them.
>

Re: Indexing JDBC data

Posted by Jens Jahnke <je...@wegtam.com>.
Hi Karl,

On Fri, 19 Sep 2014 06:23:36 -0400
Karl Wright <da...@gmail.com> wrote:

KW> Just to be sure, can you run the same query in the mysql command-line
KW> interface, and make sure it returns what is expected?  For the id list,
KW> substitute a comma-separated list of id column values, and for the $ data
KW> columns, substitute names of your choice.
KW> 
KW> After that, we can instrument the output of the JDBC calls in the
KW> connector, if needed, to help put together a MySQL bug ticket.

I have executed the query on the command line and the data column was
NULL. After a bit of testing I saw that this is related to the nature
of CONCAT. If one of the columns used for the concat contains a null
value then the whole concat returns NULL.

mysql> SELECT CONCAT('foo', 'bar');
+----------------------+
| CONCAT('foo', 'bar') |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.02 sec)

mysql> SELECT CONCAT('foo', NULL);
+---------------------+
| CONCAT('foo', NULL) |
+---------------------+
| NULL                |
+---------------------+
1 row in set (0.02 sec)

Currently I'm working around that issue using IFNULL for columns that
may be null:

SELECT id AS $(IDCOLUMN),
CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
CONCAT(name, " ", IFNULL(description, ""), " ", IFNULL(what_ever, ""))
AS $(DATACOLUMN)
FROM accounts WHERE id IN $(IDLIST)

Maybe this is worth an entry in the user documentation? :-)

Regards,

Jens

-- 
19. Scheiding 2014, 13:11
Homepage : http://www.wegtam.com

Play Rogue, visit exotic locations, meet strange creatures and kill
them.

Re: Indexing JDBC data

Posted by Karl Wright <da...@gmail.com>.
Hmm, then it's conceivable we may be seeing a longstanding bug in MySQL's
JDBC driver.

Just to be sure, can you run the same query in the mysql command-line
interface, and make sure it returns what is expected?  For the id list,
substitute a comma-separated list of id column values, and for the $ data
columns, substitute names of your choice.

After that, we can instrument the output of the JDBC calls in the
connector, if needed, to help put together a MySQL bug ticket.

Karl




On Fri, Sep 19, 2014 at 6:14 AM, Jens Jahnke <je...@wegtam.com> wrote:

> On Fri, 19 Sep 2014 11:19:39 +0200
> Jens Jahnke <je...@wegtam.com> wrote:
>
> JJ> That leaves the version, I'll try to use the current version (5.1.32)
> JJ> of the mysql connector and will report back here.
>
> Using the current mysql connector version doesn't affect the
> outcome. :-(
>
> --
> 19. Scheiding 2014, 12:07
> Homepage : http://www.wegtam.com
>
> Your love life will be happy and harmonious.
>

Re: Indexing JDBC data

Posted by Jens Jahnke <je...@wegtam.com>.
On Fri, 19 Sep 2014 11:19:39 +0200
Jens Jahnke <je...@wegtam.com> wrote:

JJ> That leaves the version, I'll try to use the current version (5.1.32)
JJ> of the mysql connector and will report back here.

Using the current mysql connector version doesn't affect the
outcome. :-(

-- 
19. Scheiding 2014, 12:07
Homepage : http://www.wegtam.com

Your love life will be happy and harmonious.

Re: Indexing JDBC data

Posted by Jens Jahnke <je...@wegtam.com>.
Hi Karl,

On Fri, 19 Sep 2014 05:10:25 -0400
Karl Wright <da...@gmail.com> wrote:

KW> So that leaves more unusual explanations.  For the MySQL JDBC connection,
KW> have you selected "by label"?  That's required for MySQL; without it the
KW> column mappings may well be strange.  Also, please make sure that the
KW> version of the MySQL JDBC driver you are downloading is the correct one for
KW> your MySQL version.  If you use "ant make-deps" for that, please check the
KW> version specified in the build.xml file (I think it's 5.1.18, but have a
KW> look).

I'm using "by label". If that is not enabled I get a nice error message
when I try to run the job explaining I should enable that for mysql
connections.

That leaves the version, I'll try to use the current version (5.1.32)
of the mysql connector and will report back here.

KW> One more thing: does the URL that ES sees look right?  or is it something
KW> wacky too?

They look absolutely fine.

Regards,

Jens

-- 
19. Scheiding 2014, 11:12
Homepage : http://www.wegtam.com

The most disagreeable thing that your worst enemy says to your face does
not approach what your best friends say behind your back.
		-- Alfred De Musset

Re: Indexing JDBC data

Posted by Karl Wright <da...@gmail.com>.
Hi Jens,

I've had a look at the JDBC connector code, and can see no way it by itself
can get this messed up.

The column names are distinct:

>>>>>>
  /** The name of the url return column */
  public static String urlReturnColumnName = "lcf__url";
  /** The name of the data return column */
  public static String dataReturnColumnName = "lcf__data";
<<<<<<

The content result uses the right column:

>>>>>>
                Object contents =
row.getValue(JDBCConstants.dataReturnColumnName);
<<<<<<

So that leaves more unusual explanations.  For the MySQL JDBC connection,
have you selected "by label"?  That's required for MySQL; without it the
column mappings may well be strange.  Also, please make sure that the
version of the MySQL JDBC driver you are downloading is the correct one for
your MySQL version.  If you use "ant make-deps" for that, please check the
version specified in the build.xml file (I think it's 5.1.18, but have a
look).

One more thing: does the URL that ES sees look right?  or is it something
wacky too?

Thanks,
Karl


On Fri, Sep 19, 2014 at 4:55 AM, Jens Jahnke <je...@wegtam.com> wrote:

> Hi Karl,
>
> On Fri, 19 Sep 2014 04:43:02 -0400
> Karl Wright <da...@gmail.com> wrote:
>
> KW> If you try indexing some small amount of file content through the same
> KW> output connection (using, say, the Filesystem connector), do you see
> the
> KW> same thing?  I would bet so; if that's the case, something is clearly
> wrong
> KW> with either how your elasticsearch connector is configured, or there's
> a
> KW> bug.
>
> I've indexed a small windows share through the same output connector
> and they are indexed correctly.
>
> The content is also base64 encoded but I found out that it is
> searchable. Searching the jbcd indexed data for strings from the url
> shows matches which makes sense because the url column is indexed
> instead of the data column.
>
> But that still leaves me wondering why the url column is used as the
> indexed content for the jdbc connection. :-?
>
> Regards,
>
> Jens
>
> --
> 19. Scheiding 2014, 10:52
> Homepage : http://www.wegtam.com
>
> The way I understand it, the Russians are sort of a combination of evil
> and incompetence... sort of like the Post Office with tanks.
>                 -- Emo Philips
>

Re: Indexing JDBC data

Posted by Jens Jahnke <je...@wegtam.com>.
Hi Karl,

On Fri, 19 Sep 2014 04:43:02 -0400
Karl Wright <da...@gmail.com> wrote:

KW> If you try indexing some small amount of file content through the same
KW> output connection (using, say, the Filesystem connector), do you see the
KW> same thing?  I would bet so; if that's the case, something is clearly wrong
KW> with either how your elasticsearch connector is configured, or there's a
KW> bug.

I've indexed a small windows share through the same output connector
and they are indexed correctly.

The content is also base64 encoded but I found out that it is
searchable. Searching the jbcd indexed data for strings from the url
shows matches which makes sense because the url column is indexed
instead of the data column.

But that still leaves me wondering why the url column is used as the
indexed content for the jdbc connection. :-?

Regards,

Jens

-- 
19. Scheiding 2014, 10:52
Homepage : http://www.wegtam.com

The way I understand it, the Russians are sort of a combination of evil
and incompetence... sort of like the Post Office with tanks.
		-- Emo Philips

Re: Indexing JDBC data

Posted by Karl Wright <da...@gmail.com>.
Hi Jens,

The queries look correct.
If you try indexing some small amount of file content through the same
output connection (using, say, the Filesystem connector), do you see the
same thing?  I would bet so; if that's the case, something is clearly wrong
with either how your elasticsearch connector is configured, or there's a
bug.

Karl


On Fri, Sep 19, 2014 at 3:26 AM, Jens Jahnke <je...@wegtam.com> wrote:

> Hi,
>
> I'm new to manifold and I try to index some stuff from a mysql db via
> jdbc into an elasticsearch index.
>
> So far I've used the simple mapping example from the user docs for
> elasticsearch and I've the following query for data collection:
>
> SELECT id AS $(IDCOLUMN),
> CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
> CONCAT(name, " ", description, " ", what_ever) AS $(DATACOLUMN)
> FROM accounts WHERE id IN $(IDLIST)
>
> If I run the indexing job the data is fetched from the db and stored
> into elasticsearch. But I've noticed 2 things:
>
> 1. The actual content field in the mapping is base64 encoded and
> therefore not searchable?
>
> 2. If I do a base64decode on the content field then I see that it the
> value from URLCOLUMN but not the one from DATACOLUMN.
>
> Can anyone shed some light on this?
>
> Regards,
>
> Jens
>
> --
> 19. Scheiding 2014, 09:15
> Homepage : http://www.wegtam.com
>
> Integrity has no need for rules.
>