Posted to user@manifoldcf.apache.org by "Nichols, Richard" <Ri...@tellabs.com> on 2013/06/07 18:43:13 UTC

ElasticSearch Oddities

Karl,

Now that we have MCF sending documents to ES so that they are properly being scanned, I'm finding a couple of oddities.

I'm using the JDBC connector to feed ES, where the main 'document' (identified by the $(DATACOLUMN) variable) is in XML.  Therefore, I set the $(CONTENTTYPE) column to 'application/xml'.   Generally, this works.  But...


1)      I didn't set the "Allowed MIME Types" on the ES tab in the job to allow "application/xml".  I was expecting to have all of the rows filtered out.  That didn't happen.  All rows returned were indexed by ES anyway.

2)      Some of the columns (which are of type nvarchar) have embedded linefeed and/or return characters in them (e.g. multi-line addresses).  These are getting flagged as JSON errors by ES (as containing an 'unescaped character').  I see that ElasticSearchIndex::jsonStringEscape() doesn't deal with non-printable characters.  Should it?

Regards,
Rick

Richard D. Nichols
Staff Engineer
Tellabs, Inc.
18583 N. Dallas Parkway
Dallas, TX  75287
Office: (972) 588-6942
richard.nichols@tellabs.com



============================================================
The information contained in this message may be privileged
and confidential and protected from disclosure. If the reader
of this message is not the intended recipient, or an employee
or agent responsible for delivering this message to the
intended recipient, you are hereby notified that any reproduction,
dissemination or distribution of this communication is strictly
prohibited. If you have received this communication in error,
please notify us immediately by replying to the message and
deleting it from your computer. Thank you. Tellabs
============================================================

Re: ElasticSearch Oddities

Posted by Karl Wright <da...@gmail.com>.
Fixes for both of these have been checked into trunk.
Karl



Re: ElasticSearch Oddities

Posted by Karl Wright <da...@gmail.com>.
CONNECTORS-707 and CONNECTORS-708.

Karl




Re: ElasticSearch Oddities

Posted by Karl Wright <da...@gmail.com>.
>>>>>>
1)      I didn’t set the “Allowed MIME Types” on the ES tab in the job to
allow “application/xml”.  I was expecting to have all of the rows filtered
out.  That didn’t happen.  All rows returned were indexed by ES anyway.
<<<<<<

That's probably because the JDBC connector does not call the appropriate
method to check whether the mimetype will be accepted by the output
connector or not.  It's up to the repository connector to make that check,
and doing so is optional.  But this is worth creating a ticket for, I think.
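
For illustration only, here is a minimal Java sketch of the kind of check described above: before handing a row to the output side, the repository side asks whether the row's MIME type would be accepted. Every name in it (MimeTypeFilterSketch, OutputMimeTypeCheck, Row, allowing, idsToIndex) is a hypothetical stand-in and not a ManifoldCF class or method; the real connector interfaces are not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: none of these types are ManifoldCF classes.
final class MimeTypeFilterSketch {
    private MimeTypeFilterSketch() {}

    /** Hypothetical stand-in for asking the output connector whether it accepts a MIME type. */
    interface OutputMimeTypeCheck {
        boolean isIndexable(String mimeType);
    }

    /** Hypothetical result row: a document id plus the value of its content-type column. */
    static final class Row {
        final String id;
        final String contentType;

        Row(String id, String contentType) {
            this.id = id;
            this.contentType = contentType;
        }
    }

    /** Builds a check that accepts only the listed MIME types, ignoring case. */
    static OutputMimeTypeCheck allowing(String... allowedMimeTypes) {
        final Set<String> allowed = new HashSet<>();
        for (String mimeType : allowedMimeTypes) {
            allowed.add(mimeType.toLowerCase(Locale.ROOT));
        }
        return mimeType -> mimeType != null
            && allowed.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    /** Returns the ids of the rows whose content type the output side accepts. */
    static List<String> idsToIndex(List<Row> rows, OutputMimeTypeCheck check) {
        List<String> ids = new ArrayList<>();
        for (Row row : rows) {
            // Without this check, every row is passed on regardless of its type.
            if (check.isIndexable(row.contentType)) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("doc-1", "application/xml"));
        rows.add(new Row("doc-2", "application/xml"));
        // "application/xml" is not in the allowed list, so nothing is selected.
        System.out.println(idsToIndex(rows, allowing("text/plain")));
        // Once it is allowed, both rows are selected.
        System.out.println(idsToIndex(rows, allowing("application/xml")));
    }
}
```

The point of the sketch is only that the filtering happens where the rows are produced; if that call is skipped, the allowed-types setting on the output side never gets a chance to reject anything.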

>>>>>>
2)      Some of the columns (which are of type nvarchar) have embedded
linefeed and/or return characters in them (e.g. multi-line addresses).
These are getting flagged as JSON errors by ES (as containing an ‘unescaped
character’).  I see that ElasticSearchIndex::jsonStringEscape() doesn’t
deal with non-printable characters.  Should it?

<<<<<<


Yes.  This one definitely should have a ticket.
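
As a rough sketch of what escaping the non-printable characters could look like, the Java below follows the JSON string grammar: the quote and backslash get a backslash escape, the common control characters get their short forms, and any other character below U+0020 is written as a \uXXXX escape. The class and method names (JsonEscapeSketch, toJsonStringLiteral) are hypothetical and this is not the code of ElasticSearchIndex::jsonStringEscape() or of the fix that was checked in.

```java
// Illustrative sketch only: not the ManifoldCF implementation.
final class JsonEscapeSketch {
    private JsonEscapeSketch() {}

    /** Returns the value as a double-quoted JSON string literal with control characters escaped. */
    static String toJsonStringLiteral(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters have no short form in JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // A multi-line address such as the ones that triggered the error.
        String address = "18583 N. Dallas Parkway\r\nDallas, TX  75287";
        System.out.println(toJsonStringLiteral(address));
    }
}
```

With escaping along these lines, a value containing a carriage return and linefeed is sent as the two-character sequences \r and \n inside the JSON string rather than as raw control characters.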


Karl



