Posted to user@manifoldcf.apache.org by Juan Pablo Diaz-Vaz <jp...@mcplusa.com> on 2016/02/08 19:28:04 UTC

Amazon CloudSearch Connector question

Hi,

I've successfully sent data to FileSystems and SOLR, but for Amazon
CloudSearch I'm seeing that only empty messages are being sent to my
domain. I think this may be an issue with how I've set up the TIKA Extractor
Transformation or the field mapping. I think the database where the records
are supposed to be stored before flushing to Amazon is storing empty
content.

I've tried to find documentation on how to set up the TIKA Transformation,
but I haven't been able to find any.

If someone could provide an example of a job setup to send from a
FileSystem to CloudSearch, that'd be great!

Thanks in advance,

-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Sure; please blow away the database instance first, and then you should be
all set.

Karl


Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
I'm using the quick start; I'll try a fresh start.

-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Hi Juan,

It occurs to me that you may have records in the document chunk table that
were corrupted by the earlier version of the connector, and that is what is
being sent.  Are you using the quick-start example, or Postgres?  If
Postgres, I'd recommend just deleting all rows in the document chunk table.
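A minimal JDBC sketch of that cleanup (the table name below is only a
placeholder; check the actual chunk table name in your schema before
running anything):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ClearChunkTable {
  public static void main(String[] args) throws Exception {
    // Requires the PostgreSQL JDBC driver on the classpath.
    // Adjust the URL, credentials, and table name for your installation;
    // "documentchunks" is only a placeholder for the connector's chunk table.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/manifoldcf", "mcfuser", "password");
         Statement stmt = conn.createStatement()) {
      int removed = stmt.executeUpdate("DELETE FROM documentchunks");
      System.out.println("Deleted " + removed + " chunk rows");
    }
  }
}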

Karl


Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
This is a puzzle; the only way this could occur is if some of the records
being produced generated absolutely no JSON.  Since there is an ID and a
type record for all of them I can't see how this could happen.  So we must
be adding records for documents that don't exist somehow?  I'll have to
look into it.

Karl


Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
Hi,

The patch worked, and now at least the POST has content. Amazon is
responding with a Parsing Error, though.

I logged the message before it gets posted to Amazon, and it's not valid
JSON: it has extra commas and parenthesis characters where records are
concatenated. I don't know if this is an issue with my setup or with the
JSONArrayReader.

[{
"id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
"type": "add",
"fields": {
 <record fields>
}
}, , {
"id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
"type": "add",
"fields": {
 <record fields>
}
}, , , , , , , , , , , , , , , , {
"id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
"type": "add",
"fields": {
 <record fields>
}
}, , , ]
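
Those repeated commas look like what you would get if chunks that produced
no JSON were still being joined with separators. Just to illustrate the
idea (this is only a sketch, not the connector's code), skipping empty
chunks while assembling the batch avoids the stray commas:

import java.util.Arrays;
import java.util.List;

public class BatchBuilderSketch {
  // Joins per-document JSON chunks into one CloudSearch batch array,
  // skipping chunks that produced no JSON so no stray "," separators appear.
  static String buildBatch(List<String> chunks) {
    StringBuilder batch = new StringBuilder("[");
    boolean first = true;
    for (String chunk : chunks) {
      if (chunk == null || chunk.trim().isEmpty()) {
        continue; // an empty chunk would otherwise leave a bare "," behind
      }
      if (!first) {
        batch.append(",");
      }
      batch.append(chunk);
      first = false;
    }
    return batch.append("]").toString();
  }

  public static void main(String[] args) {
    List<String> chunks = Arrays.asList(
        "{\"id\": \"doc1\", \"type\": \"add\"}",
        "",  // a document that generated no JSON
        "{\"id\": \"doc2\", \"type\": \"add\"}");
    // Prints the two non-empty chunks joined by a single comma inside [ ... ]
    System.out.println(buildBatch(chunks));
  }
}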

Thanks,


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
Thanks! I'll apply it and let you know how it goes.


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Ok, I have a patch.  It's actually pretty tiny; the bug is in our code, not
Commons-IO, but a change in Commons-IO exposed it.

I've created a ticket (CONNECTORS-1271) and attached the patch to it.

Thanks!
Karl


On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <da...@gmail.com> wrote:

> I have chased this down to a completely broken Apache Commons-IO library.
> It no longer works with the JSONReader objects in ManifoldCF at all, and
> refuses to read anything from them.  Unfortunately I can't change versions
> of that library because other things depend upon it. So I'll need to write
> my own code to replace its functionality.  That will take some amount of
> time to do.
>
> This probably happened the last time our dependencies were updated.  My
> apologies.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Thanks,
>>
>> Don't know if it'll help, but removing the usage of JSONObjectReader on
>> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk
>> instead of using the JSONArrayReader on flushDocuments, changed the error I
>> was getting from Amazon.
>>
>> Maybe those objects are failing on parsing the content to JSON.
>>
>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>> through.  I'll have to open a ticket and create a patch when I find the
>>> problem.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Thank you very much.
>>>>
>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <da...@gmail.com> wrote:
>>>>
>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>> unhappy about the JSON format we are sending it.  The deprecation message
>>>>> is probably a strong clue.  I'll experiment here with logging document
>>>>> contents so that I can give you further advice.  Stay tuned.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>
>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>>>> fails when sending the request to amazon cloudsearch:
>>>>>>
>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>>>> file\"] }", "deletes": 0}'
>>>>>>
>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>>>> the Amazon end, that would be great.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> More likely this is a bug.
>>>>>>>>
>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>>> seeing the expected content being posted to SOLR, so it may not be an issue
>>>>>>>>> with TIKA. After adding some more logging to the CloudSearch connector, I
>>>>>>>>> think the data is getting lost just before passing it to the
>>>>>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Juan,
>>>>>>>>>>
>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>>>> solr output connection.  If you include the tika extractor in the pipeline,
>>>>>>>>>> you will want to configure the solr connection to not use the extracting
>>>>>>>>>> update handler.  There's a checkbox on the Schema tab you need to uncheck
>>>>>>>>>> for that.  But if you do that you can see what is being sent to Solr pretty
>>>>>>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>>>>>
>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>>>>>>> content.
>>>>>>>>>>>
>>>>>>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>>
>>>>>>>>>>> If someone could provide an example of a job setup to send from
>>>>>>>>>>> a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>> +56 9 84265890
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>> Full Stack Developer - MC+A Chile
>>>>>> +56 9 84265890
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
I have chased this down to a completely broken Apache Commons-IO library.
It no longer works with the JSONReader objects in ManifoldCF at all, and
refuses to read anything from them.  Unfortunately I can't change versions
of that library because other things depend upon it. So I'll need to write
my own code to replace its functionality.  That will take some amount of
time to do.

This probably happened the last time our dependencies were updated.  My
apologies.

Karl
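
The functionality in question is presumably little more than draining a Reader
into a Writer. A minimal sketch of such a utility, using only plain java.io
(class and method names are illustrative, not the eventual ManifoldCF code):

import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public final class ReaderCopy
{
  private ReaderCopy() {}

  /** Drain a Reader into a Writer, relying only on the java.io.Reader contract. */
  public static long copy(Reader in, Writer out) throws IOException
  {
    final char[] buffer = new char[4096];
    long total = 0;
    int n;
    // read() returns -1 only at end of stream, so the loop ends exactly when the
    // underlying reader reports that it has nothing more to produce.
    while ((n = in.read(buffer)) != -1)
    {
      out.write(buffer, 0, n);
      total += n;
    }
    return total;
  }
}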


On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <jp...@mcplusa.com>
wrote:

> Thanks,
>
> Don't know if it'll help, but removing the usage of JSONObjectReader on
> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk
> instead of using the JSONArrayReader on flushDocuments, changed the error I
> was getting from Amazon.
>
> Maybe those objects are failing on parsing the content to JSON.
>
> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Ok, I'm debugging away, and I can confirm that no data is getting
>> through.  I'll have to open a ticket and create a patch when I find the
>> problem.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Thank you very much.
>>>
>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
>>>> about the JSON format we are sending it.  The deprecation message is
>>>> probably a strong clue.  I'll experiment here with logging document
>>>> contents so that I can give you further advice.  Stay tuned.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>>> fails when sending the request to amazon cloudsearch:
>>>>>
>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>>> file\"] }", "deletes": 0}'
>>>>>
>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>>> the Amazon end, that would be great.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> More likely this is a bug.
>>>>>>>
>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>> seeing the expected content being posted to SOLR, so it may not be an issue
>>>>>>>> with TIKA. After adding some more logging to the CloudSearch connector, I
>>>>>>>> think the data is getting lost just before passing it to the
>>>>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Juan,
>>>>>>>>>
>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>>> solr output connection.  If you include the tika extractor in the pipeline,
>>>>>>>>> you will want to configure the solr connection to not use the extracting
>>>>>>>>> update handler.  There's a checkbox on the Schema tab you need to uncheck
>>>>>>>>> for that.  But if you do that you can see what is being sent to Solr pretty
>>>>>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>>>>
>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>>>>>> content.
>>>>>>>>>>
>>>>>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>
>>>>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>> +56 9 84265890
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>> +56 9 84265890
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>

Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
Thanks,

I don't know if it will help, but removing the use of JSONObjectReader in
addOrReplaceDocumentWithException, and posting to Amazon chunk by chunk
instead of going through the JSONArrayReader in flushDocuments, changed the
error I was getting back from Amazon.

Maybe those reader objects are failing to produce valid JSON from the content.
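
In case it helps anyone reproduce the chunk-by-chunk behaviour outside the
connector, here is a minimal sketch, assuming the standard CloudSearch
2013-01-01 document-service API and Apache HttpClient; the endpoint, class
and method names are placeholders, not the connector's actual code:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SingleDocumentPoster
{
  // Placeholder document-service endpoint; substitute your own CloudSearch domain.
  private static final String DOC_ENDPOINT =
    "https://doc-mydomain-xxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com/2013-01-01/documents/batch";

  /** Post one "add" operation per request; fieldsJson must already be a valid JSON object. */
  public static void postOne(CloseableHttpClient client, String id, String fieldsJson)
    throws IOException
  {
    // CloudSearch expects a JSON array of operations, even for a single document.
    String batch = "[{\"type\":\"add\",\"id\":\"" + id + "\",\"fields\":" + fieldsJson + "}]";
    HttpPost post = new HttpPost(DOC_ENDPOINT);
    post.setEntity(new StringEntity(batch, ContentType.APPLICATION_JSON));
    try (CloseableHttpResponse response = client.execute(post))
    {
      String body = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
      if (response.getStatusLine().getStatusCode() != 200)
        throw new IOException("CloudSearch rejected document " + id + ": " + body);
    }
  }

  public static void main(String[] args) throws IOException
  {
    try (CloseableHttpClient client = HttpClients.createDefault())
    {
      postOne(client, "doc-1", "{\"title\":\"Example\",\"body\":\"Extracted text goes here\"}");
    }
  }
}

IDs or fields that may contain quotes should of course be serialized with a
real JSON library rather than string concatenation.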

On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <da...@gmail.com> wrote:

> Ok, I'm debugging away, and I can confirm that no data is getting
> through.  I'll have to open a ticket and create a patch when I find the
> problem.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Thank you very much.
>>
>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
>>> about the JSON format we are sending it.  The deprecation message is
>>> probably a strong clue.  I'll experiment here with logging document
>>> contents so that I can give you further advice.  Stay tuned.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>> fails when sending the request to amazon cloudsearch:
>>>>
>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
>>>> "errors": [{"message": "[*Deprecated*: Use the outer message field]
>>>> Encountered unexpected end of file"}], "adds": 0, "__type":
>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>> file\"] }", "deletes": 0}'
>>>>
>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com> wrote:
>>>>
>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>> the Amazon end, that would be great.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> More likely this is a bug.
>>>>>>
>>>>>> I take it that it is the body string that is not coming out,
>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>> seeing the expected content being posted to SOLR, so it may not be an issue
>>>>>>> with TIKA. After adding some more logging to the CloudSearch connector, I
>>>>>>> think the data is getting lost just before passing it to the
>>>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Juan,
>>>>>>>>
>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>> solr output connection.  If you include the tika extractor in the pipeline,
>>>>>>>> you will want to configure the solr connection to not use the extracting
>>>>>>>> update handler.  There's a checkbox on the Schema tab you need to uncheck
>>>>>>>> for that.  But if you do that you can see what is being sent to Solr pretty
>>>>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>>>
>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>>>>> content.
>>>>>>>>>
>>>>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>
>>>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>> +56 9 84265890
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>> +56 9 84265890
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Ok, I'm debugging away, and I can confirm that no data is getting through.
I'll have to open a ticket and create a patch when I find the problem.

Karl


On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <jp...@mcplusa.com>
wrote:

> Thank you very much.
>
> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
>> about the JSON format we are sending it.  The deprecation message is
>> probably a strong clue.  I'll experiment here with logging document
>> contents so that I can give you further advice.  Stay tuned.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>> fails when sending the request to amazon cloudsearch:
>>>
>>> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
>>> "errors": [{"message": "[*Deprecated*: Use the outer message field]
>>> Encountered unexpected end of file"}], "adds": 0, "__type":
>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>> file\"] }", "deletes": 0}'
>>>
>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>
>>>
>>>
>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> If you can possibly include a snippet of the JSON you are seeing on the
>>>> Amazon end, that would be great.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com> wrote:
>>>>
>>>>> More likely this is a bug.
>>>>>
>>>>> I take it that it is the body string that is not coming out, correct?
>>>>> Do all the other JSON fields look reasonable?  Does the body clause exist
>>>>> and is just empty, or is it not there at all?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When running a copy of the job, but with SOLR as a target, I'm seeing
>>>>>> the expected content being posted to SOLR, so it may not be an issue with
>>>>>> TIKA. After adding some more logging to the CloudSearch connector, I think
>>>>>> the data is getting lost just before passing it to the
>>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Juan,
>>>>>>>
>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>> solr output connection.  If you include the tika extractor in the pipeline,
>>>>>>> you will want to configure the solr connection to not use the extracting
>>>>>>> update handler.  There's a checkbox on the Schema tab you need to uncheck
>>>>>>> for that.  But if you do that you can see what is being sent to Solr pretty
>>>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>>
>>>>>>> Please give this a try and let me know what happens.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>>>>>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>>>> content.
>>>>>>>>
>>>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>
>>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>> +56 9 84265890
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>> Full Stack Developer - MC+A Chile
>>>>>> +56 9 84265890
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>

Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
Thank you very much.

On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <da...@gmail.com> wrote:

> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
> about the JSON format we are sending it.  The deprecation message is
> probably a strong clue.  I'll experiment here with logging document
> contents so that I can give you further advice.  Stay tuned.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>> fails when sending the request to amazon cloudsearch:
>>
>> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
>> "errors": [{"message": "[*Deprecated*: Use the outer message field]
>> Encountered unexpected end of file"}], "adds": 0, "__type":
>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>> file\"] }", "deletes": 0}'
>>
>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>
>>
>>
>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> If you can possibly include a snippet of the JSON you are seeing on the
>>> Amazon end, that would be great.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> More likely this is a bug.
>>>>
>>>> I take it that it is the body string that is not coming out, correct?
>>>> Do all the other JSON fields look reasonable?  Does the body clause exist
>>>> and is just empty, or is it not there at all?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> When running a copy of the job, but with SOLR as a target, I'm seeing
>>>>> the expected content being posted to SOLR, so it may not be an issue with
>>>>> TIKA. After adding some more logging to the CloudSearch connector, I think
>>>>> the data is getting lost just before passing it to the
>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Juan,
>>>>>>
>>>>>> I'd try to reproduce as much of the pipeline as possible using a solr
>>>>>> output connection.  If you include the tika extractor in the pipeline, you
>>>>>> will want to configure the solr connection to not use the extracting update
>>>>>> handler.  There's a checkbox on the Schema tab you need to uncheck for
>>>>>> that.  But if you do that you can see what is being sent to Solr pretty
>>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>
>>>>>> Please give this a try and let me know what happens.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>>>>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>>> content.
>>>>>>>
>>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>
>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> --
>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>> +56 9 84265890
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
about the JSON format we are sending it.  The deprecation message is
probably a strong clue.  I'll experiment here with logging document
contents so that I can give you further advice.  Stay tuned.

Karl


On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <jp...@mcplusa.com>
wrote:

> I'm actually not seeing anything on Amazon. The CloudSearch connector
> fails when sending the request to amazon cloudsearch:
>
> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
> "errors": [{"message": "[*Deprecated*: Use the outer message field]
> Encountered unexpected end of file"}], "adds": 0, "__type":
> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
> file\"] }", "deletes": 0}'
>
> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>
>
>
> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com> wrote:
>
>> If you can possibly include a snippet of the JSON you are seeing on the
>> Amazon end, that would be great.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> More likely this is a bug.
>>>
>>> I take it that it is the body string that is not coming out, correct?
>>> Do all the other JSON fields look reasonable?  Does the body clause exist
>>> and is just empty, or is it not there at all?
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When running a copy of the job, but with SOLR as a target, I'm seeing
>>>> the expected content being posted to SOLR, so it may not be an issue with
>>>> TIKA. After adding some more logging to the CloudSearch connector, I think
>>>> the data is getting lost just before passing it to the
>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>> be that the JSONObjectReader doesn't like my data?
>>>>
>>>> Thanks,
>>>>
>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com> wrote:
>>>>
>>>>> Hi Juan,
>>>>>
>>>>> I'd try to reproduce as much of the pipeline as possible using a solr
>>>>> output connection.  If you include the tika extractor in the pipeline, you
>>>>> will want to configure the solr connection to not use the extracting update
>>>>> handler.  There's a checkbox on the Schema tab you need to uncheck for
>>>>> that.  But if you do that you can see what is being sent to Solr pretty
>>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>
>>>>> Please give this a try and let me know what happens.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>>>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>>> content.
>>>>>>
>>>>>> I've tried to find documentation on how to setup the TIKA
>>>>>> Transformation, but I haven't been able to find any.
>>>>>>
>>>>>> If someone could provide an example of a job setup to send from a
>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> --
>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>> Full Stack Developer - MC+A Chile
>>>>>> +56 9 84265890
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>

Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
I'm actually not seeing anything on Amazon. The CloudSearch connector fails
when sending the request to Amazon CloudSearch:

AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
"errors": [{"message": "[*Deprecated*: Use the outer message field]
Encountered unexpected end of file"}], "adds": 0, "__type":
"#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
file\"] }", "deletes": 0}'

ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
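
For reference, a well-formed batch for that endpoint looks like the snippet
below (field names and values are placeholders); the "Encountered unexpected
end of file" response is consistent with CloudSearch receiving an empty or
truncated body:

[
  {
    "type": "add",
    "id": "doc-1",
    "fields": {
      "title": "Example document",
      "body": "Extracted text goes here"
    }
  }
]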



On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <da...@gmail.com> wrote:

> If you can possibly include a snippet of the JSON you are seeing on the
> Amazon end, that would be great.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com> wrote:
>
>> More likely this is a bug.
>>
>> I take it that it is the body string that is not coming out, correct?  Do
>> all the other JSON fields look reasonable?  Does the body clause exist and
>> is just empty, or is it not there at all?
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Hi,
>>>
>>> When running a copy of the job, but with SOLR as a target, I'm seeing
>>> the expected content being posted to SOLR, so it may not be an issue with
>>> TIKA. After adding some more logging to the CloudSearch connector, I think
>>> the data is getting lost just before passing it to the
>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>> be that the JSONObjectReader doesn't like my data?
>>>
>>> Thanks,
>>>
>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Juan,
>>>>
>>>> I'd try to reproduce as much of the pipeline as possible using a solr
>>>> output connection.  If you include the tika extractor in the pipeline, you
>>>> will want to configure the solr connection to not use the extracting update
>>>> handler.  There's a checkbox on the Schema tab you need to uncheck for
>>>> that.  But if you do that you can see what is being sent to Solr pretty
>>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>>> should help you figure out if the problem is your tika configuration or not.
>>>>
>>>> Please give this a try and let me know what happens.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>>> Transformation or the field mapping. I think the Database where the records
>>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>>> content.
>>>>>
>>>>> I've tried to find documentation on how to setup the TIKA
>>>>> Transformation, but I haven't been able to find any.
>>>>>
>>>>> If someone could provide an example of a job setup to send from a
>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
If you can possibly include a snippet of the JSON you are seeing on the
Amazon end, that would be great.

Karl


On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <da...@gmail.com> wrote:

> More likely this is a bug.
>
> I take it that it is the body string that is not coming out, correct?  Do
> all the other JSON fields look reasonable?  Does the body clause exist and
> is just empty, or is it not there at all?
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Hi,
>>
>> When running a copy of the job, but with SOLR as a target, I'm seeing the
>> expected content being posted to SOLR, so it may not be an issue with TIKA.
>> After adding some more logging to the CloudSearch connector, I think the
>> data is getting lost just before passing it to the DocumentChunkManager,
>> which inserts the empty records to the DB. Could it be that the
>> JSONObjectReader doesn't like my data?
>>
>> Thanks,
>>
>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Juan,
>>>
>>> I'd try to reproduce as much of the pipeline as possible using a solr
>>> output connection.  If you include the tika extractor in the pipeline, you
>>> will want to configure the solr connection to not use the extracting update
>>> handler.  There's a checkbox on the Schema tab you need to uncheck for
>>> that.  But if you do that you can see what is being sent to Solr pretty
>>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>>> should help you figure out if the problem is your tika configuration or not.
>>>
>>> Please give this a try and let me know what happens.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>>> Transformation or the field mapping. I think the Database where the records
>>>> are supposed to be stored before flushing to Amazon, is storing empty
>>>> content.
>>>>
>>>> I've tried to find documentation on how to setup the TIKA
>>>> Transformation, but I haven't been able to find any.
>>>>
>>>> If someone could provide an example of a job setup to send from a
>>>> FileSystem to CloudSearch, that'd be great!
>>>>
>>>> Thanks in advance,
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
More likely this is a bug.

I take it that it is the body string that is not coming out, correct?  Do
all the other JSON fields look reasonable?  Is the body clause present but
empty, or is it not there at all?

Karl


On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <jp...@mcplusa.com>
wrote:

> Hi,
>
> When running a copy of the job, but with SOLR as a target, I'm seeing the
> expected content being posted to SOLR, so it may not be an issue with TIKA.
> After adding some more logging to the CloudSearch connector, I think the
> data is getting lost just before passing it to the DocumentChunkManager,
> which inserts the empty records to the DB. Could it be that the
> JSONObjectReader doesn't like my data?
>
> Thanks,
>
> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Juan,
>>
>> I'd try to reproduce as much of the pipeline as possible using a solr
>> output connection.  If you include the tika extractor in the pipeline, you
>> will want to configure the solr connection to not use the extracting update
>> handler.  There's a checkbox on the Schema tab you need to uncheck for
>> that.  But if you do that you can see what is being sent to Solr pretty
>> exactly; it all gets logged in the INFO messages dumped to solr log.  This
>> should help you figure out if the problem is your tika configuration or not.
>>
>> Please give this a try and let me know what happens.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Hi,
>>>
>>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>>> CloudSearch I'm seeing that only empty messages are being sent to my
>>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>>> Transformation or the field mapping. I think the Database where the records
>>> are supposed to be stored before flushing to Amazon, is storing empty
>>> content.
>>>
>>> I've tried to find documentation on how to setup the TIKA
>>> Transformation, but I haven't been able to find any.
>>>
>>> If someone could provide an example of a job setup to send from a
>>> FileSystem to CloudSearch, that'd be great!
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>

Re: Amazon CloudSearch Connector question

Posted by Juan Pablo Diaz-Vaz <jp...@mcplusa.com>.
Hi,

When running a copy of the job with SOLR as a target, I'm seeing the
expected content being posted to SOLR, so it may not be an issue with TIKA.
After adding some more logging to the CloudSearch connector, I think the
data is getting lost just before it is passed to the DocumentChunkManager,
which then inserts empty records into the DB. Could it be that the
JSONObjectReader doesn't like my data?

Thanks,

On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Juan,
>
> I'd try to reproduce as much of the pipeline as possible using a solr
> output connection.  If you include the tika extractor in the pipeline, you
> will want to configure the solr connection to not use the extracting update
> handler.  There's a checkbox on the Schema tab you need to uncheck for
> that.  But if you do that you can see what is being sent to Solr pretty
> exactly; it all gets logged in the INFO messages dumped to solr log.  This
> should help you figure out if the problem is your tika configuration or not.
>
> Please give this a try and let me know what happens.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Hi,
>>
>> I've successfully sent data to FileSystems and SOLR, but for Amazon
>> CloudSearch I'm seeing that only empty messages are being sent to my
>> domain. I think this may be an issue on how I've setup the TIKA Extractor
>> Transformation or the field mapping. I think the Database where the records
>> are supposed to be stored before flushing to Amazon, is storing empty
>> content.
>>
>> I've tried to find documentation on how to setup the TIKA Transformation,
>> but I haven't been able to find any.
>>
>> If someone could provide an example of a job setup to send from a
>> FileSystem to CloudSearch, that'd be great!
>>
>> Thanks in advance,
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890

Re: Amazon CloudSearch Connector question

Posted by Karl Wright <da...@gmail.com>.
Hi Juan,

I'd try to reproduce as much of the pipeline as possible using a Solr
output connection.  If you include the Tika extractor in the pipeline, you
will want to configure the Solr connection not to use the extracting update
handler.  There's a checkbox on the Schema tab you need to uncheck for
that.  But if you do that, you can see almost exactly what is being sent to
Solr; it all gets logged in the INFO messages dumped to the Solr log.  This
should help you figure out whether the problem is your Tika configuration or not.

Please give this a try and let me know what happens.

Karl


On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <jp...@mcplusa.com>
wrote:

> Hi,
>
> I've successfully sent data to FileSystems and SOLR, but for Amazon
> CloudSearch I'm seeing that only empty messages are being sent to my
> domain. I think this may be an issue on how I've setup the TIKA Extractor
> Transformation or the field mapping. I think the Database where the records
> are supposed to be stored before flushing to Amazon, is storing empty
> content.
>
> I've tried to find documentation on how to setup the TIKA Transformation,
> but I haven't been able to find any.
>
> If someone could provide an example of a job setup to send from a
> FileSystem to CloudSearch, that'd be great!
>
> Thanks in advance,
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>