Posted to user@manifoldcf.apache.org by Dileepa Jayakody <di...@gmail.com> on 2017/10/06 12:35:15 UTC

How to extract text content and index in elastic-search

Hi All,

I'm trying out a small demo with a file system repository connector and
an Elasticsearch output connector to extract spreadsheet documents and
index them. I've also added a Tika transform connector to the job.

When I run the job, the documents get indexed in Elasticsearch, but the
content is being indexed as binary.

See below the indexed content in ES. Can I please know how to extract the
spreadsheet content as text here?
Even for a text file, I see the content is being indexed as binary.
Is there a configuration I need to set to get the text content
extracted and indexed in ES?

{
        "_index": "test",
        "_type": "generictype",
        "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
        "_score": 1,
        "_source": {
          "stream_size": "101613",
          "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
          "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
          "protected": "false",
          "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
          "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
          "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "allow_token_document": "__nosecurity__",
          "deny_token_document": "__nosecurity__",
          "allow_token_share": "__nosecurity__",
          "deny_token_share": "__nosecurity__",
          "allow_token_parent": "__nosecurity__",
          "deny_token_parent": "__nosecurity__",
          "file": {
            "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "_name": "MI - Project2 - Estimation v1.0.xlsx",
            "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
        }
      }
    ]
  }
}

Thanks,
Dileepa

Re: How to extract text content and index in elastic-search

Posted by Karl Wright <da...@gmail.com>.

Hi Dileepa,

MCF passes content through its processing chain as binary.  It's up to the
output connection configuration to decide if the output should be rendered
as text or binary, and it is there that a different decision would need to
be made.

IIRC there's a flag you can set that chooses between binary indexing (using
the mapper attachment) and text (which doesn't do that).  But I don't know
enough about ES to know whether this works properly with later versions of
ES, since ES is infamous for not maintaining backwards compatibility
between releases.  Can anyone else answer this question?

Karl
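
For reference, the "_content" value in the indexed document quoted in this
thread appears to be base64-encoded text rather than raw spreadsheet bytes,
which would suggest that the Tika extraction did run and only the final
representation is binary. A minimal Python sketch, assuming standard base64
and UTF-8 text, that decodes the leading portion of that value. The rest of
the value is elided in the original message, so only the prefix is used, and
the helper name decode_content is made up for illustration:

```python
import base64

# Leading portion of the "_content" value from the indexed document in this
# thread. The remainder of the value is elided in the original message, so
# it is deliberately not reproduced here.
content_prefix = "RGV2ZWxvcG1lbnQgRXN0aW1hdGVz"


def decode_content(b64_text: str) -> str:
    """Decode a base64 string into UTF-8 text (hypothetical helper)."""
    return base64.b64decode(b64_text).decode("utf-8")


decoded = decode_content(content_prefix)
print(decoded)  # Development Estimates
```

If the decoded prefix is readable, as it is here, the remaining question is
how the output connection represents the content, not whether the text was
extracted.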


RE: How to extract text content and index in elastic-search

Posted by S <st...@remcam.net>.

Hi Dileepa,
If you're using a later version of ES, you can just add the Ingest Plugin to ES.
Alternatively, add a field name for the Content field in the MCF ES configuration.
I'll check it when I get back.
Steph
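
As a rough illustration of the first suggestion, and assuming that "Ingest
Plugin" refers to the Elasticsearch ingest attachment processor plugin, a
pipeline definition could be sketched as below. The pipeline name
"attachment" and the helper build_attachment_pipeline are made up for
illustration, and the source field path "file._content" is taken from the
indexed document quoted in this thread. Whether the MCF Elasticsearch output
connection can send documents through such a pipeline is not confirmed
anywhere in this thread. The sketch only builds and prints the request body;
it does not contact a cluster:

```python
import json

# Hypothetical pipeline name, chosen only for this example.
PIPELINE_NAME = "attachment"
# Field holding the base64 content in the document shown in this thread.
SOURCE_FIELD = "file._content"


def build_attachment_pipeline(source_field: str) -> dict:
    """Build an ingest pipeline body that uses the attachment processor."""
    return {
        "description": "Extract text from base64-encoded attachments",
        "processors": [
            # The attachment processor reads base64 data from "field".
            {"attachment": {"field": source_field}}
        ],
    }


pipeline = build_attachment_pipeline(SOURCE_FIELD)
# Print the request that would register the pipeline, for manual review.
print(f"PUT _ingest/pipeline/{PIPELINE_NAME}")
print(json.dumps(pipeline, indent=2))
```

Documents would then have to be indexed with this pipeline selected, which
depends on the ES version and on how the index requests are issued.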

Re: How to extract text content and index in elastic-search

Posted by Dileepa Jayakody <di...@gmail.com>.

Guys, I'm using the latest 2.8.1 release.

Thanks
