Posted to user@manifoldcf.apache.org by Dileepa Jayakody <di...@gmail.com> on 2017/10/06 12:35:15 UTC
How to extract text content and index in elastic-search
Hi All,
I'm trying out a small demo with a file system repository connector and an
Elasticsearch output connector to extract and index spreadsheet documents.
I've also added a Tika transformation connector to the job.
When I run the job, the documents get indexed in Elasticsearch, but the
content is indexed as binary.
See the indexed content in ES below. How can I get the spreadsheet content
extracted as text here?
Even for a plain text file, the content is indexed as binary.
Is there a configuration setting I need to change to get the text content
extracted and indexed in ES?
{
  "_index": "test",
  "_type": "generictype",
  "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
  "_score": 1,
  "_source": {
    "stream_size": "101613",
    "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
    "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
    "protected": "false",
    "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
    "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
    "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "allow_token_document": "__nosecurity__",
    "deny_token_document": "__nosecurity__",
    "allow_token_share": "__nosecurity__",
    "deny_token_share": "__nosecurity__",
    "allow_token_parent": "__nosecurity__",
    "deny_token_parent": "__nosecurity__",
    "file": {
      "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      "_name": "MI - Project2 - Estimation v1.0.xlsx",
      "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
    }
  }
]
}
}
Thanks,
Dileepa
Re: How to extract text content and index in elastic-search
Posted by Karl Wright <da...@gmail.com>.
Hi Dileepa,
MCF passes content through its processing chain as binary. It's up to the
output connection configuration to decide if the output should be rendered
as text or binary, and it is there that a different decision would need to
be made.
IIRC there's a flag you can set that chooses between binary indexing (using
the mapper attachments plugin) and plain text indexing (which doesn't). But I don't know
enough about ES to know whether this works properly with later versions of
ES, since ES is infamous for not maintaining backwards compatibility
between releases. Can anyone else answer this question?
Karl
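A quick way to see what "binary" means here: the "_content" value in the
indexed document looks like Base64, which is how binary content is usually
carried inside a JSON field. The minimal Python sketch below decodes only the
leading part of the value quoted in the question (the rest is elided in the
post and is not reproduced). The function name decode_content is made up for
this illustration and is not part of MCF or ES.

```python
import base64

# First 28 characters of the "_content" value shown in the question.
# The remainder of the value is elided in the original post, so only this
# prefix is decoded here.
content_prefix = "RGV2ZWxvcG1lbnQgRXN0aW1hdGVz"


def decode_content(encoded: str) -> str:
    """Decode a Base64 string and interpret the bytes as UTF-8 text."""
    return base64.b64decode(encoded).decode("utf-8")


# Print the decoded prefix to check whether it is readable text.
print(decode_content(content_prefix))
```

If the decoded prefix reads as plain text, the Tika step has most likely done
its extraction already, and what remains is only how the output connection
hands the content to ES.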
RE: How to extract text content and index in elastic-search
Posted by S <st...@remcam.net>.
Hi Dileepa
If you're using a later version of ES, you can just add the Ingest Attachment plugin to ES.
Alternatively, add a field name for the Content field in the MCF ES configuration.
I'll check it when I get back.
Steph
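For illustration only, and assuming the plugin meant above is Elasticsearch's
ingest-attachment plugin: an ingest pipeline with an "attachment" processor
reads a Base64 field and extracts text from it. The Python sketch below only
builds the JSON body that would be sent with PUT _ingest/pipeline/<name>. The
helper name attachment_pipeline and the choice of "file._content" as the
source field are assumptions based on the document shown in the question, not
something confirmed in this thread, and whether MCF 2.8.1 can route documents
through such a pipeline is not established here either.

```python
import json


def attachment_pipeline(source_field: str) -> dict:
    """Build an ingest pipeline body with a single attachment processor.

    source_field is the document field holding the Base64-encoded content.
    """
    return {
        "description": "Extract text from a Base64-encoded field",
        "processors": [
            {"attachment": {"field": source_field}},
        ],
    }


# Assumption: the Base64 content lives under "file._content", as in the
# indexed document quoted in the question.
body = attachment_pipeline("file._content")
print(json.dumps(body, indent=2))
```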
Re: How to extract text content and index in elastic-search
Posted by Dileepa Jayakody <di...@gmail.com>.
Guys, I'm using the latest 2.8.1 release.
Thanks