Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2019/01/14 03:18:47 UTC

Content from EML files indexing from text/html (which is not clean) instead of text/plain

Hi,

I am using Solr 7.5.0 with Tika 1.18.

Currently I am facing a situation during the indexing of EML files, whereby
the content is being extracted from the Content-type=text/html instead of
Content-type=text/plain.

The problem with Content-type=text/html is that it contains a lot of words
like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
these get indexed in Solr as well. This makes the content very cluttered,
and it also affects search: when we search for words like "font", all
the content gets returned because of this.

I would like to enquire about the following:
1. Why didn't Tika get the text part (text/plain)? Is there any way to
configure Tika in Solr to prioritize the text part (text/plain) over the
HTML part (text/html)?
2. If that is not possible: as you can see, the extracted content is not
clean. How can we get clean content when Tika is extracting text?

Regards,
Edwin

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Ok, thanks for providing the information.

Regards,
Edwin

On Fri, 18 Jan 2019 at 00:33, Tim Allison <ta...@apache.org> wrote:

> Y, I tracked this down within Solr.  This is a feature, not a bug.  I
> found a solution (set {{captureAttr}} to {{true}}):
>
> https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
>
> Please, though, for the sake of Solr, please run Tika outside of Solr
> in production (e.g. SolrJ...see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/)

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Tim Allison <ta...@apache.org>.
Yes, I tracked this down within Solr.  This is a feature, not a bug.  I
found a solution (set captureAttr to true):
https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263

Please, though, for the sake of Solr, run Tika outside of Solr
in production (e.g. with SolrJ; see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/)
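
As an illustration of the captureAttr suggestion above, here is a minimal
sketch of building a request URL for Solr's ExtractingRequestHandler with
captureAttr=true. The host, collection name ("mycollection") and document
id are placeholders that do not come from this thread; literal.id and
commit are standard Solr Cell request parameters. As far as the Solr
Reference Guide describes it, captureAttr=true indexes element attributes
into separate fields, which would keep style values out of the content.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, collection, doc_id, capture_attr=True):
    """Build a URL for Solr's ExtractingRequestHandler (/update/extract)."""
    params = {
        # Unique key for the document (hypothetical value supplied by caller).
        "literal.id": doc_id,
        # The setting suggested in the thread: with captureAttr=true,
        # attributes such as style="..." should go to separate fields.
        "captureAttr": "true" if capture_attr else "false",
        "commit": "true",
    }
    return f"{solr_base.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


# Placeholder host, collection and id; adjust to your own setup.
url = build_extract_url("http://localhost:8983/solr", "mycollection", "sample-eml-1")
print(url)
```

The EML file itself would then be POSTed to this URL as the request body.
The same parameter can presumably also be set as a default on the handler
in solrconfig.xml instead of on each request.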

On Thu, Jan 17, 2019 at 2:15 AM Zheng Lin Edwin Yeo
<ed...@gmail.com> wrote:
>
> Based on the discussion in Tika and also on the Jira (TIKA-2814), it was
> said that the issue could be with the Solr's ExtractingRequestHandler, in
> which the HTMLParser is either not being applied, or is somehow not
> stripping the content of <span/> elements. Straight Tika app is able to do
> the right thing.
>
> Regards,
> Edwin

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Based on the discussion on the Tika mailing list and in Jira (TIKA-2814),
the issue could be with Solr's ExtractingRequestHandler, in which the
HTMLParser is either not being applied or is somehow not stripping the
content of <span/> elements. The standalone Tika app is able to do the
right thing.

Regards,
Edwin

On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Alex,
>
> Thanks for the suggestions.
> Yes, I have posted it in the Tika mailing list too.
>
> Regards,
> Edwin

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Alex,

Thanks for the suggestions.
Yes, I have posted it to the Tika mailing list too.

Regards,
Edwin

On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> I think asking this question on Tika mailing list may give you better
> answers. Then, if the conclusion is that the behavior is configurable,
> you can see how to do it in Solr. It may be however, that you need to
> do the parsing outside of Solr with standalone Tika. Standalone Tika
> is a production advice anyway.
>
> I would suggest the title be something like "How to prefer plain/text
> part of an email message when parsing .eml files".
>
> Regards,
>   Alex.

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I think asking this question on the Tika mailing list may give you better
answers. Then, if the conclusion is that the behavior is configurable,
you can see how to do it in Solr. It may be, however, that you need to
do the parsing outside of Solr with standalone Tika. Running Tika
standalone is the recommended practice for production anyway.

I would suggest the title be something like "How to prefer the text/plain
part of an email message when parsing .eml files".

Regards,
  Alex.
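
To make the "prefer the plain-text part, parsed outside of Solr" idea
concrete, here is a minimal sketch. It deliberately uses Python's standard
email module rather than Tika, so it only illustrates the selection logic
and is not how Tika or Solr choose a part. The function name and the demo
message are made up for this example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def preferred_text(raw_eml: bytes) -> str:
    """Return the text/plain body of an EML message, falling back to HTML."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # get_body() walks multipart/alternative and honours the preference order.
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content().strip() if part is not None else ""


# Hypothetical two-part message standing in for a real .eml file.
demo = EmailMessage()
demo["Subject"] = "Domain name"
demo.set_content("Hi There,\nBest Regards,\nJosh\n")
demo.add_alternative(
    '<span style="font-size: 14pt;">Hi There,</span>', subtype="html"
)

plain = preferred_text(demo.as_bytes())
print(plain)
```

The resulting string could then be sent to Solr as an ordinary field value.
If a message has no text/plain part, the fallback returns raw HTML, which
would still need its markup stripped before indexing.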

On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
>
> Hi,
>
> I have uploaded a sample EML file here:
> https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
>
> This is what is indexed in the content:
>
>         "content":"  font-size: 14pt; font-family: book antiqua,
> palatino, serif;  Hi There,   <br><br> font-size: 14pt; font-family:
> book antiqua, palatino, serif;  My client owns the domain name “
> font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
>  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> antiqua, palatino, serif;  ” and is considering putting it in market.
> It is keyword rich domain with good search volume,adword bidding and
> type-in-traffic.   <br><br> font-size: 14pt; font-family: book
> antiqua, palatino, serif;  Based on our extensive study, we strongly
> feel that you should consider buying this domain name to improve the
> SEO, Online visibility, brand image, authority and type-in-traffic for
> your business. We also do provide free 1 year hosting and unlimited
> emails along with domain name.   <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif;  Besides this, if you need
> any other domain name, web and app designing services and digital
> marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> to contact us.   <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif;  Best Regards,   <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif;  Josh   <br><br>",
>
>
> As you can see, this is taken from the Content-Type: text/html.
> However, the Content-Type: text/plain looks clean, and that is what we want
> it to be indexed.
>
> How can we configure the Tika in Solr to change the priority to get the
> content from Content-Type: text/plain  instead of Content-Type: text/html?

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

I have uploaded a sample EML file here:
https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing

This is what is indexed in the content:

        "content":"  font-size: 14pt; font-family: book antiqua,
palatino, serif;  Hi There,   <br><br> font-size: 14pt; font-family:
book antiqua, palatino, serif;  My client owns the domain name “
font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
 TravelInsuranceEurope.com   font-size: 14pt; font-family: book
antiqua, palatino, serif;  ” and is considering putting it in market.
It is keyword rich domain with good search volume,adword bidding and
type-in-traffic.   <br><br> font-size: 14pt; font-family: book
antiqua, palatino, serif;  Based on our extensive study, we strongly
feel that you should consider buying this domain name to improve the
SEO, Online visibility, brand image, authority and type-in-traffic for
your business. We also do provide free 1 year hosting and unlimited
emails along with domain name.   <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif;  Besides this, if you need
any other domain name, web and app designing services and digital
marketing services (SEO, PPC and SMO) at reasonable charges, feel free
to contact us.   <br><br> font-size: 14pt; font-family: book antiqua,
palatino, serif;  Best Regards,   <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif;  Josh   <br><br>",


As you can see, this is taken from the Content-Type: text/html part.
However, the Content-Type: text/plain part looks clean, and that is what
we want to be indexed.

How can we configure Tika in Solr to prioritize the content from
Content-Type: text/plain over Content-Type: text/html?
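
For comparison, clean output from the HTML part means keeping only the
text nodes and dropping attribute values such as style. The sketch below
shows that idea with Python's standard html.parser; it is a stand-in for
illustration, not Tika's HTMLParser, and the sample markup is a guess at
what the HTML part of the message looks like.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes only; tags and their attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Assumed shape of the HTML part; not copied from the sample EML.
sample = (
    '<span style="font-size: 14pt; font-family: book antiqua, palatino, serif;">'
    "Hi There,</span><br><br>"
    '<span style="font-size: 14pt;">Best Regards,</span>'
)
text = html_to_text(sample)
print(text)
```

Note that this simple version would still emit the contents of <style>
and <script> elements as text, so a real cleaner needs to skip those.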

On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files,
> whereby the content is being extracted from the Content-type=text/html
> instead of Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains alot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered,
> and it also affect the search, as when we search for words like "font", all
> the contents gets returned because of this.
>
> Would like to enquire on the following:
> 1. Why Tika didn't get the text part (text/plain). Is there any way to
> configure the Tika in Solr to change the priority to get the text part
> (text/plain) instead of html part (text/html).
> 2. If that is not possible, as you can see, the content is not clean,
> which is not right. How can we get this to be clean when Tika is extracting
> text?
>
> Regards,
> Edwin
>

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

Posted by Terry Steichen <te...@net-frame.com>.
Using Solr 6.6.0, I am able to index EML files just fine.  The trick is,
when indexing files that include .eml files, to add "-filetypes eml" to
the command line (note the plural "filetypes").

Terry Steichen

On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files, whereby
> the content is being extracted from the Content-type=text/html instead of
> Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains alot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered,
> and it also affect the search, as when we search for words like "font", all
> the contents gets returned because of this.
>
> Would like to enquire on the following:
> 1. Why Tika didn't get the text part (text/plain). Is there any way to
> configure the Tika in Solr to change the priority to get the text part
> (text/plain) instead of html part (text/html).
> 2. If that is not possible, as you can see, the content is not clean, which
> is not right. How can we get this to be clean when Tika is extracting text?
>
> Regards,
> Edwin
>