Posted to users@jena.apache.org by Jean-Marc Vanel <je...@gmail.com> on 2018/01/29 09:14:49 UTC

indexing text in HTML content

Hi

With semantic_forms one can create content with an HTML editor in
JavaScript.

Example:
http://semantic-forms.cc:9112/download?url=http%3A%2F%2Fsemantic-forms.cc%3A9112%2Fldp%2F1515780312176-31461258964949990&syntax=Turtle
and how it looks in the UI:
http://semantic-forms.cc:9112/ldp/1515780312176-31461258964949990

My question is:
Does Jena text indexing also index the tags in HTML (or XML) content?
If so, <bold> would be indexed in Lucene, which is not desirable.

Neither of these two pages says anything about it:
https://jena.apache.org/documentation/notes/typed-literals.html
https://jena.apache.org/documentation/query/text-query.html

-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
<http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui

Re: indexing text in HTML content

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Jean-Marc!

Lorenz is correct. You can use pretty much any Lucene analyzer with
jena-text, but as far as I know there isn't one for HTML, so you'd have
to write your own and add it to the jena-text codebase (or to Lucene itself).

I see that Elasticsearch has an HTML Strip Char Filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

I don't think the current jena-text Elasticsearch backend is
configurable enough to start using that filter as it is, but support
probably wouldn't be very difficult to add. The Lucene side already supports
arbitrary analyzers (including filters) through assembler configuration.
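
For reference, on the Elasticsearch side this char filter is enabled in the index settings by listing `html_strip` under a custom analyzer's `char_filter`. The Python sketch below only builds such a settings body as a dict. The analyzer name `html_text` is a placeholder chosen for illustration, and nothing here is wired into jena-text, whose Elasticsearch backend would first need the extra configurability discussed above.

```python
import json

# Index settings body for Elasticsearch.
# "html_strip" is the built-in char filter from the page linked above;
# "html_text" is a placeholder analyzer name (an assumption, not a jena-text name).
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {
                    "type": "custom",
                    # Char filters run on the raw characters, before tokenizing.
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

body = json.dumps(settings, indent=2)
print(body)
```

Because the stripping happens in a char filter, it runs before the tokenizer and can be combined with whatever tokenizer and token filters the field already uses.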

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: indexing text in HTML content

Posted by Jean-Marc Vanel <je...@gmail.com>.
Many thanks, Lorenz!

This is annoying: I can't preprocess the literals before putting them in
TDB, because TDB *is* the database for my CMS + social network,
and duplicating the data would be a mess.
But maybe there is a way to preprocess the literals before they reach
the underlying Lucene index.

That said, the most frequent tags, <p> and <div>, are unlikely to
be search strings entered by a user.
So this is not a big problem,
but I found it an interesting one.
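
The idea of cleaning literals only on their way into the index, while the store keeps the original HTML, can be sketched in Python as follows. This is a toy in-memory model, not jena-text code: `store`, `index` and `add_literal` are hypothetical names, and a real solution would more likely hook into the analyzer chain.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps character data and drops the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return only the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; it would also split
    # a word that is interrupted by an inline tag (a known simplification).
    return " ".join(parser.chunks)


store = {}                # subject -> original literal (stands in for TDB)
index = defaultdict(set)  # term -> subjects (stands in for the Lucene index)


def add_literal(subject, html):
    """Store the literal unchanged, but index only its tag-free text."""
    store[subject] = html
    for term in re.findall(r"\w+", strip_tags(html).lower()):
        index[term].add(subject)


original = "<div><p>Hello <bold>world</bold></p></div>"
add_literal("ex:post1", original)
print(store["ex:post1"])  # the HTML is kept as written
print(sorted(index))      # only terms from the text content
```

The stored literal is never rewritten, so there is a single copy of the data; only the derived index sees the cleaned text.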


Re: indexing text in HTML content

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
I guess it simply uses the Lucene Standard Analyzer, so yes, the tags
will be indexed. As far as I know there isn't an HTML analyzer in Lucene,
which means you have to preprocess the literals first, with Apache Tika or
something like JSoup, before you add them to the triple store.
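
To make the effect concrete, here is a minimal Python sketch; it is not Jena or Lucene code. The `\w+` tokenizer only roughly approximates what the Standard Analyzer does with markup, and `strip_tags` stands in for the Tika or JSoup preprocessing step. Both function names are made up for illustration.

```python
import re
from html.parser import HTMLParser


def rough_tokens(text):
    """Lowercased word tokens; a crude stand-in for the Standard Analyzer."""
    return [t.lower() for t in re.findall(r"\w+", text)]


class _TextOnly(HTMLParser):
    """Collects character data and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


literal = "<p>Jena <bold>text</bold> indexing</p>"
raw = rough_tokens(literal)                # tag names show up as terms
clean = rough_tokens(strip_tags(literal))  # only the text content remains
print(raw)
print(clean)
```

Without preprocessing, the tag names `p` and `bold` end up among the terms; after stripping, only `jena`, `text` and `indexing` remain.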


Lorenz


