Posted to users@jena.apache.org by Mikael Pesonen <mi...@lingsoft.fi> on 2017/11/22 13:44:16 UTC

Similar results with full text search

Are there any plans to implement similar-text search for Jena?

Until similarity is implemented, is it possible to query similar texts 
using Lucene directly, bypassing Jena, but with the same data set?

Br,

-- 
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.pesonen@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND

Re: Similar results with full text search

Posted by Mikael Pesonen <mi...@lingsoft.fi>.

Thanks for the tips! I'll check what ES can do.

Br,
Mikael



Re: Similar results with full text search

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Mikael!

Not sure how jena-text could help here if the documents are in another 
index. But maybe you could look at using the Elasticsearch backend of 
jena-text. It stores the index in ES, so it can also be queried outside 
Jena. If you had the documents + jena-text indexed metadata in ES, you 
could use ES facilities for similarity search and still do some things 
in SPARQL.
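
As a rough illustration of the "ES facilities for similarity search" mentioned above, here is a minimal sketch that builds the JSON body of an Elasticsearch `more_like_this` query. The index name `documents`, the field name `content` and the id `doc-1` are hypothetical, and the exact shape of the `like` entry can differ between Elasticsearch versions, so check it against the version in use. Sending the request is deliberately left out.

```python
import json
from typing import Dict, List


def more_like_this_body(index: str, doc_id: str, fields: List[str],
                        size: int = 10) -> Dict:
    """Build a search body asking for documents similar to an indexed one."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                # Fields whose text is compared (hypothetical names).
                "fields": fields,
                # Refer to an already indexed document instead of raw text.
                "like": [{"_index": index, "_id": doc_id}],
                # Also consider terms that occur only once in the source.
                "min_term_freq": 1,
                # Upper bound on the number of terms in the generated query.
                "max_query_terms": 25,
            }
        },
    }


# Assumed index, id and field; replace with the real ones.
body = more_like_this_body("documents", "doc-1", ["content"])
print(json.dumps(body, indent=2))
```

The body would be posted to the search endpoint of the index. The ids in the response could then be used in a SPARQL query to fetch metadata from Jena.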

-Osma




-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Similar results with full text search

Posted by Mikael Pesonen <mi...@lingsoft.fi>.

Hi Osma!

We have a set of documents and their metadata. The metadata is stored in
Jena and the texts in a separate database (RDF id, content).

The first use case would be to search documents by their content and
list their metadata using SPARQL. I'm not sure whether even this is
possible.

Second, a similarity search would return the ids of documents similar
to a given document, based on metadata and content.

We have already set this up as a separate Lucene installation: we first
query documents from the Lucene index, then filter the result sets
by additional metadata fields using Jena. This setup is quite
complicated, so I was hoping that tighter integration with Jena would
make things easier.
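
The second step of the setup described above (filtering externally found documents by metadata in Jena) can be sketched as follows. This is only an illustration under assumptions: the document URIs and the `dcterms:title` predicate are hypothetical, and the function merely builds a SPARQL query string with a `VALUES` clause; running it against a Jena endpoint is left out.

```python
from typing import Iterable


def build_metadata_query(doc_uris: Iterable[str], predicate_uri: str) -> str:
    """Return a SPARQL query fetching one property for the given documents."""
    uris = list(doc_uris)
    for uri in uris + [predicate_uri]:
        # Refuse characters that would break out of the <...> IRI syntax.
        if any(ch in uri for ch in '<>" \n\t'):
            raise ValueError(f"unsafe IRI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "SELECT ?doc ?value WHERE {\n"
        f"  VALUES ?doc {{ {values} }}\n"
        f"  ?doc <{predicate_uri}> ?value .\n"
        "}"
    )


# Hypothetical ids standing in for the hits returned by the Lucene search.
hits = ["http://example.org/doc/1", "http://example.org/doc/2"]
query = build_metadata_query(hits, "http://purl.org/dc/terms/title")
print(query)
```

Restricting `?doc` with `VALUES` keeps the SPARQL side limited to the documents the text index already selected; further triple patterns or `FILTER`s on metadata can be added inside the same `WHERE` block.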

Br,
Mikael


Re: Similar results with full text search

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Mikael!

Sorry, I probably misunderstood - I somehow read "similar" as meaning 
"fuzzy" but they are of course not the same thing. So if you mean "give 
me documents similar to document X", that's called MoreLikeThis in 
Lucene, and it's currently not supported by jena-text. What's your use 
case? How would you like to use it if it existed?
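
To make the "documents similar to document X" idea concrete, here is a simplified sketch of the general approach behind MoreLikeThis-style search: pick the most distinctive terms of the source document (here by a plain tf-idf weight) and rank other documents by how many of those terms they contain. It is not Lucene's actual implementation or scoring, and the tiny corpus is made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def interesting_terms(doc_id: str, docs: Dict[str, List[str]],
                      max_terms: int = 3) -> List[str]:
    """Return the terms of one document with the highest tf-idf weight."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(docs[doc_id])
    scores = {term: tf * math.log(n_docs / doc_freq[term])
              for term, tf in term_freq.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:max_terms]


def similar_documents(doc_id: str, docs: Dict[str, List[str]],
                      max_terms: int = 3) -> List[str]:
    """Rank other documents by how many interesting terms they share."""
    terms = set(interesting_terms(doc_id, docs, max_terms))
    overlap = {other: len(terms & set(tokens))
               for other, tokens in docs.items() if other != doc_id}
    return sorted((o for o in overlap if overlap[o] > 0),
                  key=lambda o: (-overlap[o], o))


# Made-up, pre-tokenized documents.
docs = {
    "a": "jena text index lucene search".split(),
    "b": "lucene index fuzzy search query".split(),
    "c": "finnish library catalogue metadata".split(),
}
print(interesting_terms("a", docs))   # ['jena', 'text', 'index']
print(similar_documents("a", docs))   # ['b']
```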

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Similar results with full text search

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Mikael!

Fuzzy search is a basic Lucene feature, just like prefix searches. You 
should be able to use it directly via jena-text using a query like
?s text:query "word~"
or
?s text:query "word~1"

There is AFAICT nothing to implement on the jena-text side as this 
already works right now.
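
For readers unfamiliar with the `~` syntax: the number after the tilde is the maximum edit distance between the query term and an indexed term, and as far as I know a bare `~` defaults to 2 in Lucene's classic query parser. The sketch below shows which terms of a made-up vocabulary fall within a given distance of "word". It uses plain Levenshtein distance; Lucene's fuzzy queries by default also count a transposition of two adjacent characters as a single edit, which this sketch does not do.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_matches(term: str, vocabulary: Iterable[str],
                  max_edits: int) -> List[str]:
    """Return the vocabulary entries within max_edits of the term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]


# Made-up indexed terms.
vocabulary = ["word", "ward", "words", "sword", "world", "wired"]
print(fuzzy_matches("word", vocabulary, 1))
# ['word', 'ward', 'words', 'sword', 'world']
print(fuzzy_matches("word", vocabulary, 2))
# ['word', 'ward', 'words', 'sword', 'world', 'wired']
```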

-Osma

