Posted to solr-user@lucene.apache.org by "tomas.kalas" <ka...@email.cz> on 2014/10/30 15:27:50 UTC

Design optimal Solr Schema

Hello, I have a problem designing a schema in Solr. I have a transcript of a
telephone conversation in the format shown below, and I parse it into
individual fields. This is what I currently index:

<?xml version="1.0"?>
<add>
<doc>
<field name="id">01.cn</field>
<field name="t">0<br /> 1<br /> 2<br /> 2 <br /> 3 <br /> ....</field>
<field name="st">0.00<br /> 1.54<br /> 1.54<br /> 1.54 <br /> 1.57 <br />
....</field>
<field name="et">1.54<br /> 1.54<br /> 1.57<br /> 1.57 <br /> 1.7 <br />
....</field>
<field name="w">_SILENCE_<br /> &lt;s&gt;<br /> HELLO<br /> HALLO <br /> _DELETE_
<br /> ....</field>
<field name="p">0.000000<br /> 1<br /> 1<br /> 2.06115e-009 <br /> 1 <br />
....</field>
<field name="c">0<br /> 0<br /> 0<br /> 0 <br /> 0 <br /> ....</field>
</doc>
</add>

I display it in an HTML document, which is why I used the <br /> tags.

This is the original document:

T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.000000 C=0
T=1 ST=1.54 ET=1.54 W=<s> P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
T=3 ST=1.57 ET=1.70 W=NO P=2.06115e-009 C=0
T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0
T=5 ST=2.12 ET=2.18 W=ARE_ P=0.25 C=0
T=5 ST=2.12 ET=2.18 W=_DELETE_ P=0.25 C=0
..........................................
..........................................

Id = Filename
T = Segment
ST = Start time
ET = End time
W = Word
P = Probability
C = Channel

For example, I want to search for a word that occurs up to time 1.57:
(w:HeLLO) AND (t:[0 TO 1.57]). But because all the values of each kind are
stored together in one field per file (t, st, et, ...), this doesn't work: it
also finds files where HELLO occurs later than 1.57.

Do you have any ideas how to do this? Thanks a lot for your help.



--
View this message in context: http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Design optimal Solr Schema

Posted by "tomas.kalas" <ka...@email.cz>.

Oh yes, I want to display the stored data in an HTML file. I have two pages:
the first has a form and shows the results. Each result is a link (by ID) to
the file, and the second page shows the whole conversation. And what did you
mean by separating each conversation interaction? Thanks.




Re: Design optimal Solr Schema

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.

Are you going to use the values stored in Solr to display the data in HTML? For searching purposes I suggest removing all the HTML tags and indexing only the plain text. For this you could use the HTMLStripCharFilterFactory char filter, which will "clean" your content and pass on only the actual text, which in the end is what you're going to use.

If you are going to use the Solr results to display the content in an HTML page, then I would suggest keeping your index clean and indexing only the actual searchable text, with no HTML. I use this filter myself to strip HTML out of crawled pages. Also, what does a Solr document mean to you? Is an entire conversation modeled as one Solr document? Have you considered putting each conversation interaction in its own document?
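
A minimal sketch of the "one document per interaction" idea, taking each transcript line (one word) as the interaction. It is only an illustration: the lower-cased field names, the `file` field, and the `id` scheme are assumptions and not part of the original schema. With one document per word, the word and its times sit in the same document, so a word condition and a time range apply to the same entry.

```python
import re

# Field names follow the transcript legend (T, ST, ET, W, P, C).
# The lower-cased names, the "file" field and the id scheme are assumptions.
LINE_RE = re.compile(
    r"T=(?P<t>\d+)\s+ST=(?P<st>\S+)\s+ET=(?P<et>\S+)\s+"
    r"W=(?P<w>\S+)\s+P=(?P<p>\S+)\s+C=(?P<c>\d+)"
)


def transcript_to_docs(file_id, lines):
    """Turn each transcript line into its own Solr-style document (a dict)."""
    docs = []
    for n, line in enumerate(lines):
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip filler lines such as "......"
        docs.append({
            "id": f"{file_id}#{n}",  # hypothetical unique key per word
            "file": file_id,
            "t": int(m["t"]),
            "st": float(m["st"]),
            "et": float(m["et"]),
            "w": m["w"],
            "p": float(m["p"]),
            "c": int(m["c"]),
        })
    return docs


sample = [
    "T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.000000 C=0",
    "T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0",
    "T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0",
    "..........",
]
docs = transcript_to_docs("01.cn", sample)

# Emulates (w:HELLO) AND (et:[0 TO 1.57]) evaluated per word document.
hits = [d for d in docs if d["w"] == "HELLO" and 0 <= d["et"] <= 1.57]
```

The trade-off is that phrase and proximity searches across words no longer work inside a single document, so this layout mainly suits "which word at which time" lookups.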


Re: Design optimal Solr Schema

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Ok. Make sure to post in the right topics. People get super confused
when the conversation thread changes.

Maybe ignore these last couple of messages and post the new one as
appropriate (separate or in another thread). That way the right people
will see it.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 11 December 2014 at 09:16, tomas.kalas <ka...@email.cz> wrote:
> Oh no, I wanted to reply in this topic, where you helped me with the synonym
> filter:
>
> http://lucene.472066.n3.nabble.com/Alternative-searching-td4172339.html
>
> but I had this topic open too, and after checking my answer in Google
> Translate I copied it here.
>
> Now my task has changed: I no longer have to search up to a specific time,
> only for a phrase, but with alternatives. The synonym filter is a good idea,
> but a specific word can have several alternatives in several cases, and that
> is the problem I am dealing with now. I asked about it in this topic:
> http://lucene.472066.n3.nabble.com/Alternative-synonymum-td4173694.html
>
> Sorry for the chaos.

Re: Design optimal Solr Schema

Posted by "tomas.kalas" <ka...@email.cz>.

Oh no, I wanted to reply in this topic, where you helped me with the synonym
filter:

http://lucene.472066.n3.nabble.com/Alternative-searching-td4172339.html

but I had this topic open too, and after checking my answer in Google
Translate I copied it here.

Now my task has changed: I no longer have to search up to a specific time,
only for a phrase, but with alternatives. The synonym filter is a good idea,
but a specific word can have several alternatives in several cases, and that
is the problem I am dealing with now. I asked about it in this topic:
http://lucene.472066.n3.nabble.com/Alternative-synonymum-td4173694.html

Sorry for the chaos.




Re: Design optimal Solr Schema

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Tomas,

You have a difficult use case. You seem to have a speech recognition
domain and you want to be able to search that transcribed text with
reference back to timing. It's an interesting problem, but not an easy
one. Certainly not something one can answer all at once.

The issue here is representation of that text. You want it both
per-word (so you have timing) and as a flowing text (so you could find
it). And then, you also have the problem of how to express it from the
PHP client.

But here are things you need to think about:
1) Do you have groups in your word sequence? You say find "how are
you" but what about "there ah how" which would still be together in
the stream but is the end of one sentence and the start of another. If
you do want to find any sequence of consecutive words, you need to index
them together and you end up with one very long document. If not, you
need to decide how you are going to break your continuous text into
groups (based on SILENCE, timing, or something else).

2) Then you have the association of a multi-word sequence to time. You
say "Good morning to you" is at 5.25, but that's not possible as each
word has its own duration. Does it mean the word Good was at 5.25? Can
they find "Morning to you" and will it still return 5.25? Or 5.28?
This design decision will affect how you index it.

3) And what happens if the matched text happens twice like "Chao" -
hello and "Chao" - goodbye. If you want two separate documents
returned, this implies two documents in Solr. So, that goes hand in
hand with (1) above.

4) Then you have a whole highlighting issue, which I am not even going
to start on, except that the text being highlighted needs to be in one
field, so that has impact too.
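
One possible way to make decisions (1) and (2) concrete is sketched below. It is only an illustration under stated assumptions: groups are broken on `_SILENCE_`, only the most probable alternative per segment is kept, `_DELETE_` and `<s>` are treated as non-words, and the time reported for a phrase is the start time of its first word. None of these choices comes from the thread; each is one option among those listed above.

```python
def build_groups(words):
    """Split a transcript into word groups.

    words: list of (t, st, w, p) tuples in transcript order.
    Returns a list of groups, each a list of (start_time, word) pairs.
    """
    # Keep only the most probable alternative for each segment number t
    # (on a tie, the first one seen wins).
    best = {}
    for t, st, w, p in words:
        if t not in best or p > best[t][2]:
            best[t] = (st, w, p)

    groups, current = [], []
    for t in sorted(best):
        st, w, _ = best[t]
        if w == "_SILENCE_":  # assumed group boundary
            if current:
                groups.append(current)
            current = []
        elif w not in ("_DELETE_", "<s>"):  # assumed non-word markers
            current.append((st, w))
    if current:
        groups.append(current)
    return groups


def phrase_start(group, phrase):
    """Return the start time of the first word of `phrase`, or None."""
    tokens = [w.lower() for _, w in group]
    wanted = phrase.lower().split()
    for i in range(len(tokens) - len(wanted) + 1):
        if tokens[i:i + len(wanted)] == wanted:
            return group[i][0]
    return None


words = [
    (0, 0.00, "_SILENCE_", 0.0),
    (1, 1.54, "<s>", 1.0),
    (2, 1.54, "HELLO", 1.0),
    (2, 1.54, "HALLO", 2.06115e-9),
    (3, 1.57, "_DELETE_", 1.0),
    (3, 1.57, "NO", 2.06115e-9),
    (4, 1.70, "HOW", 1.0),
    (5, 2.12, "ARE_", 0.25),
    (5, 2.12, "_DELETE_", 0.25),
]
groups = build_groups(words)
```

Each group could then become one Solr document, with its words in a single text field (which also helps with highlighting) and the per-word start times kept alongside for looking up the time of a match.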

Regards,
   Alex.

On 11 December 2014 at 03:33, tomas.kalas <ka...@email.cz> wrote:
> Thanks for the help. As Alex wrote, I used the synonym filter, and it does
> what I want. For example, when I define Hello, Hi as synonyms, the sentence
> is "Hello how are you", and my query is "Hi how are you", it finds the
> sentence too.

Re: Design optimal Solr Schema

Posted by "tomas.kalas" <ka...@email.cz>.

Thanks for the help. As Alex wrote, I used the synonym filter, and it does
what I want. For example, when I define Hello, Hi as synonyms, the sentence
is "Hello how are you", and my query is "Hi how are you", it finds the
sentence too.




Re: Design optimal Solr Schema

Posted by "tomas.kalas" <ka...@email.cz>.

Thanks for your help.
OK, I will try to explain it once more; sorry for my English.
I need several functions in my search.

1.) I will naturally have a lot of documents, and I want to find out whether,
for example, the words of a phrase occur within 5 words of each other. I used
w:"Good morning"~5. (It works in the Solr example, but I don't know how to do
it in my project.)

2.) Find a word (phrase) up to a certain time, for example "Good morning" up
to time 5.25.

3.) And, if possible, the order of the words. I'm using the Solarium client
for highlighting, and I want to highlight words only in this order, Hello How
Are you for example. If the field contains the words *hello* you are *how are
you*, then words that are not in the searched order should be skipped. This is
not necessary, though; primarily I have a problem with the first 2 points.
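
Points 1 and 2 above could be combined in a single request along these lines. This is only a sketch, assuming each Solr document holds one group of words in a tokenized field `w` together with a numeric start-time field `st`; `build_select_params` is a hypothetical helper, and only the standard `q` and `fq` request parameters are relied on.

```python
from urllib.parse import urlencode


def build_select_params(phrase, slop=None, max_start=None):
    """Build q/fq parameters for a phrase search with optional slop and time limit."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    q = f'w:"{escaped}"'
    if slop is not None:
        q += f"~{slop}"  # proximity: words within `slop` positions
    params = {"q": q}
    if max_start is not None:
        # Assumed field: st holds the start time of the indexed group.
        params["fq"] = f"st:[0 TO {max_start}]"
    return params


params = build_select_params("Good morning", slop=5, max_start=5.25)
query_string = urlencode(params)  # append to the core's /select URL
```

The range filter only narrows the right documents if the time is stored on the same document as the words it belongs to, which is why the document layout matters more than the query syntax here.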

How do I design an ideal schema and parse the data from the source file?

I've made a demo with basic searching: on one page I have a form, and the
results are links to files by id (I use the filename as the id). When I click
a link, I pass the query as a parameter, and on the result page I fetch the
data needed to display the result.

The result file is a table with the whole transcribed interview and the
results highlighted.

Thanks for help.




Re: Design optimal Solr Schema

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I am afraid it is not very clear what you are trying to do here (the
sentence below). Could you explain again the business-level results you want?
Are you trying to search for words within a particular given time range?
Can those words span the segments? Or are you trying to find segments
with all their words from given segments?

Your Solr design should be driven by what you want to find, not what
you have to index.

Regards,
   Alex.
On 30 October 2014 10:27, tomas.kalas <ka...@email.cz> wrote:
> I want to search for example word which is to time 1.57 (w:HeLLO) AND (t:[0 TO 1.57]).