Posted to solr-user@lucene.apache.org by James Strassburg <js...@gmail.com> on 2015/03/19 19:37:51 UTC
Re: Have anyone used Automatic Phrase Tokenization
(AutoPhrasingTokenFilterFactory) ?
Sorry, I've been away from this list for a bit. When I was
working with the APTF code I rewrote a big chunk of it and left out
the emission of the original tokens, as I didn't need it at the time. That
feature could easily be added back in. I will see if I can find a bit of
time for that.
As for the other part of your message, are you suggesting that the token
indexes are not correct? There is a bit of a formatting issue with the text
and I'm not sure what you're getting at. Can you explain further please?
On Sun, Feb 8, 2015 at 3:04 PM, trhodesg <tr...@gmail.com> wrote:
> Thanks to everyone for the thought, time and effort put into
> AutoPhrasingTokenFilter (APTF)! It's a real lifesaver.
> While trying to add APTF to my indexing, I discovered that the original
> (TS) version throws an exception while indexing a 100MB PDF. The error is
>     Exception writing document to the index; possible analysis error
> The modified (JS) version runs without error, but it removes the tokens
> used to create the phrase. They are needed.
> Before looking into this I have a question. Solr would normally tokenize
> the phrase
>     the peoples republic of china is
> as
>     the(1) peoples(2) republic(3) of(4) china(5) is(6)
> Defining the APTF phrase file as [...] the Solr admin analysis page
> reports that the APTF indexer tokenizes the phrase as [...]
> Would it be possible for someone to explain the reasoning behind the
> discontinuous token numbering? As it is now, phrase queries such as
> "republic of china" will fail. And I can't get proximity queries like
> "republic of"~10 to work either (though it seems they should). Wouldn't
> it be more flexible to return the following tokenization [...]
> This allows spurious matches such as "peoples peoplesrepublic", but it
> seems like this type of event would be very rare. It has the advantage of
> allowing phrase queries to continue working the way most users think.
> Thank you for supporting more than one entity definition per phrase (i.e.
> peoplesrepublic and peoplesrepublicofchina). This type of contraction is
> common in longer documents, especially when the first used phrase ends
> with a preposition. It helps support robust matching.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Re: Have anyone used Automatic Phrase Tokenization
(AutoPhrasingTokenFilterFactory) ?
Posted by RohanaR <ro...@gmail.com>.
Has this been fixed now so that phrase queries given in double quotes work? I
am trying this and encountered the same problem, because the original order
of tokens is not preserved in the index. How can I fix this (if it is not
fixed yet)?
--
View this message in context: http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4234059.html
Re: Have anyone used Automatic Phrase Tokenization
(AutoPhrasingTokenFilterFactory) ?
Posted by afrooz <af...@gmail.com>.
Hi,
I am a .NET developer, but I need to use Solr and specifically this good
plugin, "AutoPhrasingTokenFilter".
I searched everywhere and couldn't find useful information. Can anyone
help me run it in Solr 5.0, or even in previous versions? I am not able to
add it to my Solr: it throws the error below when I put the lib folder,
which also contains my jar files for the "AutoPhrasingTokenFilter", under
the core.
Error:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
JVM Error creating core [gettingstarted_shard1_replica1]: class
org.apache.lucene.codecs.diskdv.DiskDocValuesFormat$1 cannot access its
superclass org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer
--
View this message in context: http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4195182.html
Re: Have anyone used Automatic Phrase Tokenization
(AutoPhrasingTokenFilterFactory) ?
Posted by James Strassburg <js...@gmail.com>.
I have an autophrase configured for 'wheel chair' and if I run analysis for
'super wheel chair awesome' such that it would index to 'super wheelchair
awesome' this is how mine behaves:
http://i.imgur.com/iR4IgGp.png
When I did the implementation that is how I thought the positioning should
work. Do you think it should be different?
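Independently of the screenshot, the replace-style behaviour described above can be modelled in a few lines of Python. This is only a sketch, not the filter's actual code: the function name autophrase and its arguments are hypothetical, and it assumes that the merged token takes a single position and that the tokens after it are numbered consecutively.

```python
def autophrase(tokens, phrases):
    """Replace each configured multi-word phrase with one joined token.

    Returns (term, position) pairs with 1-based positions.
    Assumption (not taken from the real filter): the joined token takes one
    position and the tokens after it are numbered consecutively.
    """
    # Split phrases into word lists; try longer phrases first so that the
    # longest match wins. Empty phrases are ignored.
    phrase_words = sorted(
        (p.split() for p in phrases if p.split()), key=len, reverse=True
    )
    out = []
    i = 0
    while i < len(tokens):
        for words in phrase_words:
            if tokens[i:i + len(words)] == words:
                out.append("".join(words))  # the original words are dropped
                i += len(words)
                break
        else:
            # No phrase starts at this token; pass it through unchanged.
            out.append(tokens[i])
            i += 1
    return [(term, pos) for pos, term in enumerate(out, start=1)]


print(autophrase("super wheel chair awesome".split(), ["wheel chair"]))
```

In this model the words 'wheel' and 'chair' no longer exist in the output, which is the trade-off discussed in the thread: the merged token is searchable, but the original words are not.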
On Fri, Mar 20, 2015 at 11:10 AM, trhodesg <tr...@gmail.com> wrote:
> Sorry, I can see my post is munged.
> This seems to display it legibly:
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-td4173808.html
>
> I'm new to all this, so I hesitate to say the indexing isn't
> correct. But my understanding is that the query "republic of china"
> will only match the indexing republic(n) of(n+1) china(n+2). Since
> the original APTF indexes this as republic(n) of(n+3) china(n+7),
> that query will fail. Wouldn't it be more logical to leave the
> original token numbering unchanged and just add the phrase token
> with the same number as the last word in the matched series?
>
> BTW, I looked at your code re this. It is quite informative to a
> newbie. Thanks!
Re: Have anyone used Automatic Phrase Tokenization
(AutoPhrasingTokenFilterFactory) ?
Posted by trhodesg <tr...@gmail.com>.
Sorry, I can see my post is munged.
This seems to display it legibly:
http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-td4173808.html
I'm new to all this, so I hesitate to say the indexing isn't
correct. But my understanding is that the query "republic of china"
will only match the indexing republic(n) of(n+1) china(n+2). Since
the original APTF indexes this as republic(n) of(n+3) china(n+7),
that query will fail. Wouldn't it be more logical to leave the
original token numbering unchanged and just add the phrase token
with the same number as the last word in the matched series?
BTW, I looked at your code re this. It is quite informative to a
newbie. Thanks!
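The position argument above can be checked with a small model. The Python below is a simplified stand-in for an exact (slop 0) phrase match, not Lucene code: phrase_matches is a hypothetical helper, n is arbitrarily set to 3, and the "proposed" layout is one reading of the suggestion (original tokens keep their positions, each phrase token shares the position of the last word it covers).

```python
def phrase_matches(index, phrase):
    """Return True if the phrase terms occur at consecutive positions.

    `index` is a list of (term, position) pairs; several terms may share
    one position. This models an exact phrase query with no slop.
    """
    positions = {}
    for term, pos in index:
        positions.setdefault(term, set()).add(pos)
    terms = phrase.split()
    if not terms:
        return False
    # Try every occurrence of the first term as the start of the phrase.
    return any(
        all(start + k in positions.get(t, ()) for k, t in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


# Positions quoted in the message for the original filter, with n = 3.
gapped = [("republic", 3), ("of", 6), ("china", 10)]

# Assumed layout for the proposal: numbering unchanged, phrase tokens
# stacked on the last word of each matched series.
proposed = [
    ("the", 1), ("peoples", 2),
    ("republic", 3), ("peoplesrepublic", 3),
    ("of", 4),
    ("china", 5), ("peoplesrepublicofchina", 5),
    ("is", 6),
]

print(phrase_matches(gapped, "republic of china"))          # False
print(phrase_matches(proposed, "republic of china"))        # True
print(phrase_matches(proposed, "peoples peoplesrepublic"))  # True
```

The last line is the spurious match mentioned in the earlier message: because peoplesrepublic sits at the position right after peoples, the two-term phrase matches even though no document text reads that way.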
--
View this message in context: http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4194205.html