Posted to user@lucenenet.apache.org by "Morgenweck, William" <mo...@musc.edu> on 2019/03/29 01:47:58 UTC

index publication articles

I need to ask this question because I think it might be something Lucene.Net can do, but I'm not sure.  I have a list of 8,000+ words that are considered Cancer Terms by the NCI: https://www.cancer.gov/publications/dictionaries/cancer-terms?expand=A
I have the terms stored locally, but I need to index articles that I have downloaded and count the number of times each word appears in each article.  The purpose is to determine whether an article is cancer related.  I work for an NCI-Designated Cancer Center and I need a way to analyze the publications of our researchers who are members of the Cancer Center.  I know that a slow way to do this is to loop over each and every word and check whether IndexOf gives a positive result, and I have also found a suggestion to build a match pattern using Regex with all 8,000 words.

But I feel that if I index the Cancer Terms using Lucene.net, I should be able to do the same thing, only faster?

If I'm totally off the mark just let me know.  I've been on the user group for over 15 years and love the potential.

Thanks,
Bill





-------------------------------------------------------------------------
This message was secured via TLS by MUSC.

Re: index publication articles

Posted by Erik Hatcher <er...@gmail.com>.
I’m late as well.  My suggestion is to use Solr and the Solr Tagger: index the terms into a collection, send docs to the tagger endpoint, and it’ll tag ‘em, giving you the locations and the matched terms.

   Erik
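
A rough sketch of what that call could look like from C#, assuming a Solr collection named cancer_terms with the tagger request handler registered at /tag; the endpoint name, field list and parameters follow the Solr Ref Guide's tagger example and would need to match the actual configuration:

    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    class SolrTaggerClient
    {
        // Hypothetical collection name and handler path; adjust to your Solr setup.
        const string TagUrl =
            "http://localhost:8983/solr/cancer_terms/tag"
            + "?overlaps=NO_SUB&tagsLimit=5000&fl=id,name&wt=json&matchText=true";

        static async Task<string> TagArticleAsync(string articleText)
        {
            using var http = new HttpClient();
            // The tagger takes the raw text to be tagged as the POST body.
            var body = new StringContent(articleText, Encoding.UTF8, "text/plain");
            var response = await http.PostAsync(TagUrl, body);
            response.EnsureSuccessStatusCode();
            // The JSON response lists each tag with its offsets and the matching term documents.
            return await response.Content.ReadAsStringAsync();
        }
    }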

> On May 13, 2019, at 14:08, Andy Pook <an...@gmail.com> wrote:

Re: index publication articles

Posted by Andy Pook <an...@gmail.com>.
A little late to this party...

Another approach is to add a custom tokenizer. This will add an extra token
(with a special word, like "ccc") at the same position whenever it hits one of
your key words or phrases. As a result you can just search for "ccc", which
will return all docs that contain any of your words. You also have an index
where you can do general searches, perhaps in combination with the special
token (such as "ccc AND ufo", to find out why UFOs cause cancer :)

At a previous gig we had a whole taxonomy of words and phrases that were
tagged this way. Searches could then be made on concepts and abstractions
rather than complex combinations of brackets, ANDs and ORs.
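
A minimal sketch of that idea against Lucene.NET 4.8, using a SynonymFilter to inject the marker token rather than a hand-written tokenizer; the marker word "ccc", the sample terms and the sample text are placeholders, and multi-word phrases would additionally need SynonymMap's word-separator handling:

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.Synonym;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    class CccMarkerSketch
    {
        static void Main()
        {
            const LuceneVersion version = LuceneVersion.LUCENE_48;

            // Map every cancer term to the marker token "ccc"; the final 'true' keeps the original term too.
            var builder = new SynonymMap.Builder(true);
            foreach (var term in new[] { "melanoma", "carcinoma", "lymphoma" })   // stand-in for the 8,000-term list
                builder.Add(new CharsRef(term), new CharsRef("ccc"), true);
            SynonymMap map = builder.Build();

            var text = "The melanoma study was published last year.";
            var tokenizer = new StandardTokenizer(version, new StringReader(text));
            TokenStream stream = new LowerCaseFilter(version, tokenizer);
            stream = new SynonymFilter(stream, map, true);   // true = ignore case

            // Walk the stream: "ccc" is emitted at the same position as "melanoma",
            // so a search for "ccc" on the indexed field finds every tagged article.
            var termAttr = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
                Console.WriteLine(termAttr.ToString());
            stream.End();
            stream.Dispose();
        }
    }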

On Fri, 29 Mar 2019 at 21:39, Morgenweck <mo...@gmail.com> wrote:


Re: index publication articles

Posted by Morgenweck <mo...@gmail.com>.
Thanks to everyone -- because it is a set number of documents (about 1,000)
and a set number of words (8,000+), and time does not matter, I'm initially
going to go with Regex.  I found a company
(https://bytescout.com/we-fight-against-cancer) that will donate their PDF
extraction software and will work with me on developing the Regex.  The number
of hits for each word will be stored as metadata for each of the articles.
Since I have total control over the words and it only needs to be run once per
word, I can run this as a job or a nightly process and save the data.  Once
done it's not used again until a new article appears, and then only for that
one.
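
For what it's worth, once the text is out of the PDFs the counting itself stays small; a sketch along these lines (term list and article text are placeholders) counts whole-word, case-insensitive matches per term:

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class RegexTermCounter
    {
        static Dictionary<string, int> CountTerms(string articleText, IEnumerable<string> cancerTerms)
        {
            var counts = new Dictionary<string, int>();
            foreach (var term in cancerTerms)
            {
                // Regex.Escape protects terms that contain punctuation, e.g. "non-small cell".
                var pattern = @"\b" + Regex.Escape(term) + @"\b";
                counts[term] = Regex.Matches(articleText, pattern, RegexOptions.IgnoreCase).Count;
            }
            return counts;
        }
    }

The per-term counts can then be written out as the per-article metadata described above.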

In regards to the nuclear reactor coffee maker -- I loved it -- but did you
ever have that feeling that you are just missing something?  What I was
thinking in the back of my mind is what Lang said: index the 8,000 words.  I
do plan on doing this in the next step, where a new researcher will come to
the Cancer Center and search for words that they create rather than being
limited to the 8,000 -- topics that let them find other researchers who work
in the same type of area.  That process is where Lucene.net will shine.

Thank you all again

On Fri, Mar 29, 2019 at 10:48 AM Jörg Lang <jl...@evelix.ch> wrote:


AW: index publication articles

Posted by Jörg Lang <jl...@evelix.ch>.
Hi 

I wouldn't go with a regex, because it only gets a hit if the match is 100%.
Using Lucene you can assign a language analyzer when indexing the documents. When searching for your keywords you then get hits for plural/singular forms, and even verb declensions are considered.

This of course comes at the cost that you might get a few hits you personally wouldn't count as hits. But that is the general price of a full-text search.

An idea worth exploring:
- Create a document with your list of 8,000 terms.
- Have it indexed along with all the other documents.
- Do a "more like this" query, giving your "terms" document as input.
- You get a list of documents that contain similar words to the source document, with the most relevant documents ranked first.

You can read about "moreLikeThis" here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
https://lucene.apache.org/core/7_3_1/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html

This might also give you some input.
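
In Lucene.NET 4.8 that idea looks roughly like the following; the field name "body", the analyzer choice and the assumption that the terms document is already indexed (as termsDocId) are all placeholders:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.Queries.Mlt;
    using Lucene.Net.Search;
    using Lucene.Net.Util;

    static class MoreLikeTermsList
    {
        // reader covers the index holding the articles plus the single document
        // that contains the 8,000 terms (termsDocId).
        static TopDocs FindSimilar(IndexReader reader, int termsDocId)
        {
            var mlt = new MoreLikeThis(reader)
            {
                Analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48),
                FieldNames = new[] { "body" },
                MinTermFreq = 1,   // don't drop rare terms from the terms document
                MinDocFreq = 1
            };

            Query query = mlt.Like(termsDocId);     // build a query from the terms document
            var searcher = new IndexSearcher(reader);
            return searcher.Search(query, 100);     // most similar articles ranked first
        }
    }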

Joerg


-----Original Message-----
From: Morgenweck, William <mo...@musc.edu>
Sent: Friday, 29 March 2019 02:48
To: user@lucenenet.apache.org
Subject: index publication articles


Re: index publication articles

Posted by P Burrows <pb...@gmail.com>.
I agree with Chris and Jens that Lucene is overkill, except for one thing:
article relevance.

Instead of searching for single keywords, if you search for all the words at
once (or a subset of them if Lucene can't handle all 8,000 search terms*), you
will get results back ranked by their relevance. Articles with very low
relevance may mention only one keyword, one time, and might not be about
cancer at all (for instance, the word "cancer" itself does not always refer to
the disease).




* Not sure how many terms Lucene can take at a time in one query, but in the
worst case you can divide the words into smaller groups and run those in a
loop. Multi-word relevance still beats single-word searching, even with just
two terms at a time.
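
For reference, Lucene's BooleanQuery limits a single query to 1,024 clauses by default, but the limit is adjustable, so in Lucene.NET 4.8 all the terms can go into one query as optional clauses; a sketch, with the field name and lower-cased terms assumed to match how the articles were analyzed at index time:

    using System.Collections.Generic;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    static class CancerRelevanceSearch
    {
        static TopDocs RankByCancerTerms(IndexSearcher searcher, IEnumerable<string> cancerTerms)
        {
            // The default clause limit is 1,024; raise it so one query can hold all 8,000+ terms.
            BooleanQuery.MaxClauseCount = 10000;

            var query = new BooleanQuery();
            foreach (var term in cancerTerms)
                query.Add(new TermQuery(new Term("body", term.ToLowerInvariant())), Occur.SHOULD);

            // Articles that match more (and rarer) cancer terms score higher.
            return searcher.Search(query, 100);
        }
    }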

--
with excessive obsequiousness,

Patrick Burrows
http://www.BurrowsCorp.com


On Fri, Mar 29, 2019 at 9:51 AM Chris Moschini <ch...@brass9.com> wrote:


Re: index publication articles

Posted by Chris Moschini <ch...@brass9.com>.
You could use Lucene, but if you have to ask, you're sort of asking if you
could build a nuclear reactor to power your coffee maker. Yes... but you're
overcomplicating it.

I agree a simple Regex is the fastest way -- in fact I think you can do all of
this with grep in Bash (which, if you're on Windows, you can get by just
installing Git for Windows/MSysGit), or in whatever your favorite programming
language is; I'm sure you could whip up the regex and loop in a few minutes.

I dislike when people tell me "Yes, but don't do it" without answering how. So:
you shouldn't do this, but if you wanted to, you'd fire up a Lucene Analyzer,
then an IndexWriter, feed the writer the documents, then use Terms to ask about
each of the 8,000 words. Store the results in an array or db or whatever data
structure you like. Done.

But if you're not familiar with that Analyzer, IndexWriter, etc., your shorter
path is the above Regex stuff.
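
For completeness, a bare-bones version of that flow in Lucene.NET 4.8 might look like the following; the field names, the in-memory directory and the way the per-article counts are read back from the postings are illustrative choices, and the lookup terms are lower-cased to match StandardAnalyzer's output:

    using System;
    using System.Collections.Generic;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    static class ArticleTermCounts
    {
        const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

        static void Run(IEnumerable<(string Id, string Text)> articles, IEnumerable<string> cancerTerms)
        {
            // 1. Analyzer + IndexWriter: feed the writer the downloaded articles.
            var dir = new RAMDirectory();                      // or FSDirectory.Open(path) for an on-disk index
            var analyzer = new StandardAnalyzer(MatchVersion);
            using (var writer = new IndexWriter(dir, new IndexWriterConfig(MatchVersion, analyzer)))
            {
                foreach (var (id, text) in articles)
                {
                    var doc = new Document();
                    doc.Add(new StringField("id", id, Field.Store.YES));
                    doc.Add(new TextField("body", text, Field.Store.NO));
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            // 2. For each term, walk its postings to get a count per article.
            using (var reader = DirectoryReader.Open(dir))
            {
                foreach (var term in cancerTerms)
                {
                    var postings = MultiFields.GetTermDocsEnum(
                        reader, MultiFields.GetLiveDocs(reader), "body",
                        new BytesRef(term.ToLowerInvariant()));
                    if (postings == null) continue;            // term does not occur in any article

                    int docId;
                    while ((docId = postings.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
                    {
                        string articleId = reader.Document(docId).Get("id");
                        Console.WriteLine($"{articleId}: '{term}' x {postings.Freq}");
                    }
                }
            }
        }
    }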

Re: index publication articles

Posted by 小康 <xi...@cnblogs.com>.
Hi.
Lucene.Net can tell you whether a specific term occurs in an article. The more
times the term appears in the article, the higher the ranking the article can
get.

You can assign a weight to a term query, and the score of each matching
article then reflects how often the specific term appears in it.
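
A small sketch of that in Lucene.NET 4.8, assuming a field called "body" and an existing IndexSearcher; the boost raises this term's weight relative to other clauses, and under the default similarity the returned score grows with how often the term occurs in the document:

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    static class WeightedTermSketch
    {
        static void Show(IndexSearcher searcher)
        {
            // A single term query; Boost makes it count for more when combined with other clauses.
            var query = new TermQuery(new Term("body", "melanoma")) { Boost = 2.0f };

            foreach (var hit in searcher.Search(query, 10).ScoreDocs)
            {
                // Higher term frequency in a document generally means a higher score.
                System.Console.WriteLine($"doc {hit.Doc}: score {hit.Score}");
            }
        }
    }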



On Fri, Mar 29, 2019 at 1:07 PM, Morgenweck, William <mo...@musc.edu> wrote:


RE: index publication articles

Posted by Jens Melgaard <Je...@Systematic.com>.
To be honest, Lucene doesn't really seem to be useful here, even though it could be used. At least not as you outline the problem.

You mention that a slow way to do this is to loop through every word, but if you use Lucene this will happen anyway, because that is how it indexes your document: by looping through the document and recording every single term (if relevant, depending on the analyzer).

That being said, using one of Lucene's Analyzers may very well be useful, as you can normalize terms with it. That helps where terms appear in plural or singular form, if that happens and is important to catch (I don't know the particular field that well so I can't tell, but e.g. it would recognize Car and Cars as the same term, as well as ignore casing etc.).

I would keep this simple and just use a Dictionary over all 8,000 terms, use one of Lucene's analyzers (a fitting one -- perhaps the Standard one is fine), then loop over the words in the document and, every time you hit a term, count that one up.
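
A compact sketch of that approach with Lucene.NET 4.8's StandardAnalyzer; the field name is only used for analysis, the term set and article text are placeholders, and note that this only catches single-word terms (multi-word entries from the NCI list would need extra handling):

    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    static class DictionaryTermCounter
    {
        static Dictionary<string, int> Count(string articleText, ISet<string> cancerTerms)
        {
            var counts = new Dictionary<string, int>();
            var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

            // Run the article through the analyzer and check every normalized (lower-cased)
            // token against the term set, which is assumed to be stored lower-cased as well.
            using (var stream = analyzer.GetTokenStream("body", new StringReader(articleText)))
            {
                var termAttr = stream.AddAttribute<ICharTermAttribute>();
                stream.Reset();
                while (stream.IncrementToken())
                {
                    var token = termAttr.ToString();
                    if (cancerTerms.Contains(token))
                        counts[token] = counts.TryGetValue(token, out var n) ? n + 1 : 1;
                }
                stream.End();
            }
            return counts;
        }
    }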

----

Med venlig hilsen / Kind regards

Jens Melgaard
Architect

Systematic A/S
Søren Frichs Vej 39
8000 Aarhus C
Denmark

Mobile: +45 4196 5119
Jens.Melgaard@Systematic.com

-----Original Message-----
From: Morgenweck, William <mo...@musc.edu> 
Sent: 29 March 2019 02:48
To: user@lucenenet.apache.org
Subject: index publication articles
