Posted to java-user@lucene.apache.org by Dmitry <dm...@hotmail.com> on 2007/07/26 05:38:04 UTC

Displaying results in the order

Is there a way to update a document in the Index without causing any change
to the order in which it comes up in searches?

thanks,
DT,
www.ejinz.com
Search everything
news, tech, movies, music


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Linear Hashing in Lucene?

Posted by Dmitry <dm...@hotmail.com>.

Karl,
Thanks for the info; it's very difficult to find anything about the Orion
algorithm and linear hashing.
I will check the thread.

DT,
www.ejinz.com
Search Engine Advertisement

Re: Linear Hashing in Lucene?

Posted by karl wettin <ka...@gmail.com>.

On 26 Jul 2007, at 05:56, Dmitry wrote:

> 1. Does an ontology wrapper exist in the Lucene implementation?

Not publicly available, as far as I know. There has been some
discussion on the forums, though: you could search Nabble for OWL, RDF
or similar terms and get in touch with the authors of those threads
to see whether they came up with something you can use. Feel free
to report back on your success or failure in locating something.

--
karl


Linear Hashing in Lucene?

Posted by Dmitry <dm...@hotmail.com>.

Hey,
Some general questions about Lucene.
1. Does an ontology wrapper exist in the Lucene implementation?
2. Does Lucene use linear hashing?

Thanks,
DT,
www.ejinz.com
Search news


Re: Detection of index dublicates in Lucene

Posted by Michael Stoppelman <st...@gmail.com>.

A couple of thoughts here...

You could hash (e.g. with MD5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in each hash bucket as
the non-dup document and then delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line process.

Alternatively, if you don't want to run over the entire index all the time,
you could add a field to each Lucene Document that contains some kind of
hash (e.g. MD5) of its content. At query time you could then make sure all
the Documents returned are unique.
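
The batch approach described above can be sketched roughly as follows. This
is only an illustration using the Java standard library: the class
ContentHashDedup and its methods are hypothetical names, the documents are
plain strings keyed by an assumed integer id, and no Lucene API is involved.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): finds documents with identical text.
final class ContentHashDedup {

    // Hex-encoded MD5 digest of the UTF-8 bytes of the text.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first document seen for each hash and returns the ids of the rest.
    static List<Integer> duplicateIds(Map<Integer, String> docs) {
        Map<String, Integer> firstSeen = new LinkedHashMap<>();
        List<Integer> duplicates = new ArrayList<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            String hash = md5Hex(e.getValue());
            if (firstSeen.putIfAbsent(hash, e.getKey()) != null) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "the quick brown fox");
        docs.put(2, "a different document");
        docs.put(3, "the quick brown fox");
        System.out.println(duplicateIds(docs)); // prints [3]
    }
}
```

Note that an exact hash only catches documents whose text is identical
byte for byte; near-duplicates would need a different technique.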

-M


Re: Detection of index dublicates in Lucene

Posted by karl wettin <ka...@gmail.com>.

On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:

> I believe Nutch has a duplicate detection algorithm.  I don't know  
> how easy it would be to run independently on a Lucene index.

A number of near-duplicate detection ideas have also been presented on
the forums before.

This is one of the threads:
<http://www.nabble.com/Checking-for-duplicates-inside-index-tf1665494.html>


-- 
karl



Re: Detection of index dublicates in Lucene

Posted by Grant Ingersoll <gs...@apache.org>.

I believe Nutch has a duplicate detection algorithm.  I don't know  
how easy it would be to run independently on a Lucene index.

-Grant


Detection of index dublicates in Lucene

Posted by Dmitry <dm...@hotmail.com>.

We are trying to find out whether there is any implementation for Lucene
that detects index duplicates.
Assume we have a set of documents, where a document is a bunch of words.
After we have created indexes for the same document, we need to know that
all indexes will be unique for a specific document (lexical equivalence).

Can we have an implementation that accounts for both the situation where
the algorithm has not detected a duplicate and the situation where the
algorithm has reported a false duplicate? In that case we could find all
duplicate indexes.

We could use the same algorithm to detect duplicate documents; in that
case we would save time and get better performance by not running the
indexing services against such a document.

Any suggestions would be welcome.

Thanks,

DT,

www.ejinz.com

Search Engine News





Re: lucene integration with PDM Windchill (Product Data Management System)

Posted by Dmitry <dm...@hotmail.com>.

Karl,

By the way, we have already created one solution, but we need more
embedded interfaces and implementation functions for PDM Windchill.
(You can find an article about that solution at www.profilesmagazine.com.)

Thanks,
DT,
www.ejinz.com
Search Engine Platform News


Re: lucene integration with PDM Windchill (Product Data Management System)

Posted by karl wettin <ka...@gmail.com>.

I'm not sure I understand how all the objects you talk about relate to
each other, nor do I see a concrete question in your post.

Is your problem that you do not know how to denormalize the object
graphs in order to represent them as Lucene documents that make sense
to search in? Or are you asking how to implement an ad hoc service
for your PDM that communicates with Lucene?

The latter is probably a question we will not be able to help out
with beyond explaining how readers, writers and searchers work.
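
As an illustration of what denormalizing an object graph could mean here, a
minimal sketch follows. Everything in it is an assumption: Master and
Iteration are hypothetical stand-ins for a master/iteration pair and are NOT
Windchill classes, the field names are invented, and the flat map merely
mimics the shape of a single searchable document rather than using the
Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: flattens a two-object graph into one flat set of fields.
final class DenormalizeSketch {

    // Stand-in for attributes shared by all versions (not a Windchill class).
    static final class Master {
        final String number;
        final String name;

        Master(String number, String name) {
            this.number = number;
            this.name = name;
        }
    }

    // Stand-in for one versioned object that points back to its master.
    static final class Iteration {
        final Master master;
        final String version;
        final String contentText;

        Iteration(Master master, String version, String contentText) {
            this.master = master;
            this.version = version;
            this.contentText = contentText;
        }
    }

    // Copies attributes from both objects into field name -> value pairs,
    // so that a search never has to follow the reference at query time.
    static Map<String, String> toFields(Iteration it) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("number", it.master.number);
        fields.put("name", it.master.name);
        fields.put("version", it.version);
        fields.put("content", it.contentText);
        return fields;
    }

    public static void main(String[] args) {
        Master master = new Master("0001", "Bracket specification");
        Iteration iteration = new Iteration(master, "A.2", "Text extracted from the PDF");
        System.out.println(toFields(iteration));
    }
}
```

Each entry of such a map would correspond to one field of an indexed
document; which attributes belong in it depends on what users search by.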



Re: lucene integration with PDM Windchill (Product Data Management System)

Posted by Dmitry <dm...@hotmail.com>.

Karl,

Thanks for the help.

I will try to explain the requirements. There is a PDM (Product Data
Management) system, which manages the data related to products and
supports procedures during the product lifecycle; it deals with the
development and production infrastructure.
The following design patterns can be used to implement and customize this
system, and are the points at which to integrate Lucene with Windchill
(PDM):

 - The Object Reference design pattern (encapsulates details concerning
persistable objects and their unique database keys, e.g. for objects such
as WTParts, WTAssemblies and WTDocuments and their relations)
 - The Business Service design pattern (building Windchill services, e.g.
using remote interfaces and service events)
 - The Master-Iteration design pattern (two objects, Mastered and Iterated;
all versioned data in the Windchill PDM system, <PDMLink> and
<ProjectLink>, adheres to it)

We need to use Lucene with WTObjects such as WTDocuments and WTParts to
create indexes and provide a search service.
The properties of a document (for instance MS documents or PDF files) are
specified on the WTDocument class, but the sum of the properties is stored
on WTDocumentMaster. WTDocument implements ContentHolder (primary content
and secondary content). WTDocument can also create two types of
relationships to other documents using WTDocumentUsageLink (which builds
relationships between documents and the document structure). The
attributes that need to be used for searching are on either the
WTDocumentMaster or the WTDocument class.

This was just a short description of the architecture of PDMLink (Windchill).
So we need to create some Lucene services (processors) embedded in the
system, using extended interfaces, to create indexes and to search all
documents by attributes.

thanks,
Dmitry
www.ejinz.com
Search Engine News

--------------------------------------------------------------------

> Trying to integrate a PDM system (the WTPart object) with the Lucene
> indexing and search framework. Part of the work is integration with the
> persistence layer: <hibernate> + index storage + MySQL.

You have product data management software of some sort that uses
MySQL via Hibernate for persistence, and now you want to use Lucene
to search the data?

> Could not find a good solution ...

What possible solutions did you find, and what was the problem with
these?

> Please advise.

My guess is that very few people in this forum know how Windchill
works, what a WTPart is or what the complexity of the system is in a
persistent state. So you probably need to explain your requirements a
bit more in order to get a helpful answer or discussion going.


-- 
karl




Re: Displaying results in the order

Posted by karl wettin <ka...@gmail.com>.

On 26 Jul 2007, at 05:38, Dmitry wrote:

> Is there a way to update a document in the Index without causing  
> any change
> to the order in which it comes up in searches?

I would say no: the score is calculated from the matching terms, the
content length, etc. For details, see
<http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html>

You might want to explain why you want to do this.
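
To illustrate why an update can move a document in the results, here is a
minimal sketch. It assumes a toy score for a single-term query that rises
with term frequency and falls with document length; the class ScoreSketch
and its formula are hypothetical simplifications for illustration and are
not Lucene's actual Similarity implementation.

```java
// Hypothetical toy scorer, only meant to show that content changes alter ranking.
final class ScoreSketch {

    // Score of one document for a single-term query:
    // sqrt(term frequency) divided by sqrt(number of tokens).
    static double score(String term, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        int freq = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                freq++;
            }
        }
        return Math.sqrt(freq) / Math.sqrt(tokens.length);
    }

    public static void main(String[] args) {
        String other = "lucene index search library";
        String before = "lucene search";
        // The "update" appends text that does not mention the query term.
        String after = before + " with many extra words about something else";

        // The short document outranks the other one before the update...
        System.out.println(score("lucene", before) > score("lucene", other)); // prints true
        // ...and falls below it once it has grown longer.
        System.out.println(score("lucene", after) < score("lucene", other));  // prints true
    }
}
```

With any length-sensitive score of this kind, an update that changes the
text can change the document's position relative to the others.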

-- 
karl
