Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/06/10 05:13:53 UTC

Assign rich-text document's title name from clustering results

Hi,

I'm currently using Solr 5.1, and I'm looking for a way to have the system
automatically assign a title to the rich-text documents being indexed,
instead of having the user enter it manually. We might have to index a
whole folder of documents at once, so it is not practical for the user to
enter the titles one by one.

I would like to check whether it's possible to run the clustering, get the
results, and use the top-scoring label as the title of the document.
It seems we would need to run the clustering prior to indexing, so I'm not
sure if that is possible.


Regards,
Edwin

Re: Assign rich-text document's title name from clustering results

Posted by Alessandro Benedetti <be...@gmail.com>.
I agree with Upayavira:
title extraction is an activity independent from Solr.
Furthermore, I would say it's easy to extract the title before the Solr
indexing stage.

By the time the content arrives at the Solr update processors, it is
already a String.
If you want to do some clever title extraction, the formatting of your
original document definitely helps, and it is lost at that point.
A good fit for title extraction is either:
your indexing app, or
Apache Tika, if you would like to add a particular customisation.

Remember that Apache Tika is integrated in Solr to provide content
extraction from rich-text documents.

Cheers

-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Assign rich-text document's title name from clustering results

Posted by Upayavira <uv...@odoko.co.uk>.
It depends a lot on what the documents are. Some document formats have
metadata that stores a title. Perhaps you can just extract that.
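
As an illustration of the metadata route, here is a minimal sketch for one
format only. It assumes the files are .docx packages, which keep a title in
the docProps/core.xml part; the function name docx_metadata_title is
hypothetical, and other formats would need their own handling (or a
general extractor such as Apache Tika).

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by the core-properties part of a .docx package.
DC_NS = "http://purl.org/dc/elements/1.1/"


def docx_metadata_title(docx_file):
    """Return the dc:title stored in a .docx file's metadata, or None.

    Hypothetical helper: docx_file is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        try:
            core_xml = package.read("docProps/core.xml")
        except KeyError:
            # The package has no core-properties part at all.
            return None
    title = ET.fromstring(core_xml).find(f"{{{DC_NS}}}title")
    if title is None or not (title.text or "").strip():
        # The part exists but the title is missing or blank.
        return None
    return title.text.strip()
```

A None result means there is no usable metadata title, so the caller would
have to fall back to some other way of deriving one.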

If not, once you've extracted the content, perhaps you could just have a
special field that is the first n words (followed by an ellipsis).
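
A rough sketch of that "first n words" fallback, assuming the extracted
content is already available as a plain string. The function name
first_words_title and the default of 10 words are illustrative choices,
not anything Solr provides.

```python
def first_words_title(text, n=10):
    """Build a fallback title from the first n words of the extracted text.

    An ellipsis is appended only when the text was actually truncated.
    """
    words = text.split()  # splits on any whitespace, drops empty tokens
    if len(words) <= n:
        return " ".join(words)
    return " ".join(words[:n]) + "..."
```

The result could then be sent to Solr in its own field alongside the
content when the document is indexed.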

If you use a clustering algorithm that makes a guess at a name for a
cluster, you will get a list of names or categories, not something that
most people would think of as a title.

This really doesn't strike me (yet) as a Solr problem. The problem is
what info there is in these documents and how you can derive a title (or
some form of summary?) from them. 

If they are all Word documents, do they start with a "Heading" style? In
which case you could extract that. As I say, most likely this will have
to be done outside of Solr.
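
For the Word case, a sketch of what extracting the heading could look
like outside of Solr. It assumes .docx files and assumes that the
paragraph style IDs start with "Title" or "Heading", which holds for the
default English styles but may differ in other templates. The function
name docx_first_heading is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_first_heading(docx_file, style_prefixes=("Title", "Heading")):
    """Return the text of the first title/heading paragraph, or None.

    style_prefixes must be a tuple of style-ID prefixes (an assumption
    about how the documents are styled).
    """
    with zipfile.ZipFile(docx_file) as package:
        body_xml = package.read("word/document.xml")
    root = ET.fromstring(body_xml)
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        style = paragraph.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue  # paragraph uses the default style
        style_id = style.get(f"{{{W_NS}}}val", "")
        if style_id.startswith(style_prefixes):
            # A paragraph's text may be split across several runs.
            text = "".join(
                t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")
            ).strip()
            if text:
                return text
    return None
```

If no styled heading is found the function returns None, and one of the
other approaches above would be needed instead.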
 
Upayavira

Re: Assign rich-text document's title name from clustering results

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
The main objective here is to assign a title to the documents as they are
being indexed.

We found that the cluster labels provide good information on the key
points of the documents, but I'm not sure if we can get good cluster
labels from a single document.

Besides getting it from cluster labels, are there other methods we can
use to assign a title?


Regards,
Edwin


Re: Assign rich-text document's title name from clustering results

Posted by Alessandro Benedetti <be...@gmail.com>.
Hi Edwin,
let's do this step by step.

Clustering is a problem solved by unsupervised machine learning algorithms.
The goal of clustering is to group a corpus of documents by similarity,
trying to produce groups that are meaningful to a human being.
Solr currently provides different approaches for *Query Time Clustering*
(also known as Online Clustering).
There's an out-of-the-box integration that allows you to use clustering at
query time on the query results.
Different algorithms can be selected, mainly provided by Carrot2.

These algorithms also provide a guess for the cluster name.

Given this introduction, let me look at your problem.

1) The first part can be solved with a custom UpdateProcessor that will
process the document and add the automatically generated title.
Now the problem is: how do we want to extract this new title?
Honestly, I cannot understand how clustering fits here …

2) Index-time clustering is not yet provided in Solr (I remember there was
only an interface ready, but no implementation).
You should cluster the content before indexing it in Solr, using a machine
learning library.
Index-time clustering is delicate. What will happen at the next re-index?
Should we cluster everything again?
This topic needs more investigation.

Anyway, let me know, as the original problem may not require clustering.

Cheers




