Posted to dev@stanbol.apache.org by Mihály Héder <he...@gmail.com> on 2012/09/03 15:44:10 UTC

Introducing BookSpotter Enhancement Engine by Sztaki

Hi!

Let me introduce the BookSpotter Enhancement Engine by Sztaki:
http://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki/

BookSpotter uses a selection of 5.6M titles from the British National
Bibliography (BNB) and the Open Library (OL).
It scans the incoming text for titles and, if the author is also
mentioned, produces the corresponding entity annotations, which refer
to the matching resource URIs in either BNB or OL.
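
The matching rule described above (a known title plus a mention of its author) can be sketched in Java roughly as follows. This is only an illustration under simple assumptions, not the actual BookSpotter code: the class, the `Book` fields, the normalization and the `urn:example:` URIs are all hypothetical, and the author check is a naive substring test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TitleSpotterSketch {

    /** Hypothetical catalogue record; the field names are assumptions. */
    static final class Book {
        final String title;
        final String author;
        final String uri;

        Book(String title, String author, String uri) {
            this.title = title;
            this.author = author;
            this.uri = uri;
        }
    }

    // Normalized title -> all catalogue records carrying that title.
    private final Map<String, List<Book>> byTitle = new HashMap<>();
    private int maxTitleTokens = 1;

    /** Lower-cases and collapses each run of non-letter, non-digit characters into one space. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
    }

    void add(Book book) {
        String key = normalize(book.title);
        byTitle.computeIfAbsent(key, k -> new ArrayList<>()).add(book);
        maxTitleTokens = Math.max(maxTitleTokens, key.split(" ").length);
    }

    /** Returns the URIs of books whose title occurs in the text and whose author is mentioned too. */
    List<String> spot(String text) {
        String normalized = normalize(text);
        String[] tokens = normalized.split(" ");
        List<String> uris = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder window = new StringBuilder();
            // Grow a token window from this start position and look each window up as a title.
            for (int len = 1; len <= maxTitleTokens && start + len <= tokens.length; len++) {
                if (len > 1) window.append(' ');
                window.append(tokens[start + len - 1]);
                List<Book> candidates =
                        byTitle.getOrDefault(window.toString(), Collections.emptyList());
                for (Book b : candidates) {
                    // Naive author check: the normalized author must occur somewhere in the text.
                    if (normalized.contains(normalize(b.author))) uris.add(b.uri);
                }
            }
        }
        return uris;
    }
}
```

A title without its author produces no annotation in this sketch, which mirrors the "author is also mentioned" condition.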

You can check the system out here:
http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter
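
To call the chain from code rather than from the browser: the Stanbol enhancer takes the text to analyse as the body of an HTTP POST and returns the enhancements as RDF. A minimal Java sketch, assuming the demo server above is still reachable and that it accepts `text/plain` input and can answer with `text/turtle`; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

final class BookSpotterClientSketch {

    static final String CHAIN_URL = "http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter";

    /** Builds the POST request without sending it; no network traffic happens here. */
    static HttpURLConnection prepare(String chainUrl) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(chainUrl).openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
            con.setRequestProperty("Accept", "text/turtle"); // assumed response format
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the text to the chain and returns the raw response body. */
    static String enhance(String chainUrl, String text) throws IOException {
        HttpURLConnection con = prepare(chainUrl);
        try (OutputStream out = con.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner in = new Scanner(con.getInputStream(), "UTF-8")) {
            // "\\A" makes the scanner return the whole stream as a single token.
            return in.useDelimiter("\\A").hasNext() ? in.next() : "";
        }
    }
}
```

Calling `BookSpotterClientSketch.enhance(BookSpotterClientSketch.CHAIN_URL, "some text mentioning a book and its author")` would then return the annotations, provided the server responds.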

Thanks to the Early Adopter Program, I was able to buy some student
work hours for data cleaning and for some basic testing.
You might want to read the report on our test set of 25 tests:
http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf

For details, see the blog post!

Any comments are much appreciated!
Cheers,
Mihály

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by harish suvarna <hs...@gmail.com>.

Thanks for all the answers. I was trying to learn from you how much of
the 'spotting algorithm' (into taxonomies/ontologies/datastores) needs to be
customized. Each domain has its own heuristics to add.

-harish

On Thu, Sep 6, 2012 at 8:46 AM, Mihály Héder <he...@gmail.com> wrote:

> Hi!
>
> On 5 September 2012 19:03, harish suvarna <hs...@gmail.com> wrote:
>
> > Hi,
> > Nice work and thanks for sharing.
> >
> > You had quite a good store of book titles of around 5.6million. Why is it
> > that the recall is around 50%.?
> >
>
> Well this 5.6M is a rather small set. No one knows the total number books
> ever written, but google estimates (conservatively) that it is at least 130
> million [1].
>
> And as you can imagine there is a long tail effect if we talk about how
> well known certain books are. This is why you won't easily cover, say 90%
> of the books with even a 50M data set.
> The 5.6 million set is the smallest one I experimented with - I like this
> size because it is easy to handle. To tell you the truth I was quite happy
> with the 50% :)
>
> Anyway, in the long run, it would be much more important to include book
> sets for different languages. Of course, both BNB and OL has some foreign
> titles but they are mostly for English.
>
>
> > Are the dropped titles (60-28-13=19) missing in the book bank?
> >
> Most of them are missing, some of them are dropped because the author is
> not mentioned (explicitly).
>
>
> > Are you
> > trying any more heuristics to reduce the false positives?
> >
>
> The number of false positives is not a really good marker: the associated
> confidence measure of those annotations is even more important. There is no
> real problem with a false positive that has 0.001 confidence. We should
> have displayed that info (next time).
>
> Anyway, there are two things on my agenda:
> 1) restricting by author names. This is a typical false positive from text
> 22: http://openlibrary.org/works/OL15987840W/New_Haven
> It is marked as found (confidence 0.2) because both some parts of the title
> and the author can be found (New Haven Area Heritage Association: New
> Haven). That is a dumb thing to do because: a) the author includes the
> title b) the author and the title occurrence overlap. This can be fixed
> easily.
>
> 2) better understanding of role of order and the token distance between
> author and title. I will probably experiment with different numbers and see
> how the test results change.
>
> These will happen in the next couple of weeks. Will let you know about the
> results.
>
> Cheers
> Mihály
>
> > Thanks,
> > Harish
> >
>
> [1]
>
> http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html
>
>



-- 
Thanks
Harish

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by Mihály Héder <he...@gmail.com>.

Hi!

On 5 September 2012 19:03, harish suvarna <hs...@gmail.com> wrote:

> Hi,
> Nice work and thanks for sharing.
>
> You had quite a good store of book titles of around 5.6million. Why is it
> that the recall is around 50%.?
>

Well, this 5.6M is a rather small set. No one knows the total number of books
ever written, but Google estimates (conservatively) that it is at least 130
million [1].

And as you can imagine, there is a long-tail effect in how well known
certain books are. This is why you won't easily cover, say, 90%
of the books even with a 50M data set.
The 5.6 million set is the smallest one I experimented with - I like this
size because it is easy to handle. To tell you the truth, I was quite happy
with the 50% :)

Anyway, in the long run, it would be much more important to include book
sets for different languages. Of course, both BNB and OL have some foreign
titles, but they mostly cover English.

> Are the dropped titles (60-28-13=19) missing in the book bank?
>
Most of them are missing, some of them are dropped because the author is
not mentioned (explicitly).

> Are you
> trying any more heuristics to reduce the false positives?
>

The number of false positives alone is not a very good indicator: the
confidence value attached to those annotations is even more important. There
is no real problem with a false positive that has 0.001 confidence. We should
have displayed that info (we will next time).
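
On the consumer side, this usually means filtering annotations by confidence before using them. A minimal Java sketch; the `Annotation` class and the 0.1 cut-off in the usage note are assumptions made for illustration, not something BookSpotter prescribes.

```java
import java.util.ArrayList;
import java.util.List;

final class ConfidenceFilterSketch {

    /** Hypothetical, simplified view of one entity annotation. */
    static final class Annotation {
        final String uri;
        final double confidence;

        Annotation(String uri, double confidence) {
            this.uri = uri;
            this.confidence = confidence;
        }
    }

    /** Keeps only the annotations whose confidence is at least minConfidence. */
    static List<Annotation> keepConfident(List<Annotation> all, double minConfidence) {
        List<Annotation> kept = new ArrayList<>();
        for (Annotation a : all) {
            if (a.confidence >= minConfidence) kept.add(a);
        }
        return kept;
    }
}
```

With a cut-off of 0.1, a false positive at 0.001 confidence disappears, while one at 0.2 (like the New Haven case) is still reported and needs a better heuristic.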

Anyway, there are two things on my agenda:
1) Restricting by author names. This is a typical false positive from text
22: http://openlibrary.org/works/OL15987840W/New_Haven
It is marked as found (confidence 0.2) because both some parts of the title
and the author can be found (New Haven Area Heritage Association: New
Haven). That is a dumb thing to do because: a) the author includes the
title, and b) the author and the title occurrences overlap. This can be fixed
easily.

2) A better understanding of the role of order and the token distance between
author and title. I will probably experiment with different numbers and see
how the test results change.
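
Both agenda items come down to simple checks on the positions of the title and author matches. A minimal Java sketch, assuming matches are represented as token spans with an exclusive end; the class, the method names and the distance limit are hypothetical, and nothing here is taken from the BookSpotter code.

```java
import java.util.Locale;

final class AuthorTitleProximitySketch {

    /** Token span of a match in the text: start inclusive, end exclusive. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Case a): the author string itself contains the title. */
    static boolean authorContainsTitle(String author, String title) {
        return author.toLowerCase(Locale.ROOT).contains(title.toLowerCase(Locale.ROOT));
    }

    /** Case b): the two occurrences share at least one token. */
    static boolean overlaps(Span a, Span b) {
        return a.start < b.end && b.start < a.end;
    }

    /** Number of tokens between the two spans, in either order; 0 if they touch or overlap. */
    static int tokenDistance(Span a, Span b) {
        if (overlaps(a, b)) return 0;
        return a.end <= b.start ? b.start - a.end : a.start - b.end;
    }

    /** Accepts a title/author pair only if the spans are disjoint and close enough. */
    static boolean accept(Span title, Span author, int maxDistance) {
        return !overlaps(title, author) && tokenDistance(title, author) <= maxDistance;
    }
}
```

In the New Haven example the title match lies inside the author match, so `overlaps` is true and the pair is rejected regardless of the distance limit. The limit itself is the number to experiment with.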

These will happen in the next couple of weeks. I will let you know about the
results.

Cheers
Mihály

> Thanks,
> Harish
>

[1]
http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html


Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by harish suvarna <hs...@gmail.com>.

Hi,
Nice work and thanks for sharing.

You have quite a good store of book titles, around 5.6 million. Why is it
that the recall is around 50%?
Are the dropped titles (60-28-13=19) missing in the book bank? Are you
trying any more heuristics to reduce the false positives?
Thanks,
Harish


On Wed, Sep 5, 2012 at 2:22 AM, Fabian Christ
<ch...@googlemail.com>wrote:

> Hi,
>
> nice engine ;) Thanks for sharing!
>
> Best,
>  - Fabian
>



-- 
Thanks
Harish

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by Fabian Christ <ch...@googlemail.com>.

Hi,

nice engine ;) Thanks for sharing!

Best,
 - Fabian

2012/9/3 Anuj Kumar <an...@gmail.com>:
> That's great! Thanks for the info.
>
> Regards,
> Anuj
>



-- 
Fabian
http://twitter.com/fctwitt

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by Anuj Kumar <an...@gmail.com>.

That's great! Thanks for the info.

Regards,
Anuj

On Mon, Sep 3, 2012 at 8:49 PM, Mihály Héder <he...@gmail.com> wrote:

> Hi!
>
> Sure, the 5.6M titles in a HashMap take about 1.3-1.5 G ram, so I run
> the whole stanbol with -Xmx2500M without issues.
>
> In earlier iterations I have used ehcache + sophisticated custom hit
> and miss handlers to save memory, but I had to realize that it creates
> more performance issues than it solves in everyday setups, to I gave
> up on that.
>
> Cheers
> Mihály
>

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by Mihály Héder <he...@gmail.com>.

Hi!

Sure, the 5.6M titles in a HashMap take about 1.3-1.5 GB of RAM, so I run
the whole Stanbol instance with -Xmx2500M without issues.
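
As a back-of-the-envelope check, these figures work out to roughly 250 to 290 bytes per title, which is plausible for a Java String key plus HashMap entry overhead. The Java sketch below does the arithmetic and shows how one might presize the map to avoid rehashing while loading; it assumes "G" means GiB, and the helper names are hypothetical.

```java
final class TitleMapMemorySketch {

    /** Average heap bytes per title, given the total map footprint in GiB. */
    static long bytesPerTitle(double totalGiB, long titles) {
        return Math.round(totalGiB * 1024 * 1024 * 1024 / titles);
    }

    /**
     * Initial capacity that lets a HashMap hold expectedEntries without resizing,
     * based on HashMap's default load factor of 0.75.
     */
    static int initialCapacity(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    public static void main(String[] args) {
        System.out.println(bytesPerTitle(1.3, 5_600_000L) + " to "
                + bytesPerTitle(1.5, 5_600_000L) + " bytes per title");
        // HashMap rounds the requested capacity up to the next power of two internally.
        java.util.Map<String, String> titles =
                new java.util.HashMap<>(initialCapacity(5_600_000));
        System.out.println("map ready, size " + titles.size());
    }
}
```

Running `main` itself needs a heap large enough for the presized table; the two helpers do not allocate anything.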

In earlier iterations I used ehcache plus sophisticated custom hit
and miss handlers to save memory, but I had to realize that this creates
more performance issues than it solves in everyday setups, so I gave
up on that.

Cheers
Mihály

On 3 September 2012 15:58, Anuj Kumar <an...@gmail.com> wrote:
> Hi Mihály,
>
> Thanks a lot for sharing this. Looks good.
>
> I was curious to know the memory requirements to load the 5.6million titles
> and the whole system to run. If you have any stats, can you please share
> that?
>
> Regards,
> Anuj
>

Re: Introducing BookSpotter Enhancement Engine by Sztaki

Posted by Anuj Kumar <an...@gmail.com>.

Hi Mihály,

Thanks a lot for sharing this. Looks good.

I was curious to know the memory requirements to load the 5.6 million titles
and to run the whole system. If you have any stats, could you please share
them?

Regards,
Anuj
