Posted to dev@stanbol.apache.org by Srecko Joksimovic <sr...@gmail.com> on 2012/03/12 18:03:11 UTC

Indexing and searching using Apache Stanbol

Hi,

 

So far I have developed a few applications for annotating documents using
Apache Stanbol. Now I need to add indexing and search capabilities.

I tried the ContentHub
(http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min):
I started the full launcher and accessed the web interface. It offers a few
options: provide text, upload a document, or provide a URI. I tried uploading
a few txt documents. I didn't get any extracted entities, but search (using
the Web View) worked fine. The next step was to upload PDF documents, and I
got extracted entities grouped into People, Places, and Concepts categories.
The document also appeared in the list of recently uploaded documents, but I
couldn't find any term from it via search.

 

I suppose I will have to provide a stream from PDF (or any other kind of)
documents and index it like text? I need all the mentioned functionalities
(indexing text, documents, and URIs) from a Java application, and I would
appreciate a code example, if one is available.

 

Thank you!

 

Srecko Joksimovic


Re: Indexing and searching using Apache Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
Let me just add one additional bit of information to that:

If you change the alias of the "Apache Stanbol Web Application", this will NOT affect the path of the published Solr servers ("{host}/solr" by default).

To change this you will also need to change the configuration of the "SolrServerPublishingComponent" to "/{alias}/solr/" (property: org.apache.stanbol.commons.solr.web.dispatchfilter.prefix).

Note that older Stanbol versions also included a configuration for the "SolrDispatchFilterComponent" (search for "Dispatch Filter Configuration" in the configuration tab). If you find such a configuration you can safely remove it, as it just duplicates the functionality provided by the above. If you do not remove this configuration, the Solr indexes might be available both with and without the {alias}.
(Stanbol versions based on a revision < 1299616 might be affected by this.)
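
For example, such a configuration could look like the following sketch (the
property name is the one above; the "/testing" alias value and the .config
file approach are assumptions - the same property can also be set through
the Felix Web Console):

    # hypothetical OSGi configuration for the SolrServerPublishingComponent,
    # assuming the Web Application alias was changed to "/testing"
    org.apache.stanbol.commons.solr.web.dispatchfilter.prefix="/testing/solr/"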

best
Rupert

On 13.03.2012, at 13:58, ajs6f@virginia.edu wrote:

> Yes, you did: {grin}
> 
> http://markmail.org/message/xdrcxwkuwgo3u65d
> 
> ---
> A. Soroka
> Software & Systems Engineering :: Online Library Environment
> the University of Virginia Library


Re: Indexing and searching using Apache Stanbol

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.
Yes, you did: {grin}

http://markmail.org/message/xdrcxwkuwgo3u65d

---
A. Soroka
Software & Systems Engineering :: Online Library Environment
the University of Virginia Library

On Mar 13, 2012, at 8:43 AM, srecko joksimovic wrote:

> Hi,
> 
> I forgot to mention it, but I think I have posted this question before.
> Anyway, is it possible to configure Stanbol to run at
> http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?
> 
> Because of company policy I need to define the application URL, and it must
> look something like http://xxx.xxx.xxx.xxx:9999/testing/. That means
> (for example) that I need to have:
> 
> http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine, instead of
> http://localhost:9999/enhancer/engine.
> 
> Best,
> Srecko


Re: Indexing and searching using Apache Stanbol

Posted by srecko joksimovic <sr...@gmail.com>.
Hi,

I forgot to mention it, but I think I have posted this question before.
Anyway, is it possible to configure Stanbol to run at
http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?

Because of company policy I need to define the application URL, and it must
look something like http://xxx.xxx.xxx.xxx:9999/testing/. That means
(for example) that I need to have:

http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine, instead of
http://localhost:9999/enhancer/engine.

Best,
Srecko


Re: Indexing and searching using Apache Stanbol

Posted by srecko joksimovic <sr...@gmail.com>.
Hi,

Ok, looks like I didn't understand that. It's clear now.

Thank you.

Srecko


Re: Indexing and searching using Apache Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
<sr...@gmail.com> wrote:
> Hi Rupert,
>
> and thank you for the answer. I need to read a few more things, but the answer
> helped me a lot.

great!

> If I understood correctly, the search is case sensitive, and if I need
> case-insensitive search, I will have to implement application-specific logic?
>

Keyword searches via the Contenthub, and Solr queries on the field
"text_all", are case insensitive!

Only searches on the fields "organizations_t", "people_t" and
"places_t" are case sensitive. However, I would consider this a bug,
and the comment (**) in my previous mail suggests correcting it.
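
To illustrate, a sketch following the query pattern from the previous mail
({host} and the field values are placeholders):

    # case insensitive: "text_all" is tokenized and lower-cased
    {host}/solr/default/contenthub/select?q=text_all:stanford
    # case sensitive: "people_t" is a plain string field, so only the
    # exact stored casing matches
    {host}/solr/default/contenthub/select?q=people_t:Stanford*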


best
Rupert

> Best,
> Srecko



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Indexing and searching using Apache Stanbol

Posted by srecko joksimovic <sr...@gmail.com>.
Hi Rupert,

and thank you for the answer. I need to read a few more things, but the
answer helped me a lot.
If I understood correctly, the search is case sensitive, and if I need
case-insensitive search, I will have to implement application-specific logic?

Best,
Srecko

On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Srecko, all
>
> @Stanbol developers: Note the (*) and (**) comments at the end of this mail.
>
> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> <sr...@gmail.com> wrote:
> >
> > So far I have developed a few applications for annotating documents
> > using Apache Stanbol. Now I need to add indexing and search capabilities.
> >
> > I tried the ContentHub
> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min):
> > I started the full launcher and accessed the web interface. It offers
> > a few options: provide text, upload a document, or provide a URI. I
> > tried to upload a few txt documents. I didn't get any extracted entities,
>
> The Contenthub shows the number of extracted enhancements. This can
> easily be used as an indicator of whether the Stanbol Enhancer was able
> to extract knowledge from the parsed content.
>
> Typical reasons for not getting expected enhancement results are:
>
> 1. unsupported content type: The current version of Apache Stanbol
> uses the [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> to process non-plain-text content passed to the Stanbol
> Enhancer/Contenthub. So everything that is covered by Apache Tika
> should also work just fine with Apache Stanbol.
>
> 2. unsupported language: Some enhancement engines (e.g. NER - Named
> Entity Recognition) only support some languages. If the parsed content
> is in another language, they will not be able to process it. With the
> default configuration of Stanbol only English (and in the newest
> version Spanish and Dutch) documents will work. Users with custom
> configurations will also be able to process documents in other
> languages.
>
> > but search (using the Web View) worked fine.
>
> This is because the Contenthub also supports full text search over the
> parsed content. (*)
>
> > The next step was to upload PDF documents, and I got extracted
> > entities grouped into People, Places, and Concepts categories. The
> > document also appeared in the list of recently uploaded documents,
> > but I couldn't find any term from it via search.
> >
>
> Based on your request I tried the following (with the default
> configuration of the Full launcher).
> NOTE: this excludes the possibility of creating your own search index
> by using LDPath.
>
> 1) upload some files to the content hub
>
>    * file upload worked (some scientific papers from the local HD)
>    * URL upload worked (some technical blogs + comments)
>    * pasting text worked (some of the examples included for the enhancer)
>    * based on the UI I got > 100 enhancements for all tested PDFs
>
> 2) test of the contenthub search
>
>    * keyword search also worked for me
>
> 3) direct solr searches on {host}/solr/default/contenthub/ (*)
>
>    * searches like
> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> worked fine. Note that searches are case sensitive (**)
>    * I think the keyword search uses the "text_all" field. So queries
> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> return the same values as the UI of the Contenthub. This field
> basically supports full-text search.
>    * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> *_workinstitutions ...) were missing. I think this is expected,
> because such fields require a DBpedia index with the required
> fields. (A Java sketch of such queries follows below.)
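
The queries above can also be issued from Java; a minimal sketch using
SolrJ (an assumption - the thread only shows raw HTTP queries; the port
8080 default and the SolrJ 3.x API are likewise assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class ContenthubQuery {
        public static void main(String[] args) throws Exception {
            // points at the Solr core published by the Stanbol launcher
            SolrServer server = new CommonsHttpSolrServer(
                    "http://localhost:8080/solr/default/contenthub");
            // full text search on the case-insensitive "text_all" field
            QueryResponse rsp = server.query(new SolrQuery("text_all:stanford"));
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc); // prints all stored fields of each hit
            }
        }
    }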
>
>
> >
> > I suppose I will have to provide a stream from PDF (or any other
> > kind of) documents and index it like text? I need all the mentioned
> > functionalities (indexing text, documents, and URIs) from a Java
> > application, and I would appreciate a code example, if one is
> > available.
> >
>
> I think parsing of URIs is currently not possible via the RESTful
> API. For using the RESTful services I would recommend the Apache
> HttpComponents client. Code examples on how to build requests can be
> found at
>
> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
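
A minimal sketch of such a request (the store path "/contenthub/content/{id}"
is a hypothetical example - check the Contenthub REST documentation for the
exact endpoint; HttpClient 4.x as covered by the tutorial above):

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class ContenthubUpload {
        public static void main(String[] args) throws Exception {
            HttpClient client = new DefaultHttpClient();
            // hypothetical store endpoint; verify against your launcher
            HttpPost post = new HttpPost(
                    "http://localhost:8080/contenthub/content/my-test-doc");
            post.setHeader("Content-Type", "text/plain");
            // the uploaded content gets enhanced and indexed by the Contenthub
            post.setEntity(new StringEntity("Paris is the capital of France.", "UTF-8"));
            HttpResponse response = client.execute(post);
            System.out.println(response.getStatusLine());
            EntityUtils.consume(response.getEntity()); // release the connection
        }
    }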
>
>
> best
> Rupert
>
> Comments intended for Stanbol Developers:
> -----
>
> (*) Normally I would expect the SolrIndex to only include the plain
> text version of the parsed content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case then it will get fixed with STANBOL-471 in any case.
>
> I also noted that "stanbolreserved_content" currently stores the
> content as passed to the Contenthub, but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that) then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
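
For reference, a sketch of that call in context (getBlob is cited above;
the getText helper and the wrapping method are assumptions):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Map.Entry;

    import org.apache.clerezza.rdf.core.UriRef;
    import org.apache.stanbol.enhancer.servicesapi.Blob;
    import org.apache.stanbol.enhancer.servicesapi.ContentItem;
    import org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper;

    public final class PlainTextUtil {
        // returns the plain text view of an enhanced ContentItem, or null
        // if no "text/plain" blob is present
        static String getPlainText(ContentItem ci) throws IOException {
            Entry<UriRef, Blob> textBlob = ContentItemHelper.getBlob(
                    ci, Collections.singleton("text/plain"));
            return textBlob == null ? null
                    : ContentItemHelper.getText(textBlob.getValue());
        }
    }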
>
> (**) All *_t fields use string as the field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter, and lower case).
>
>
> best
> Rupert
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Indexing and searching using Apache Stanbol

Posted by Ali Anil Sinaci <a....@gmail.com>.
On 03/13/2012 12:46 PM, Rupert Westenthaler wrote:
>
> Comments intended for Stanbol Developers:
> -----
>
> (*) Normally I would expect the SolrIndex to only include the plain
> text version of the parsed content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case then it will get fixed with STANBOL-471 in any case.
>
> I also noted that "stanbolreserved_content" currently stores the
> content as passed to the Contenthub, but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that) then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
>
> (**) All *_t fields use string as the field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter, and lower case).
>
>
> best
> Rupert
>

Hi,

I want to give some information about my last commit. It applies some 
changes to the default Contenthub index on Solr.

(*) "stanbolreserved_content" indexes the text content of the document, 
but not stored.
(*) "stanbolreserved_binarycontent" only stores the binary content, not 
indexed.
STANBOL-471 will most probably remove these issues. For demo purposes, 
we may continue to store the not-indexed binary content.

(**) "*_t" continues with the "string" type because we want to provide 
the faceted search with the name of extracted entities in the web GUI of 
Contenthub. Therefore, it is stored and indexed.
(**) "*_i" is added to the schema with "text_ws" type. This type uses 
the WhitespaceTokenizerFactory, WordDelimiterFilterFactory, 
LowerCaseFilterFactory and RemoveDuplicatesTokenFilterFactory of Solr. 
So, for dynamic fields if "x_t" exists, "x_i" also exists. This field is 
neither stored nor copied to "stanbolreserved_text_all". (BTW, I renamed 
"text_all" --> "stanbolreserved_text_all").
(**) Since "*_t" fields are being copied to "stanbolreserved_text_all", 
values of these fields are indexed through "text_general" type. If you 
want to search on a specific field, you can use the one ends with "_i" 
instead of "_t".

Best,
Anil.

Re: Indexing and searching using Apache Stanbol

Posted by Ali Anil Sinaci <a....@gmail.com>.
On 03/13/2012 02:34 PM, Rupert Westenthaler wrote:
> I am not sure about text_general because it is specific to the English
> language. So if you can ensure that such labels will all be English,
> then it might still be ok, but otherwise I would prefer a
> non-language-specific field such as "text_ws".
>
> We might also want to consider using
>
> * ICUTokenizerFactory instead of the WhitespaceTokenizerFactory, to
> also cover languages that do not use whitespace to separate words.
> * ICUFoldingFilterFactory (a combination of ASCIIFoldingFilter,
> LowerCaseFilter, and ICUNormalizer2Filter)
>
>
> That brings me to another question: How does the Contenthub currently
> deal with internationalization?
>
> best
> Rupert
>
>
The default Contenthub index uses "text_general"; this field type applies
the following language-specific operations in addition to the generic ones:
* StopFilterFactory (the default Contenthub comes with English and German
stopwords)
* SnowballPorterFilterFactory for English.

So, to remove the dependency on English, we can replace "text_general"
with "text_ws".

Furthermore, the LDPath integration of Contenthub currently does not
consider the language tags inside LDPath programs. It only resolves the
default XSD types. We can add this to our todo list: considering the
language tags inside the programs (if they exist) while determining the
type of the Solr fields.

Best,
Anil.


Re: Indexing and searching using Apache Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Tue, Mar 13, 2012 at 1:13 PM, Suat Gonul <su...@gmail.com> wrote:
> Ok, thanks for this suggestion. Indeed, it might be better to set it to
> "text_general". WDYT?
>
I am not sure about text_general because it is specific to the English
language. So if you can ensure that such labels will all be English,
then it might still be ok, but otherwise I would prefer a
non-language-specific field such as "text_ws".

We might also want to consider using

* ICUTokenizerFactory instead of the WhitespaceTokenizerFactory, to
also cover languages that do not use whitespace to separate words.
* ICUFoldingFilterFactory (a combination of ASCIIFoldingFilter,
LowerCaseFilter, and ICUNormalizer2Filter); a schema sketch follows below.
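
A sketch of what such an ICU-based field type could look like (an
assumption, not a committed change; both factories require Solr's
analysis-extras contrib):

    <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>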


That brings me to another question: How does the Contenthub currently
deal with internationalization?

best
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Indexing and searching using Apache Stanbol

Posted by Suat Gonul <su...@gmail.com>.
Hi Rupert, all,

First of all, thanks for your feedback.

On 03/13/2012 12:46 PM, Rupert Westenthaler wrote:
> Hi Srecko, all
>
> @Stanbol developers: Note the (*) and (**) comments at the end of this mail
>
> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> <sr...@gmail.com> wrote:
>> Until now I have developed few applications for annotating documents using
>> Apache Stanbol. Now I need to add indexing and search capabilities.
>>
>> I tried ContentHub
>> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> in the way that I started full launcher and access web interface. There are
>> few possibilities: to provide text, to upload document, to provide an URI… I
>> tried to upload a few txt documents. I didn’t get any extracted entities,
> The content hub shows the number of extracted enhancements. This can
> easily be used as an indicator of whether the Stanbol Enhancer was able
> to extract knowledge from the parsed content.
>
> Typical reasons for not getting expected enhancement results are:
>
> 1. unsupported content type: The current version of Apache Stanbol
> uses the [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> to process non-plain-text content parsed to the Stanbol
> Enhancer/Contenthub. So everything that is covered by Apache Tika
> should also work just fine with Apache Stanbol.
>
> 2. unsupported language: Some Enhancement Engines (e.g. NER - Named
> Entity Recognition) only support some languages. If the parsed content
> is in another language, they will not be able to process it. With the
> default configuration of Stanbol, only English (and, in the newest
> version, Spanish and Dutch) documents will work. Users with custom
> configurations will also be able to process documents in other
> languages.
>
>> but search (using Web View) worked fine.
> This is because the Contenthub also supports full text search over the
> parsed content. (*)
>
>> Another step was to upload pdf
>> documents and I got extracted entities grouped by People, Places Concepts
>> categories. It was also in the list of recently uploaded documents, but I
>> couldn’t find any term from that document.
>>
> Based on your request I tried the following (with the default
> configuration of the Full launcher).
> NOTE: this excludes the possibility of creating your own search index
> using LDPath.
>
> 1) upload some files to the content hub
>
>     * file upload worked (some scientific papers from the local HD)
>     * URL upload worked (some technical blogs + comments)
>     * pasting text worked (some of the examples included for the enhancer)
>     * based on the UI I got > 100 enhancements for all tested PDFs
>
> 2) test of the contenthub search
>
>     * keyword search also worked for me
>
> 3) direct Solr searches on {host}/solr/default/contenthub/ (*)
>
>     * searches like
> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> worked fine. Note that searches are case sensitive (**)
>     * I think the keyword search uses the "text_all" field. So queries
> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> return the same values as the UI of the content hub. This field
> basically supports full text search.
>     * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> *_workinstitutions ...) were missing. I think this is expected,
> because such fields require a DBpedia index with the required
> fields.
>
>
>> I suppose that I will have to provide a stream from pdf (or any other kind)
>> documents and to index it like text? I need all mentioned functionalities
>> (index text, docs, URIs…) using Java application and I would appreciate a
>> code example, if it is available, please.
>>
> I think parsing of URIs is currently not possible via the RESTful
> API. For using the RESTful services I would recommend the Apache
> HttpComponents client. Code examples showing how to build requests
> can be found at
> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>
>
> best
> Rupert
>
> Comments intended for Stanbol Developers:
> -----
>
> (*) Normally I would expect the SolrIndex to only include the plain
> text version of the parsed content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case, then it will get fixed with STANBOL-471 in any case.
>
Exactly.

> I also noted that "stanbolreserved_content" currently stores the
> content as parsed to the content hub but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that), then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem,Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
Currently, all fields are copied to "text_all" and indexed within this
field. But most of these fields (such as "stanbolreserved_content") are
also indexed themselves, which is wrong, as you pointed out. Furthermore,
binary content has not been considered at all in the Solr index. We will
find short-term solutions as soon as possible and leave the actual fix to
the implementation of STANBOL-471.

> (**) All *_t fields use string as field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter and lower case)
>
>

Ok, thanks for this suggestion. Indeed, it might be better to set it to
"text_general". WDYT?

Best,
Suat

> best
> Rupert
>


Re: Indexing and searching using Apache Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Srecko, all

@Stanbol developers: Note the (*) and (**) comments at the end of this mail

On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
<sr...@gmail.com> wrote:
>
> Until now I have developed few applications for annotating documents using
> Apache Stanbol. Now I need to add indexing and search capabilities.
>
> I tried ContentHub
> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> in the way that I started full launcher and access web interface. There are
> few possibilities: to provide text, to upload document, to provide an URI… I
> tried to upload a few txt documents. I didn’t get any extracted entities,

The content hub shows the number of extracted enhancements. This can
easily be used as an indicator of whether the Stanbol Enhancer was able
to extract knowledge from the parsed content.

Typical reasons for not getting expected enhancement results are:

1. unsupported content type: The current version of Apache Stanbol
uses the [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
to process non-plain-text content parsed to the Stanbol
Enhancer/Contenthub. So everything that is covered by Apache Tika
should also work just fine with Apache Stanbol.

2. unsupported language: Some Enhancement Engines (e.g. NER - Named
Entity Recognition) only support some languages. If the parsed content
is in another language, they will not be able to process it. With the
default configuration of Stanbol, only English (and, in the newest
version, Spanish and Dutch) documents will work. Users with custom
configurations will also be able to process documents in other
languages.

> but search (using Web View) worked fine.

This is because the Contenthub also supports full text search over the
parsed content. (*)

>Another step was to upload pdf
> documents and I got extracted entities grouped by People, Places Concepts
> categories. It was also in the list of recently uploaded documents, but I
> couldn’t find any term from that document.
>

Based on your request I tried the following (with the default
configuration of the Full launcher).
NOTE: this excludes the possibility of creating your own search index
using LDPath.

1) upload some files to the content hub

    * file upload worked (some scientific papers from the local HD)
    * URL upload worked (some technical blogs + comments)
    * pasting text worked (some of the examples included for the enhancer)
    * based on the UI I got > 100 enhancements for all tested PDFs

2) test of the contenthub search

    * keyword search also worked for me

3) direct Solr searches on {host}/solr/default/contenthub/ (*)

    * searches like
"{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
worked fine. Note that searches are case sensitive (**)
    * I think the keyword search uses the "text_all" field. So queries
for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
return the same values as the UI of the content hub. This field
basically supports full text search.
    * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
*_workinstitutions ...) were missing. I think this is expected,
because such fields require a DBpedia index with the required
fields.
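
As an aside, a minimal Java sketch of such a direct Solr query using
Apache HttpClient 4.x might look as follows (host and port are
assumptions for a local Full launcher):

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class DirectSolrSearch {
        public static void main(String[] args) throws Exception {
            HttpClient client = new DefaultHttpClient();
            // same query as in the list above; note that values are case
            // sensitive as long as the *_t fields use the string type (**)
            HttpGet get = new HttpGet("http://localhost:8080/solr/default"
                    + "/contenthub/select?q=organizations_t:Stanford*&wt=json");
            HttpResponse response = client.execute(get);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }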


>
> I suppose that I will have to provide a stream from pdf (or any other kind)
> documents and to index it like text? I need all mentioned functionalities
> (index text, docs, URIs…) using Java application and I would appreciate a
> code example, if it is available, please.
>

I think parsing of URIs is currently not possible via the RESTful
API. For using the RESTful services I would recommend the Apache
HttpComponents client. Code examples showing how to build requests
can be found at
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
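
To give a rough idea, here is a hedged sketch of posting plain text to a
Stanbol REST endpoint with HttpClient 4.x. The "/engines" path and the
header values reflect the default Full launcher as far as I recall; please
verify them against your installation (the same request pattern applies
to the Contenthub endpoints):

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class EnhancerClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = new DefaultHttpClient();
            HttpPost post = new HttpPost("http://localhost:8080/engines");
            post.setHeader("Content-Type", "text/plain");
            // ask for the enhancement results as RDF/XML
            post.setHeader("Accept", "application/rdf+xml");
            post.setEntity(new StringEntity("Paris is the capital of France."));
            HttpResponse response = client.execute(post);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }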


best
Rupert

Comments intended for Stanbol Developers:
-----

(*) Normally I would expect the SolrIndex to only include the plain
text version of the parsed content within a field with stored=false.
However I assume that currently the index needs to store the actual
content, because it is also used to store the data. Is this correct?
If this is the case, then it will get fixed with STANBOL-471 in any case.

I also noted that "stanbolreserved_content" currently stores the
content as parsed to the content hub but is configured as
indexed="true" and type="text_general". So in the case of a PDF file
the binary content is processed as natural language text AND is also
indexed!
So if this field is used for full text indexing (which I think is not
the case, because I think the "text_all" field is used for that), then
you need to ensure that the plain text version is used for full text
indexing. The plain text contents are available from enhanced
ContentItems by using
ContentItemHelper.getBlob(contentItem,Collections.singleton("text/plain")).
As an alternative one could also use the features introduced by
STANBOL-500 for this.
If this field is used to store the actual content, then you should use
a binary field type and deactivate indexing for this field.
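
For reference, a small sketch of that helper call in context (the
Entry<UriRef,Blob> return type and the imports are my reading of the
current servicesapi and should be verified against trunk):

    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Map.Entry;
    import org.apache.clerezza.rdf.core.UriRef;
    import org.apache.stanbol.enhancer.servicesapi.Blob;
    import org.apache.stanbol.enhancer.servicesapi.ContentItem;
    import org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper;

    public class PlainTextHelper {
        // returns the stream of the plain text version, or null if the
        // ContentItem does not contain a "text/plain" Blob
        public static InputStream getPlainText(ContentItem ci) {
            Entry<UriRef, Blob> blob = ContentItemHelper.getBlob(
                    ci, Collections.singleton("text/plain"));
            return blob == null ? null : blob.getValue().getStream();
        }
    }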

(**) All *_t fields use string as field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this
is a good decision and would rather use the already defined "text_ws"
type (whitespace tokenizer, word delimiter and lower case)


best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen