Posted to commits@stanbol.apache.org by "Rupert Westenthaler (Created) (JIRA)" <ji...@apache.org> on 2012/04/05 09:33:25 UTC

[jira] [Created] (STANBOL-577) Add Interfaces for parsing Content

Add Interfaces for parsing Content
----------------------------------

Key: STANBOL-577
URL: https://issues.apache.org/jira/browse/STANBOL-577
Project: Stanbol
Issue Type: Sub-task
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler

Currently, the different types of ContentItem define their own constructors to fit their specific implementations. For example, the InMemoryBlob defines constructors that accept the content as a byte array. This makes complete sense for this implementation, because it allows data that is already loaded in memory to be passed in directly. The WebContentItem, as another example, cannot support a constructor taking a byte array, because at the time of construction only the URL of (that is, a reference to) the content is available. Likewise, for a file-based ContentItem implementation a constructor taking a byte array would not be preferable, as the whole point of such an implementation is to avoid loading the whole content into memory.

However, with the introduction of a factory pattern to construct ContentItems, the interfaces used to pass content MUST be normalized, because they are part of the API of the ContentItemFactory interface. To solve this, the following two interfaces are added to the Stanbol Enhancer API.

First, the __ContentSource__ interface, intended to be used for already dereferenced content:

/** the content as stream */
+ getStream() : InputStream
/** the content as byte array */
+ getData() : byte[]
/** optionally the media type of the content */
+ getMediaType() : String
/** optionally the file name of the content */
+ getFileName() : String
/** optionally additional headers */
+ getHeaders() : Map<String,List<String>>

With the following default implementations:

* StreamSource: A ContentSource wrapping an InputStream. Multiple calls to #getStream() will not be supported. Calls to #getData() will load the contents provided by the stream into memory.
* ByteArraySource: A ContentSource implementation that internally uses a byte array. To be used in cases where users need to pass content to the Stanbol Enhancer that is already loaded in memory. Calls to #getData() MUST NOT copy the internal byte array.
* StringSource: A ContentSource implementation that allows a String instance to be passed directly.
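
The interface and the three default implementations described above could look roughly like the following. This is only a minimal sketch, not the committed Stanbol code: the use of Java 8 default methods for the optional getters, the unchecked exception wrapping, the IllegalStateException on a second getStream() call and the UTF-8 encoding in StringSource are all assumptions made to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** The five accessors listed in the issue; the optional ones get defaults (assumption). */
interface ContentSource {
    InputStream getStream();
    byte[] getData();
    default String getMediaType() { return null; }
    default String getFileName() { return null; }
    default Map<String, List<String>> getHeaders() { return Collections.emptyMap(); }
}

/** Wraps content that is already in memory. */
class ByteArraySource implements ContentSource {
    private final byte[] data;
    ByteArraySource(byte[] data) { this.data = data; }
    @Override public InputStream getStream() { return new ByteArrayInputStream(data); }
    /** Returns the internal array itself, without copying it. */
    @Override public byte[] getData() { return data; }
}

/** Passes a String; the UTF-8 encoding is an assumption of this sketch. */
class StringSource extends ByteArraySource {
    StringSource(String text) { super(text.getBytes(StandardCharsets.UTF_8)); }
    @Override public String getMediaType() { return "text/plain; charset=UTF-8"; }
}

/** Wraps an InputStream that can be handed out only once. */
class StreamSource implements ContentSource {
    private final InputStream in;
    private boolean consumed;
    StreamSource(InputStream in) { this.in = in; }

    @Override public InputStream getStream() {
        // Rejecting a second call is one possible reading of "not supported".
        if (consumed) throw new IllegalStateException("stream already handed out");
        consumed = true;
        return in;
    }

    /** Reads the whole stream into memory. */
    @Override public byte[] getData() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (InputStream stream = getStream()) {
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In this sketch a caller holding a byte[] gets the same array back from ByteArraySource#getData(), while StreamSource#getData() consumes the wrapped stream.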

Note that ContentItem/Blob implementations that

* store the content in-memory should prefer to call ContentSource#getData() to retrieve the content from the ContentSource
* stream the content to a file/database/CMS need to use ContentSource#getStream() to avoid loading the whole content in-memory!
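
To illustrate the two recommendations above, here is a hedged sketch of how consuming implementations might pick the accessor. InMemoryBlobSketch and FileBlobSketch are hypothetical names, not Stanbol classes, and ContentSource is trimmed to the two methods that matter here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Trimmed to the two accessors used in this sketch. */
interface ContentSource {
    InputStream getStream();
    byte[] getData();
}

/** Hypothetical in-memory blob: keeps the array returned by getData(). */
class InMemoryBlobSketch {
    final byte[] data;
    InMemoryBlobSketch(ContentSource source) {
        this.data = source.getData();
    }
}

/** Hypothetical file-backed blob: copies the stream to disk without buffering it all. */
class FileBlobSketch {
    final Path file;
    FileBlobSketch(ContentSource source) {
        Path tmp;
        try (InputStream in = source.getStream()) {
            tmp = Files.createTempFile("content", ".bin");
            // createTempFile already created the file, so it has to be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.file = tmp;
    }
}
```

The in-memory variant benefits from a ByteArraySource because no copy is made, whereas the file-backed variant never holds more than the copy buffer in memory.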

Second, the __ContentReference__ interface, intended to be used to create ContentItems/Blobs for content where only a reference is available:

/** the Reference to the content */
+ getReference() : String
/** dereferences the content */
+ dereference() : ContentSource

With the following default implementation:

* UrlReference: Allows any Java URL to be used to reference content. This is basically a replacement for the current WebContentItem implementation.
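
A minimal sketch of how a URL-based reference could defer loading until dereference() is called is shown below. The constructor, the eager opening of the connection inside dereference(), the unchecked exception wrapping and the trimmed ContentSource are assumptions of this example, not the actual Stanbol implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

/** Trimmed to what this sketch needs. */
interface ContentSource {
    InputStream getStream();
    String getMediaType();
}

interface ContentReference {
    String getReference();
    ContentSource dereference();
}

/** Holds only the URL; nothing is loaded until dereference() is called. */
class UrlReference implements ContentReference {
    private final URL url;

    UrlReference(URL url) { this.url = url; }

    @Override public String getReference() { return url.toString(); }

    @Override public ContentSource dereference() {
        try {
            final URLConnection connection = url.openConnection();
            final InputStream in = connection.getInputStream();
            // May be null if the content type cannot be determined.
            final String mediaType = connection.getContentType();
            return new ContentSource() {
                @Override public InputStream getStream() { return in; }
                @Override public String getMediaType() { return mediaType; }
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A factory could then accept either a ContentSource directly or a ContentReference that it dereferences itself when the content is actually needed.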

Both interfaces and implementations will be part of the Stanbol Enhancer Services API module.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (STANBOL-577) Add Interfaces for parsing Content

Posted by "Rupert Westenthaler (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-577.
-----------------------------------------

    Resolution: Fixed

implemented and documented with #1324645
                
> Add Interfaces for parsing Content
> ----------------------------------
>
>                 Key: STANBOL-577
>                 URL: https://issues.apache.org/jira/browse/STANBOL-577
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
