Posted to dev@tika.apache.org by "Luke sh (JIRA)" <ji...@apache.org> on 2015/04/21 08:43:58 UTC
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

     [ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke sh updated TIKA-1610:
--------------------------
    Description: 
CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/ ).

It would be great if Tika could provide CBOR parsing and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR format. To read and parse the files dumped by this tool, Tika would need to support parsing CBOR. The catch is that CommonCrawlDataDumper does not dump with the correct extension: it follows its own rule, and the default extension of the dumped files is html. It would therefore be less painful if Tika could detect and parse those files without any pre-processing steps.

CommonCrawlDataDumper uses the following classes to dump CBOR:
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

fasterxml (Jackson) is a third-party library for converting JSON to CBOR and vice versa.

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not appear to define magic numbers that other applications could use to detect or identify it (PFA: rfc_cbor.jpg).
It seems that, as of now, the only ways to inform other applications of the type are the file extension (e.g. .cbor) or possibly content detection (e.g. byte histogram distribution estimation).
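
As a rough illustration of the content-detection idea, here is a minimal Java sketch that computes a byte histogram and the share of printable ASCII bytes in a buffer. The class name ByteHistogram and both method names are hypothetical and not part of Tika. The underlying assumption, that a CBOR dump saved as .html would contain noticeably more non-printable bytes than real HTML, is untested here, and any threshold would have to be estimated from real dumps.

```java
final class ByteHistogram {
    private ByteHistogram() {}

    /** Returns the relative frequency of each of the 256 possible byte values. */
    static double[] frequencies(byte[] data) {
        double[] freq = new double[256];
        if (data.length == 0) {
            return freq; // all zeros for empty input
        }
        for (byte b : data) {
            freq[b & 0xFF]++; // mask to treat the byte as unsigned
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= data.length;
        }
        return freq;
    }

    /** Returns the fraction of bytes that are printable ASCII, tab, CR or LF. */
    static double printableRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v <= 0x7E) || v == '\t' || v == '\n' || v == '\r') {
                printable++;
            }
        }
        return (double) printable / data.length;
    }
}
```

A detector could call ByteHistogram.printableRatio on the first few kilobytes of a file and treat a low ratio as a hint that the content is binary rather than HTML. This alone cannot distinguish CBOR from other binary formats.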

Another point worth attention: Tika appears to have already attempted CBOR MIME detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this detection does not work with the CBOR files dumped by CommonCrawlDataDumper.
According to http://tools.ietf.org/html/rfc7049, there is a self-describing Tag 55799 that seems intended for CBOR type identification, but it is probably up to the application to emit this tag, and it is possible that fasterxml omits it.
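
For reference, RFC 7049 (section 2.4.5) states that the self-describe tag 55799 serializes as the three bytes 0xd9 0xd9 0xf7. The following Java sketch checks whether a buffer starts with that prefix. The class name CborSelfDescribe and its method are hypothetical and are not Tika or Jackson APIs. A file written without the tag, which may be the case for the CommonCrawlDataDumper output, would simply return false here.

```java
final class CborSelfDescribe {
    // Serialized form of tag 55799 per RFC 7049, section 2.4.5.
    private static final int[] MAGIC = {0xD9, 0xD9, 0xF7};

    private CborSelfDescribe() {}

    /** Returns true if the given bytes begin with the CBOR self-describe tag. */
    static boolean hasSelfDescribeTag(byte[] prefix) {
        if (prefix == null || prefix.length < MAGIC.length) {
            return false; // too short to contain the three-byte tag
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if ((prefix[i] & 0xFF) != MAGIC[i]) { // compare as unsigned bytes
                return false;
            }
        }
        return true;
    }
}
```

Because the tag is optional, a negative result does not prove that the data is not CBOR. It only means that this particular marker is absent.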

On the other hand, it is worth noting that an entry for CBOR extension detection also needs to be added to tika-mimetypes.xml:
<glob pattern="*.cbor"/>



  was:
CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/ ).

It would be great if Tika is able to provide the support with CBOR parser and identification. In the current project with Nutch, the CommonCrawlDataDumper is a tool that comes with Nutch and it is used to dump the crawled segments to the files in the format of CBOR. In order to read/parse those dumped files by this tool, it would be great if tika is able to support the parsing and detecting, the surprise is that the CommonCrawlDataDumper is not dumping with correct extension, it dumps with its own rule, the default is html, so it might be less painful if tika is able to detect and parse those files without any pre-processing steps. 

CommonCrawlDataDumper is calling the following to dump with cbor.
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have its magic numbers to be detected/identified by other applications (PFA: rfc_cbor.jpg)
It seems that the only way to inform other applications of the type as of now is using the extension (i.e. .cbor), or probably content detection (i.e. byte histogram distribution estimation).  

There is another thing worth the attention, it looks like tika has attempted to add the support with cbor mime detection in the tika-mimetypes.xml (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the cbor file dumped by CommonCrawlDataDumper. 
According to http://tools.ietf.org/html/rfc7049, there is a self-describing Tag 55799 that seems to be used for cbor type identification, but it is probably up to the application that take care of this tag, and it is also possible that the fasterxml is not missing this tag. 

On the other hand, it is worth noting that the entries for cbor extension detection needs to be appended in the tika-mimetypes.xml too 
<glob pattern="*.cbor"/>




> CBOR Parser and detection improvement
> -------------------------------------
>
>                 Key: TIKA-1610
>                 URL: https://issues.apache.org/jira/browse/TIKA-1610
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector, mime, parser
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Priority: Trivial
>              Labels: memex
>


