Posted to dev@tika.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2018/03/15 08:29:00 UTC

[jira] [Comment Edited] (TIKA-2602) iCalendar not properly recognized as text/calendar

    [ https://issues.apache.org/jira/browse/TIKA-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400071#comment-16400071 ] 

Andreas Meier edited comment on TIKA-2602 at 3/15/18 8:28 AM:
--------------------------------------------------------------

Unfortunately, the mime-type mentioned above broke the text/x-vcalendar recognition.

I ended up with the following regexes; at the moment all available test files are identified correctly:
 
{code:xml}
<mime-type type="text/calendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:2\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.ics"/>
  <glob pattern="*.ifb"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}


{code:xml}
<mime-type type="text/x-vcalendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:1\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.vcs"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}
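
As a rough illustration of how the two rules above separate the formats, here is a minimal Python sketch. It is not Tika code: `classify` is a hypothetical helper, the `\\` of the XML values is assumed to decode to a single regex backslash, and the nested match is approximated by applying the regex to everything after the 15-character header, without reproducing Tika's buffer or region limits.

```python
import re

# Regexes taken from the two mime-type definitions above; the XML "\\" is
# assumed to decode to a single regex backslash.
ICAL_V2 = re.compile(r"(?s).*\nVERSION\s*:2\.0")
VCAL_V1 = re.compile(r"(?s).*\nVERSION\s*:1\.0")
HEADER = "BEGIN:VCALENDAR"  # 15 characters, hence offset="15" for the inner match


def classify(text):
    """Hypothetical helper: return the mime type the rules suggest, or None."""
    # Outer match: header at offset 0, compared case-insensitively.
    if text[:len(HEADER)].upper() != HEADER:
        return None
    rest = text[len(HEADER):]
    # Inner match: anchored at the start of `rest`; the leading ".*" with (?s)
    # lets VERSION appear on any later line.
    if ICAL_V2.match(rest):
        return "text/calendar"
    if VCAL_V1.match(rest):
        return "text/x-vcalendar"
    return None


sample = (
    "BEGIN:VCALENDAR\n"
    "PRODID:-//xyz Corp//NONSGML PDA Calendar Version 1.0//EN\n"
    "VERSION:2.0\n"
    "END:VCALENDAR\n"
)
print(classify(sample))  # text/calendar
```

The sample puts PRODID before VERSION, which is the case the original fixed-offset string match missed.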

I also tried to create a simpler match, but there are too many special cases, so I had to fall back on regexes.

The main problem is that the test files do not conform to RFC 5545:
- BEGIN:VCALENDAR cannot be relied on to be all uppercase
- CR LF is missing at the end of the file
- CR is missing at the end of lines

(I also considered ignoring the case of the VERSION string by setting (?i), but no test file exercises this case, so I left it untouched.)
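
To make the (?i) trade-off concrete, the following sketch compares the regex as posted with a hypothetical case-insensitive variant. Python's `re` stands in for the regex engine here, and the lowercase sample is invented for illustration, since no such test file exists.

```python
import re

# Regex as used in the mime-type definition (XML "\\" assumed to be one backslash).
strict = re.compile(r"(?s).*\nVERSION\s*:2\.0")
# Hypothetical variant with (?i) added; not part of the proposed definition.
relaxed = re.compile(r"(?si).*\nVERSION\s*:2\.0")

# Invented sample: the text following a lowercase "begin:vcalendar" header.
body = "\nprodid:-//example//EN\nversion:2.0\n"

print(bool(strict.match(body)))   # False
print(bool(relaxed.match(body)))  # True
```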


There might still be a problem with .ics or .vcs files created on a Mac if they use CR instead of LF as the line ending.
Since I can't create test files on a Mac, I would be happy if someone could provide some.
(The leading \\n in front of VERSION might then need to be replaced, e.g. with a character class such as [\\r\\n].)
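
The CR-only concern can be sketched as follows. The sample is synthetic (no real Mac-created file was available), the `[\r\n]` variant is a hypothetical alternative rather than a tested fix, and Python's `re` is used only for illustration.

```python
import re

# Regex as posted: requires an LF directly before VERSION.
lf_only = re.compile(r"(?s).*\nVERSION\s*:2\.0")
# Hypothetical variant: accepts either CR or LF before VERSION.
cr_or_lf = re.compile(r"(?s).*[\r\n]VERSION\s*:2\.0")

# Synthetic text following the BEGIN:VCALENDAR header, with CR-only line endings.
mac_body = "\rPRODID:-//example//EN\rVERSION:2.0\rEND:VCALENDAR\r"
unix_body = mac_body.replace("\r", "\n")

print(bool(lf_only.match(mac_body)))    # False
print(bool(cr_or_lf.match(mac_body)))   # True
print(bool(cr_or_lf.match(unix_body)))  # True
```

With CR LF line endings both patterns match, because the LF still sits directly in front of VERSION.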



> iCalendar not properly recognized as text/calendar
> --------------------------------------------------
>
>                 Key: TIKA-2602
>                 URL: https://issues.apache.org/jira/browse/TIKA-2602
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Major
>         Attachments: VERSION_Test
>
>
> At the moment the detection of text/calendar is covered by the following mime-type element:
> {code:xml}
>   <mime-type type="text/calendar">
>     <magic priority="50">
>       <match value="BEGIN:VCALENDAR" type="string" offset="0">
>         <match value="VERSION:2.0" type="string" offset="15:30"/>
>       </match>
>     </magic>
>     <glob pattern="*.ics"/>
>     <glob pattern="*.ifb"/>
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> {code}
> This recognition fails if VERSION:2.0 is not the first property after BEGIN:VCALENDAR.
> That is not always the case (see https://tools.ietf.org/html/rfc5545, section 3.6 "Calendar Components"), so recognition may fail for calendar objects that start with PRODID or other properties.
> Section 4, "iCalendar Object Examples", shows some of these cases:
> {code}
>        BEGIN:VCALENDAR
>        PRODID:-//xyz Corp//NONSGML PDA Calendar Version 1.0//EN
>        VERSION:2.0
>        BEGIN:VEVENT
>        DTSTAMP:19960704T120000Z
>        UID:uid1@example.com
>        ORGANIZER:mailto:jsmith@example.com
>        DTSTART:19960918T143000Z
>        DTEND:19960920T220000Z
>        STATUS:CONFIRMED
>        CATEGORIES:CONFERENCE
>        SUMMARY:Networld+Interop Conference
>        DESCRIPTION:Networld+Interop Conference
>          and Exhibit\nAtlanta World Congress Center\n
>         Atlanta\, Georgia
>        END:VEVENT
>        END:VCALENDAR
> {code}
> or
> {code}
>        BEGIN:VCALENDAR
>        METHOD:xyz
>        VERSION:2.0
>        PRODID:-//ABC Corporation//NONSGML My Product//EN
>        BEGIN:VEVENT
>        DTSTAMP:19970324T120000Z
>        SEQUENCE:0
>        UID:uid3@example.com
>        ORGANIZER:mailto:jdoe@example.com
>        ATTENDEE;RSVP=TRUE:mailto:jsmith@example.com
>        DTSTART:19970324T123000Z
>        DTEND:19970324T210000Z
>        CATEGORIES:MEETING,PROJECT
>        CLASS:PUBLIC
>        SUMMARY:Calendaring Interoperability Planning Meeting
>        DESCRIPTION:Discuss how we can test c&s interoperability\n
>         using iCalendar and other IETF standards.
>        LOCATION:LDB Lobby
>        ATTACH;FMTTYPE=application/postscript:ftp://example.com/pub/
>         conf/bkgrnd.ps
>        END:VEVENT
>        END:VCALENDAR
> {code}
> I suggest either
> a) widening the offset of the VERSION match from 15:30 to 15:200 or something similar (not a great approach, since we don't know how long the PRODID might be),
> or
> b) adding sub-matches for CALSCALE, PRODID and METHOD. (This might still not cover everything, since there are also x-prop and iana-prop properties. For now I can only confirm that PRODID or METHOD occur as the first property after BEGIN:VCALENDAR.)
> Regards
> Andreas


