Posted to dev@tika.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2018/03/15 08:29:00 UTC
[jira] [Comment Edited] (TIKA-2602) iCalendar not properly
recognized as text/calendar
[ https://issues.apache.org/jira/browse/TIKA-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400071#comment-16400071 ]
Andreas Meier edited comment on TIKA-2602 at 3/15/18 8:28 AM:
--------------------------------------------------------------
Unfortunately, the mime-type mentioned above broke the text/x-vcalendar recognition.
I ended up with the following regexes; at the moment all available test files are identified correctly:
{code:xml}
<mime-type type="text/calendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:2\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.ics"/>
  <glob pattern="*.ifb"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}
{code:xml}
<mime-type type="text/x-vcalendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:1\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.vcs"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}
I also tried to come up with a simpler match, but there are too many special cases, so I had to fall back on regexes.
The main problem is that the test files do not conform to RFC 5545:
- Can't rely on BEGIN:VCALENDAR being all uppercase
- Missing CR LF at the end of the file
- Missing CR at the end of lines
(I also considered making the VERSION match case-insensitive by setting (?i), but no test file covers that case, so I left it unchanged.)
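The deviations listed above can be exercised outside Tika with plain java.util.regex. The following is a minimal sketch, not Tika's detector: the class name, the detect helper and the sample strings are hypothetical. It assumes the doubled backslashes in the XML reach the regex engine as single backslashes, and it models offset="15" simply by applying the pattern to the text after the 15-character BEGIN:VCALENDAR prefix.

```java
import java.util.regex.Pattern;

final class CalendarMagicSketch {
    static final String PREFIX = "BEGIN:VCALENDAR";

    // The two sub-match regexes from the comment, written as Java string literals.
    // (?s) lets '.' cross line breaks, so VERSION may follow PRODID, METHOD, etc.
    static final Pattern V2 = Pattern.compile("(?s).*\\nVERSION\\s*:2\\.0");
    static final Pattern V1 = Pattern.compile("(?s).*\\nVERSION\\s*:1\\.0");

    // Hypothetical helper: returns the mime type this sketch would pick.
    static String detect(String content) {
        // Mirrors type="stringignorecase" offset="0".
        if (!content.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return "text/plain";
        }
        // Assumption: offset="15" is modeled by matching from the end of the prefix.
        String rest = content.substring(PREFIX.length());
        if (V2.matcher(rest).lookingAt()) {
            return "text/calendar";
        }
        if (V1.matcher(rest).lookingAt()) {
            return "text/x-vcalendar";
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        String[] samples = {
            // CRLF line ends, PRODID before VERSION
            "BEGIN:VCALENDAR\r\nPRODID:-//xyz Corp//EN\r\nVERSION:2.0\r\nEND:VCALENDAR\r\n",
            // lowercase BEGIN line, LF only, no line break at end of file
            "begin:vcalendar\nVERSION:1.0\nEND:VCALENDAR",
            // lowercase VERSION: not matched, because (?i) is not set
            "BEGIN:VCALENDAR\nversion:2.0\n"
        };
        for (String sample : samples) {
            System.out.println(detect(sample));
        }
    }
}
```

In this sketch the first two samples are classified as text/calendar and text/x-vcalendar, while the lowercase version:2.0 sample falls through to text/plain, which is the case that adding (?i) would cover.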
There might still be a problem with .ics or .vcs files created on a Mac, if the files use CR instead of LF as the line ending.
Since I can't create test files for Mac, I would be happy if someone could provide some.
(The leading [\\n] in front of VERSION might need to be replaced then...)
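To make the CR concern concrete, here is a small sketch comparing the current sub-match regex with one possible CR-tolerant variant. The variant, the class name and the helper are assumptions for illustration only and have not been tried against Tika or real Mac-generated files; in the XML it would presumably read (?s).*[\\r\\n]VERSION\\s*:2\\.0.

```java
import java.util.regex.Pattern;

final class CrTolerantVersionSketch {
    // Current regex from the comment: requires LF directly before VERSION.
    static final Pattern LF_ONLY = Pattern.compile("(?s).*\\nVERSION\\s*:2\\.0");

    // Hypothetical variant: accepts CR or LF before VERSION.
    // Inside a character class no '|' is needed.
    static final Pattern CR_OR_LF = Pattern.compile("(?s).*[\\r\\n]VERSION\\s*:2\\.0");

    // Assumes content starts with the 15-character BEGIN:VCALENDAR prefix.
    static boolean matchesAfterPrefix(Pattern pattern, String content) {
        String rest = content.substring("BEGIN:VCALENDAR".length());
        return pattern.matcher(rest).lookingAt();
    }

    public static void main(String[] args) {
        String crOnly = "BEGIN:VCALENDAR\rPRODID:-//ABC//EN\rVERSION:2.0\r";
        String crLf = "BEGIN:VCALENDAR\r\nPRODID:-//ABC//EN\r\nVERSION:2.0\r\n";

        System.out.println(matchesAfterPrefix(LF_ONLY, crOnly));   // false
        System.out.println(matchesAfterPrefix(CR_OR_LF, crOnly));  // true
        System.out.println(matchesAfterPrefix(LF_ONLY, crLf));     // true
        System.out.println(matchesAfterPrefix(CR_OR_LF, crLf));    // true
    }
}
```

The variant still matches LF and CR LF files, so it would only widen what is accepted; whether that is safe against all existing test files would need to be checked in Tika itself.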
was (Author: andreasmeier):
Unfortunately, the mime-type mentioned above broke the text/x-vcalendar recognition.
I ended up with the following regexes; at the moment all available test files are identified correctly:
{code:xml}
<mime-type type="text/calendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:2\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.ics"/>
  <glob pattern="*.ifb"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}
{code:xml}
<mime-type type="text/x-vcalendar">
  <magic priority="50">
    <match value="BEGIN:VCALENDAR" type="stringignorecase" offset="0">
      <match value="(?s).*\\nVERSION\\s*:1\\.0" type="regex" offset="15" />
    </match>
  </magic>
  <glob pattern="*.vcs"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}
I also tried to come up with a simpler match, but there are too many special cases, so I had to fall back on regexes.
The main problem is that the test files do not conform to RFC 5545:
- Can't rely on BEGIN:VCALENDAR being all uppercase
- Missing CR LF at the end of the file
- Missing CR at the end of lines
(I also considered making the VERSION match case-insensitive by setting (?i), but no test file covers that case, so I left it unchanged.)
There might still be a problem with .ics or .vcs files created on a Mac, if the files contain \\r instead of \\n.
Since I can't create test files for Mac, *I would be happy if someone could provide some files*.
(The leading \\n might need to be replaced with [\\r|\\n] then...)
> iCalendar not properly recognized as text/calendar
> --------------------------------------------------
>
> Key: TIKA-2602
> URL: https://issues.apache.org/jira/browse/TIKA-2602
> Project: Tika
> Issue Type: Improvement
> Reporter: Andreas Meier
> Priority: Major
> Attachments: VERSION_Test
>
>
> At the moment the detection of text/calendar is covered by the following mime-type element:
> {code:xml}
> <mime-type type="text/calendar">
>   <magic priority="50">
>     <match value="BEGIN:VCALENDAR" type="string" offset="0">
>       <match value="VERSION:2.0" type="string" offset="15:30"/>
>     </match>
>   </magic>
>   <glob pattern="*.ics"/>
>   <glob pattern="*.ifb"/>
>   <sub-class-of type="text/plain"/>
> </mime-type>
> {code}
> This recognition fails if VERSION:2.0 is not the first property after BEGIN:VCALENDAR.
> Since that is not always the case (see [https://tools.ietf.org/html/rfc5545|https://tools.ietf.org/html/rfc5545], section 3.6, Calendar Components), recognition may fail for calendar objects in which PRODID or other properties come before VERSION.
> Section "4. iCalendar Object Examples" shows some of these cases:
> {code}
> BEGIN:VCALENDAR
> PRODID:-//xyz Corp//NONSGML PDA Calendar Version 1.0//EN
> VERSION:2.0
> BEGIN:VEVENT
> DTSTAMP:19960704T120000Z
> UID:uid1@example.com
> ORGANIZER:mailto:jsmith@example.com
> DTSTART:19960918T143000Z
> DTEND:19960920T220000Z
> STATUS:CONFIRMED
> CATEGORIES:CONFERENCE
> SUMMARY:Networld+Interop Conference
> DESCRIPTION:Networld+Interop Conference
>   and Exhibit\nAtlanta World Congress Center\n
>  Atlanta\, Georgia
> END:VEVENT
> END:VCALENDAR
> {code}
> or
> {code}
> BEGIN:VCALENDAR
> METHOD:xyz
> VERSION:2.0
> PRODID:-//ABC Corporation//NONSGML My Product//EN
> BEGIN:VEVENT
> DTSTAMP:19970324T120000Z
> SEQUENCE:0
> UID:uid3@example.com
> ORGANIZER:mailto:jdoe@example.com
> ATTENDEE;RSVP=TRUE:mailto:jsmith@example.com
> DTSTART:19970324T123000Z
> DTEND:19970324T210000Z
> CATEGORIES:MEETING,PROJECT
> CLASS:PUBLIC
> SUMMARY:Calendaring Interoperability Planning Meeting
> DESCRIPTION:Discuss how we can test c&s interoperability\n
>  using iCalendar and other IETF standards.
> LOCATION:LDB Lobby
> ATTACH;FMTTYPE=application/postscript:ftp://example.com/pub/
>  conf/bkgrnd.ps
> END:VEVENT
> END:VCALENDAR
> {code}
> I suggest either
> a) widening the offset of the VERSION match from 15:30 to 15:200 or something similar (not a great approach, since we don't know how long the PRODID might be)
> or
> b) adding sub-matches for CALSCALE, PRODID and METHOD. (This might still not cover everything, since there are also x-prop and iana-prop properties. So far I can only confirm that PRODID or METHOD occur as the first property after BEGIN:VCALENDAR.)
> Regards
> Andreas