
Posted to dev@tika.apache.org by "Troy Witthoeft (JIRA)" <ji...@apache.org> on 2011/06/22 18:44:47 UTC

[jira] [Created] (TIKA-679) Proposal for PRT Parser

Proposal for PRT Parser
-----------------------

                 Key: TIKA-679
                 URL: https://issues.apache.org/jira/browse/TIKA-679
             Project: Tika
          Issue Type: Improvement
          Components: mime, parser
    Affects Versions: 0.9
            Reporter: Troy Witthoeft
            Priority: Minor


It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.

{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.
 * It scans for a two-byte prefix and emits the text that follows it.
 * It does not set any metadata yet.
 */

public class PRTParser implements Parser {

        private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
        public static final String PRT_MIME_TYPE = "application/prt";
        
		
        public Set<MediaType> getSupportedTypes(ParseContext context) {
                return SUPPORTED_TYPES;
        }
		
        public void parse(
                        InputStream stream, ContentHandler handler,
                        Metadata metadata, ParseContext context)
                        throws IOException, SAXException, TikaException {

                XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
                xhtml.startDocument();

                // Two-byte marker that precedes each text entry.
                byte[] prefix = new byte[] {0x01, 0x1F};
                int pos = 0;   // number of prefix bytes matched so far
                int read;
                // stream.read() returns the next byte as 0-255, or -1 at EOF.
                while ((read = stream.read()) > -1) {
                    if (read == prefix[pos]) {
                        pos++;
                        if (pos == prefix.length) {
                            // Prefix found: a length byte and one unknown byte follow.
                            int length = stream.read();
                            int unknown = stream.read();
                            if (length < 0 || unknown < 0) {
                                break;   // file ended inside a record
                            }
                            byte[] text = new byte[length];
                            IOUtils.readFully(stream, text);   // fill the array from the stream

                            // Drop the trailing null terminator, if any.
                            int end = text.length;
                            while (end > 0 && text[end - 1] == 0) {
                                end--;
                            }
                            // Assumes the text is UTF-8.
                            String str = new String(text, 0, end, "UTF-8");
                            xhtml.startElement("p");
                            xhtml.characters(str);
                            xhtml.endElement("p");
                            pos = 0;   // start looking for the next prefix
                        }
                    } else {
                        // Mismatch: the current byte may itself start a new prefix.
                        pos = (read == prefix[0]) ? 1 : 0;
                    }
                }
                xhtml.endDocument();
        }

        /**
         * @deprecated This method will be removed in Apache Tika 1.0.
         */
        public void parse(
                        InputStream stream, ContentHandler handler, Metadata metadata)
                        throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
        }
}
{code}   

I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
Here are my findings.

The file header contains a magic identifier (used for MIME type detection), a file creation date, and a file description.
The MIME type can be detected with the magic rule <match value="0M3C" type="string" offset="8" />
If present, the file creation date comes after the identifier. It is in the format YYYYMMDDhhmm. It is always at the same address, 0x001E-0x002A, i.e. the 31st-43rd bytes.
If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C, i.e. the 44th-541st bytes. It is terminated with [00][01][C8]
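
The header layout above can be sketched in code. This is only a rough illustration, assuming the whole file is already in a byte array and that the date and description are plain ASCII (the encoding has not been confirmed). The class and method names are hypothetical and not part of Tika.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: reads the fixed-offset header fields. */
final class PrtHeaderSketch {
    static final int MAGIC_OFFSET = 8;     // offset taken from the <match> rule
    static final int DATE_OFFSET = 0x1E;   // 12 digits, YYYYMMDDhhmm
    static final int DATE_LENGTH = 12;
    static final int DESC_OFFSET = 0x2B;   // description field, ends at 0x21C
    static final int DESC_MAX = 498;

    /** True if the ASCII string "0M3C" sits at offset 8. */
    static boolean hasMagic(byte[] file) {
        byte[] magic = "0M3C".getBytes(StandardCharsets.US_ASCII);
        if (file.length < MAGIC_OFFSET + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (file[MAGIC_OFFSET + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the 12-digit date string, or "" if the field is absent or not numeric. */
    static String creationDate(byte[] file) {
        if (file.length < DATE_OFFSET + DATE_LENGTH) {
            return "";
        }
        for (int i = 0; i < DATE_LENGTH; i++) {
            byte b = file[DATE_OFFSET + i];
            if (b < '0' || b > '9') {
                return "";
            }
        }
        return new String(file, DATE_OFFSET, DATE_LENGTH, StandardCharsets.US_ASCII);
    }

    /** Returns the description up to the first null byte, or "" if there is none. */
    static String description(byte[] file) {
        int end = Math.min(file.length, DESC_OFFSET + DESC_MAX);
        int len = 0;
        while (DESC_OFFSET + len < end && file[DESC_OFFSET + len] != 0) {
            len++;
        }
        return len == 0 ? "" : new String(file, DESC_OFFSET, len, StandardCharsets.US_ASCII);
    }
}
```

Stopping at the first null byte relies on the [00][01][C8] terminator starting with [00]. Whether an absent date field is zero-filled or holds something else is not known, which is why the sketch checks for digits.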

The goal is to extract the user entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.

The prefix always contains six [33] bytes followed by [E3][3F]. These are followed by 10 variable bytes, then a byte giving the length of the user input text + 1, and a null byte.

GUIDE 
[33][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
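
To make the guide concrete, here is a minimal sketch that pulls the text out of one record laid out exactly as in the EXAMPLE line. It assumes the prefix start is already known and that the text is UTF-8, which is unconfirmed. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes one record that follows the GUIDE layout. */
final class PrtRecordSketch {
    // 6 x [33] + [E3][3F] + 10 variable bytes come before the length byte.
    static final int LENGTH_OFFSET = 18;

    /**
     * @param data        bytes containing the record
     * @param prefixStart index of the first [33] byte of the prefix
     * @return the user text, without its trailing null byte
     */
    static String readText(byte[] data, int prefixStart) {
        int lengthIndex = prefixStart + LENGTH_OFFSET;
        if (prefixStart < 0 || lengthIndex + 2 > data.length) {
            throw new IllegalArgumentException("record is truncated");
        }
        int ln = data[lengthIndex] & 0xFF;   // text length + 1 (the trailing null)
        int textStart = lengthIndex + 2;     // skip [ln] and the [00] after it
        int textLength = ln - 1;
        if (textLength < 0 || textStart + textLength > data.length) {
            throw new IllegalArgumentException("bad length byte");
        }
        return new String(data, textStart, textLength, StandardCharsets.UTF_8);
    }
}
```

For the EXAMPLE bytes, the length byte is [05], so four text bytes [54][49][4B][41] are read and the final [00] is left out.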

Any pointers on how to improve the code are appreciated.
 







--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Comment: was deleted

(was: Can you find the text inside?)

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058065#comment-13058065 ] 

Troy Witthoeft edited comment on TIKA-679 at 6/30/11 9:02 PM:
--------------------------------------------------------------

Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø]
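
One possible reason for the missing special characters, offered only as a guess: if the file stores ±, ° and Ø as single bytes (0xB1, 0xB0, 0xD8 in ISO-8859-1), decoding them as UTF-8 yields replacement characters. The sketch below tries strict UTF-8 first and falls back to ISO-8859-1. The actual encoding used by the CAD package has not been verified, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 decode with a single-byte fallback. */
final class PrtTextDecodeSketch {
    static String decode(byte[] text) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(text))
                    .toString();
        } catch (CharacterCodingException e) {
            // ASSUMPTION: non-UTF-8 bytes are ISO-8859-1; this is unverified for PRT files.
            return new String(text, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Plain ASCII text decodes the same either way, so the fallback only changes the result for bytes above 0x7F.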

The biggest difficulty in this code is selecting the correct prefix to search for.
The prefix DOES have a pattern.  For instance, note that the byte in the green position always represents the length of the text + 1.

PREFIX ROUGH GUIDE
[3#][33][33][33][33][33]{color:blue}[E3][3F]{color}[0#][00][00][0#][00][00][0#][0#][0#][1F]{color:green}[LN]{color}[00]{color:red}[USERINPUT TEXT]{color}[00][xx]

EXAMPLE
[33][33][33][33][33][33]{color:blue}[E3][3F]{color}[00][00][00][00][00][00][00][02][01][1F]{color:green}[05]{color}[00]{color:red}[54][49][4B][41]{color}[00][0B] .... {color:red}TIKA{color}


Once we narrow down text detection, we can move on to extracting the date and file description.
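
One way to narrow down detection is to check the whole rough guide pattern rather than only two bytes. The sketch below is an assumption-laden illustration: it treats [3#] and [0#] as "high nibble 3" and "high nibble 0", requires the fixed [00], [1F] and trailing null bytes, and works on a byte array instead of a stream. The guide itself is only rough, so real files may need looser checks. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: scans a buffer for records matching the rough guide. */
final class PrtStrictScanSketch {
    private static final int PREFIX_LENGTH = 18;   // bytes before the length byte

    /** True if the 18-byte prefix plus [ln][00] starts at index i. */
    static boolean matchesPrefix(byte[] d, int i) {
        if (i + PREFIX_LENGTH + 2 > d.length) {
            return false;
        }
        if ((d[i] & 0xF0) != 0x30) {                       // [3#]
            return false;
        }
        for (int k = 1; k <= 5; k++) {                     // five more [33]
            if ((d[i + k] & 0xFF) != 0x33) {
                return false;
            }
        }
        if ((d[i + 6] & 0xFF) != 0xE3 || (d[i + 7] & 0xFF) != 0x3F) {
            return false;
        }
        for (int k : new int[] {9, 10, 12, 13, 19}) {      // fixed [00] positions
            if (d[i + k] != 0) {
                return false;
            }
        }
        for (int k : new int[] {8, 11, 14, 15, 16}) {      // [0#] positions
            if ((d[i + k] & 0xF0) != 0) {
                return false;
            }
        }
        return (d[i + 17] & 0xFF) == 0x1F;
    }

    /** Returns the text of every record whose prefix and null terminator check out. */
    static List<String> extract(byte[] d) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < d.length) {
            if (matchesPrefix(d, i)) {
                int ln = d[i + 18] & 0xFF;                 // text length + 1
                int start = i + 20;
                int end = start + ln - 1;                  // index of the trailing null
                if (ln > 0 && end < d.length && d[end] == 0) {
                    out.add(new String(d, start, ln - 1, StandardCharsets.UTF_8));
                    i = end + 1;
                    continue;
                }
            }
            i++;
        }
        return out;
    }
}
```

Requiring the null terminator at the position implied by the length byte is an extra consistency check that should reject many false matches, at the cost of missing records that deviate from the guide.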

NOTE: Magic Mime type on this file <match value="0M3C" type="string" offset="8" />


  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Description: 
It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.

{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
 * Searches for specific byte prefix, and outputs text from note entities
 * Does not support special DRAFT-PAK characters.
 */

public class PRTParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
    public static final String PRT_MIME_TYPE = "application/prt";
        	
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
        }
		
    public void parse(
		InputStream stream, ContentHandler handler,
		Metadata metadata, ParseContext context)
		throws IOException, SAXException, TikaException {
		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
		xhtml.startDocument();
		int[] prefix = new int[] {0xE3, 0x3F};	// marker bytes [E3][3F] (227, 63)
		int pos = 0;	// number of prefix bytes matched so far
		int read;
		// stream.read() returns the next byte as 0-255, or -1 at EOF.
		while ((read = stream.read()) > -1) {
			if (read == prefix[pos]) {	// does this byte match the next expected prefix byte?
				pos++;
				if (pos == prefix.length) {
					// Skip the variable part of the prefix. NOTE: the guide shows 10 bytes
					// between [E3][3F] and the length byte, so this count may need adjusting.
					stream.skip(11);
					int length = stream.read();	// length of the text + 1, see PRT schema
					stream.skip(1);	// null byte after the length
					if (length < 0) {
						break;	// file ended inside a record
					}
					byte[] text = new byte[length];	// raw bytes of the user-entered text
					IOUtils.readFully(stream, text);
					// Assumes UTF-8; the trailing null terminator is not removed.
					String str = new String(text, 0, text.length, "UTF-8");
					xhtml.startElement("p");
					xhtml.characters(str);
					xhtml.endElement("p");
					pos = 0;	// start looking for the next prefix
				}
			} else {
				// Mismatch: the current byte may itself start a new prefix.
				pos = (read == prefix[0]) ? 1 : 0;
			}
		}
		xhtml.endDocument();
	}


		
	/**
    * @deprecated This method will be removed in Apache Tika 1.0.
    */
    public void parse(
                   InputStream stream, ContentHandler handler, Metadata metadata)
                   throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
    }
}{code}   

I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
Here are my findings.

The file header contains a magic identifier (used for MIME type detection), a file creation date, and a file description.
The MIME type can be detected with the magic rule <match value="0M3C" type="string" offset="8" />
If present, the file creation date comes after the identifier. It is in the format YYYYMMDDhhmm. It is always at the same address, 0x001E-0x002A, i.e. the 31st-43rd bytes.
If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C, i.e. the 44th-541st bytes. It is terminated with [00][01][C8]

The goal is to extract the user entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.

The prefix always contains six [33] bytes (the first may vary, see [3#] in the guide) followed by [E3][3F]. These are followed by 10 variable bytes, then a byte giving the length of the user input text + 1, and a null byte.

GUIDE 
[3#][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA

Any pointers on how to improve the code are appreciated.
 











[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Affects Version/s:     (was: 0.9)

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * It also currently sets some dummy metadata.
>  */
> public class PRTParser implements Parser {
>         private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>         public static final String PRT_MIME_TYPE = "application/prt";
>         
> 		
>         public Set<MediaType> getSupportedTypes(ParseContext context) {
>                 return SUPPORTED_TYPES;
>         }
> 		
>         public void parse(
>                         InputStream stream, ContentHandler handler,
>                         Metadata metadata, ParseContext context)
>                         throws IOException, SAXException, TikaException {
> 						byte[] prefix = new byte[] {0x01, 0x1F};  //
> 						int pos = 0;
> 						int read;
> 						while( (read = stream.read()) > -1) {   //  Reads the next single byte of data (returns byte pos) until you hit the EOF
> 						  if(read == prefix[pos]) {				//  If the byte being read is 
> 							pos++;
> 							if(pos == prefix.length) {
> 							  // found it!
> 							  int length = stream.read();
> 							  int unknown = stream.read();
> 							  byte[] text = new byte[length];
> 							  IOUtils.readFully(stream, text);      //reads a selected byte array from the InputStream
> 							  // turn it into a string, removing null termination
> 							  // assumes it's found to be utf-8
> 							  String str = new String(text, 0, text.length, "UTF-8");
> 							  XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 							  xhtml.startElement("p");	
> 							  xhtml.characters(str);
> 							  xhtml.endElement("p");
> 							pos--;  
> 							}
> 						  } else {
> 							pos = 0;
> 						  }
> 						}
>         }
>         /**
>          * @deprecated This method will be removed in Apache Tika 1.0.
>          */
>         public void parse(
>                         InputStream stream, ContentHandler handler, Metadata metadata)
>                         throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>         }
> }
> {code}   
> I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
> Here are my findings.
> The file header contains a magic identifier, the file creation date, and a file description.
> The magic identifier can be matched with  <match value="0M3C" type="string" offset="8" />
> If present, the file creation date follows the identifier. It is in the format YYYYMMDDhhmm. It is always at the same address, 0x001E-0x002A, i.e. the 31st-43rd bytes.
> If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C, i.e. the 44th-541st bytes. It is terminated with [00][01][C8].
> The goal is to extract the user-entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.
> The prefix is always marked by the presence of six 3's and [E3][3F], followed by 10 variable bytes, then a byte giving the length of the user input text + 1, and a null.
> GUIDE 
> [33][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]
> EXAMPLE
> [33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
> Any pointers on how to improve the code are appreciated.
>  


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059639#comment-13059639 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/5/11 8:38 PM:
-------------------------------------------------------------

I've narrowed the encoding down to CP437.
CP437 correctly identifies many of the engineering symbols, such as [±] "plus minus" and [º] "degree", but fails on "diameter".
PRT files actually store the diameter symbol as three characters, with the second one always being [φ] "lowercase phi".
While not identical, the Nordic [Ø] "O with slash" is often accepted as the diameter symbol.

You may find a more elegant solution looking at [http://en.wikipedia.org/wiki/Code_page_437]
I've simply been substituting.


String str = new String(text, 0, text.length, "Cp437");
str = str.replace("\u03C6","\u00D8");

I've attached a patch against r1143194.
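
A minimal, standalone sketch of the substitution described above, for anyone who wants to try it outside Tika. The class name PrtTextDecoder and the trimming of trailing NUL bytes are assumptions made for illustration and are not part of the attached patch. "IBM437" is the canonical Java name for the charset that the snippet above requests as "Cp437"; it ships with the standard JDK but may be absent from a stripped-down runtime.

```java
import java.nio.charset.Charset;

final class PrtTextDecoder {
    // "Cp437" (used in the comment above) is an alias of this charset.
    private static final Charset CP437 = Charset.forName("IBM437");

    /** Decodes raw note bytes as CP437 and maps lowercase phi to O-with-slash. */
    static String decode(byte[] raw) {
        String s = new String(raw, CP437);
        // Assumption: trailing NUL terminators are not wanted in the output.
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        // Same substitution as in the comment: phi (U+03C6) -> O with slash (U+00D8).
        return s.substring(0, end).replace('\u03C6', '\u00D8');
    }

    public static void main(String[] args) {
        // 0xF1 is plus-minus, 0xF8 is the degree sign, 0xED is lowercase phi in CP437.
        byte[] sample = {(byte) 0xF1, '0', '.', '5', (byte) 0xF8, ' ', (byte) 0xED, '1', '2', 0};
        System.out.println(decode(sample));
    }
}
```

Note that this only rewrites the phi character itself; the other two characters of the three-character diameter sequence are left untouched, exactly as in the two-line version above.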


      was (Author: runamok81):
    I've narrowed the encoding down to CP437.
CP437 correctly identifies many of the engineering symbols, such as [±] "plus minus," [º] degree," but fails on "diameter"
PRT files actually store the diameter symbol as three characters, with the second one always being [φ] "lowercase phi"
While not identical, the Nordic [Ø] "O with slash" is often accepted as the diameter symbol. 

You may find a more elegant solution looking at [http://en.wikipedia.org/wiki/Code_page_437]
I've simply been substituting.


String str = new String(text, 0, text.length, "Cp437");
str = str.replace("\u03C6","\u00D8");

  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Cp437 turn it into a string, but does not remove null termination, assumes it's found to be MS-DOS Encoding
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058065#comment-13058065 ] 

Troy Witthoeft edited comment on TIKA-679 at 6/30/11 8:45 PM:
--------------------------------------------------------------

The biggest difficulty in this code is selecting the correct prefix.
Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø] 


      was (Author: runamok81):
    The biggest difficulty in this code is selecting the correct prefix.
Not sure on the encoding, but UTF-8 seems to be working well
Currently, this parser will get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø] 

  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
> I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
> Here are my findings.
> The file header contains a magic identifier, the file creation date, and a file description.
> The magic identifier can be matched with  <match value="0M3C" type="string" offset="8" />
> If present, the file creation date follows the identifier. It is in the format YYYYMMDDhhmm. It is always at the same address, 0x001E-0x002A, i.e. the 31st-43rd bytes.
> If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C, i.e. the 44th-541st bytes. It is terminated with [00][01][C8].
> The goal is to extract the user-entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.
> The prefix is always marked by the presence of six 3's and [E3][3F], followed by 10 variable bytes, then a byte giving the length of the user input text + 1, and a null.
> GUIDE 
> [3#][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]
> EXAMPLE
> [33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
> Any pointers on how to improve the code are appreciated.
>  


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064134#comment-13064134 ] 

Troy Witthoeft commented on TIKA-679:
-------------------------------------

The garbage text extracted from the blackfeather example is a result of lax
byte matching rules in handleViewName.
Adjusting the code to byte match on E3 3F plus your last5 is33 check
produces perfect output.
We may need to discuss which information inside a PRT file is needed.

Certain drawings, like the blackfeather example, are a parent drawing
composed of smaller child drawings.
These child drawings include note entities that are absorbed but not
displayed in the parent.  These "remnant" note entities are not accessible,
and they are not editable.
You will notice that PRTParser extracts the child note entities.  The
blackfeather example produces multiple dates, while the only date visible
to a CAD engineer would be 4/16/02.

Commercial extractors do not behave this way.  They simply treat the
drawings like an OCR'd document, and only extract the visible text.  I'm
confident there is a byte pattern to distinguish the two.
I cannot think of a feasible use for ViewNames or hidden child note
entities.  However, it seems wasteful to remove that functionality, since we
have already created a way to extract them.
Should we break out view names and child note text entries as separate
metadata?

Any input is appreciated.

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Cp437 turn it into a string, but does not remove null termination, assumes it's found to be MS-DOS Encoding
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058065#comment-13058065 ] 

Troy Witthoeft edited comment on TIKA-679 at 6/30/11 9:00 PM:
--------------------------------------------------------------

Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø]

The biggest difficulty in this code is selecting the correct prefix to search for.
The prefix DOES have a pattern.  For instance, note the green byte that always represents the length of the text.

PREFIX ROUGH GUIDE
[3#][33][33][33][33][33]{color:blue}[E3][3F]{color}[0#][00][00][0#][00][00][0#][0#][0#][1F]{color:green}[LN]{color}[00]{color:red}[USERINPUT TEXT]{color}[00][xx]

EXAMPLE
[33][33][33][33][33][33]{color:blue}[E3][3F]{color}[00][00][00][00][00][00][00][02][01][1F]{color:green}[05]{color}[00]{color:red}[54][49][4B][41]{color}[00][0B] .... {color:red}TIKA{color}
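
The rough guide above can be turned into a stricter matcher than a bare [E3][3F] search. The sketch below is one possible reading of it, not the posted parser: it works on a byte array already in memory, requires five [33] bytes directly before [E3][3F], assumes exactly 10 variable bytes before the length byte (the posted code skips 11, so this offset may need adjusting against real files), and decodes as ASCII only to keep the demo small. The names PrtNoteScanner and extractNotes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PrtNoteScanner {
    private static final int RUN_OF_33 = 5;       // assumption: 0x33 bytes required before the marker
    private static final int VARIABLE_BYTES = 10; // per the rough guide, between [E3][3F] and [LN]

    static List<String> extractNotes(byte[] data) {
        List<String> notes = new ArrayList<>();
        for (int i = RUN_OF_33; i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) != 0xE3 || (data[i + 1] & 0xFF) != 0x3F) {
                continue;
            }
            if (!precededBy33(data, i)) {
                continue; // bare E3 3F: likely garbage, skip it
            }
            int lenPos = i + 2 + VARIABLE_BYTES;
            if (lenPos + 1 >= data.length) {
                break; // not enough bytes left for a length byte and its null
            }
            int len = data[lenPos] & 0xFF; // text length + 1 (trailing null)
            int textStart = lenPos + 2;    // skip [LN] and the [00] after it
            if (len == 0 || data[lenPos + 1] != 0 || textStart + len > data.length) {
                continue;
            }
            if (data[textStart + len - 1] != 0) {
                continue; // the guide says the text ends with [00]
            }
            notes.add(new String(data, textStart, len - 1, StandardCharsets.US_ASCII));
            i = textStart + len - 1; // resume scanning after this entry
        }
        return notes;
    }

    private static boolean precededBy33(byte[] data, int markerPos) {
        for (int k = 1; k <= RUN_OF_33; k++) {
            if (data[markerPos - k] != 0x33) {
                return false;
            }
        }
        return true;
    }
}
```

Checking the null after the length byte and the null after the text costs little and should cut down on false matches, at the price of missing entries whose layout differs from the guide.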


Once we narrow down text detection, we can move on to extracting the date and file description.

NOTE: Magic Mime type on this file <match value="0M3C" type="string" offset="8" />
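
For the header fields, a rough sketch using the offsets reported in the issue description (magic "0M3C" at offset 8, date at 0x1E in YYYYMMDDhhmm form, description from 0x2B with at most 498 characters). The class and method names are hypothetical. Stopping the description at the first null byte is an assumption based on the reported [00][01][C8] terminator, and ASCII decoding is used only for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrtHeader {
    static final int MAGIC_OFFSET = 8;
    static final byte[] MAGIC = "0M3C".getBytes(StandardCharsets.US_ASCII);
    static final int DATE_OFFSET = 0x1E;
    static final int DATE_LENGTH = 12; // YYYYMMDDhhmm
    static final int DESC_OFFSET = 0x2B;
    static final int DESC_MAX = 498;

    /** True if the four magic bytes are present at offset 8. */
    static boolean hasMagic(byte[] file) {
        if (file.length < MAGIC_OFFSET + MAGIC.length) {
            return false;
        }
        byte[] found = Arrays.copyOfRange(file, MAGIC_OFFSET, MAGIC_OFFSET + MAGIC.length);
        return Arrays.equals(found, MAGIC);
    }

    /** Returns the raw 12-digit date string, or null if the field is absent or not numeric. */
    static String creationDate(byte[] file) {
        if (file.length < DATE_OFFSET + DATE_LENGTH) {
            return null;
        }
        String s = new String(file, DATE_OFFSET, DATE_LENGTH, StandardCharsets.US_ASCII);
        return s.matches("\\d{12}") ? s : null;
    }

    /** Returns the description up to the first null byte, or "" if there is none. */
    static String description(byte[] file) {
        if (file.length <= DESC_OFFSET) {
            return "";
        }
        int end = Math.min(file.length, DESC_OFFSET + DESC_MAX);
        int i = DESC_OFFSET;
        while (i < end && file[i] != 0) {
            i++;
        }
        return new String(file, DESC_OFFSET, i - DESC_OFFSET, StandardCharsets.US_ASCII);
    }
}
```

Returning null for a missing date keeps "field not present" distinct from a malformed value, which matters because both the date and the description are described as optional.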


      was (Author: runamok81):
    The biggest difficulty in this code is selecting the correct prefix to search for.
Different methods, and modifiers of text create different prefixes, but there is a pattern.

PREFIX ROUGH GUIDE
[3#]{color:blue}[33][33][33][33][33][E3][3F]{color}[0#][00][00][0#][00][00][0#][0#][0#][1F]{color:blue}[LN]{color}[00]{color:red}[USERINPUT TEXT]{color}[00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA



I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
Here are my findings.

The file header contains, a magic mime type, file creation date, and file description.
The magic mime type can be identified with  <match value="0M3C" type="string" offset="8" />
If present, the file creation date is after the identifier. It is in format YYYYMMDDhhmm. It is always in the same address, 0x001Eh-0x002Ah OR the 31st-43rd bytes.  
If present, the user entered file description IMMEDIATELY follows date. Max chars is 498. It is always at the same address, 0x002Bh-0x021Ch OR the 43rd-540th bytes. Terminated with [00][01][C8]

The goal is to extract the user entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.

The prefix is always marked by the presence of six 3's and [E3][3F], that is followed by 10 variable bytes, then a byte signifying the length of the user input text + 1, and a null.

GUIDE 
[3#][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA

Any pointers on how to improve the code is appreciated.




Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø] 

  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Comment: was deleted

(was: This file has more advanced text entries.)

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: PRTParser.patch
                PRTParserTest.patch

The attached patches strip out view name parsing and its associated tests.

This makes the parser output more stable.




> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, PRTParser.patch, PRTParserTest.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Cp437 turn it into a string, but does not remove null termination, assumes it's found to be MS-DOS Encoding
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059639#comment-13059639 ] 

Troy Witthoeft commented on TIKA-679:
-------------------------------------

I've narrowed the encoding down to CP437.
CP437 correctly identifies many of the engineering symbols, such as [±] "plus minus" and [º] "degree," but fails on "diameter."
PRT files actually store the diameter symbol as three characters, with the second one always being [φ] "lowercase phi."
While not identical, the Nordic [Ø] "O with slash" is often accepted as the diameter symbol.

You may find a more elegant solution by looking at [http://en.wikipedia.org/wiki/Code_page_437]
For now, I've simply been substituting:

[code]
String str = new String(text, 0, text.length, "Cp437");
str = str.replace("\u03C6","\u00D8");
[/code]
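
For anyone who wants to try the substitution outside of Tika, here is a minimal, self-contained sketch. The class name Cp437DiameterSketch, the method decodeNote, and the sample bytes are made up for illustration and are not part of the patch. It also assumes the JRE ships the Cp437 (IBM437) charset, which full JDK installs normally include.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of PRTParser: decodes note bytes as CP437
// and swaps lowercase phi for "O with slash" as a stand-in diameter symbol.
class Cp437DiameterSketch {

    static String decodeNote(byte[] text) {
        // Assumption: the note bytes are CP437, as described in the comment above.
        String str = new String(text, Charset.forName("Cp437"));
        // U+03C6 is lowercase phi, U+00D8 is Nordic "O with slash".
        return str.replace('\u03C6', '\u00D8');
    }

    public static void main(String[] args) {
        // Sample bytes (invented): 0xF1 is plus-minus and 0xED is lowercase phi in CP437.
        byte[] sample = {(byte) 0xF1, '5', ' ', (byte) 0xED, '2'};
        System.out.println(decodeNote(sample));
    }
}
```

Passing a Charset object instead of a charset name avoids the checked UnsupportedEncodingException, but either form should decode the same way.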

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: PRTParser.patch

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Cp437 turn it into a string, but does not remove null termination, assumes it's found to be MS-DOS Encoding
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: TikaTest.prt

Can you find the text inside?

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>    Affects Versions: 0.9
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * It also currently sets some dummy metadata.
>  */
> public class PRTParser implements Parser {
>         private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>         public static final String PRT_MIME_TYPE = "application/prt";
>         
> 		
>         public Set<MediaType> getSupportedTypes(ParseContext context) {
>                 return SUPPORTED_TYPES;
>         }
> 		
>         public void parse(
>                         InputStream stream, ContentHandler handler,
>                         Metadata metadata, ParseContext context)
>                         throws IOException, SAXException, TikaException {
> 						byte[] prefix = new byte[] {0x01, 0x1F};  //
> 						int pos = 0;
> 						int read;
> 						while( (read = stream.read()) > -1) {   //  Reads the next byte (returned as an int from 0 to 255) until EOF, which is signalled by -1
> 						  if(read == prefix[pos]) {				//  If the byte just read matches the prefix byte at position pos
> 							pos++;
> 							if(pos == prefix.length) {
> 							  // found it!
> 							  int length = stream.read();
> 							  int unknown = stream.read();
> 							  byte[] text = new byte[length];
> 							  IOUtils.readFully(stream, text);      //reads a selected byte array from the InputStream
> 							  // turn it into a string, removing null termination
> 							  // assumes it's found to be utf-8
> 							  String str = new String(text, 0, text.length, "UTF-8");
> 							  XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 							  xhtml.startElement("p");	
> 							  xhtml.characters(str);
> 							  xhtml.endElement("p");
> 							pos--;  
> 							}
> 						  } else {
> 							pos = 0;
> 						  }
> 						}
>         }
>         /**
>          * @deprecated This method will be removed in Apache Tika 1.0.
>          */
>         public void parse(
>                         InputStream stream, ContentHandler handler, Metadata metadata)
>                         throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>         }
> }
> {code}   
> I am looking for assistance in improving this code. I am in the process of picking apart the PRT file structure.
> Here are my findings.
> The file header contains a magic MIME type, the file creation date, and a file description.
> The magic MIME type can be identified with <match value="0M3C" type="string" offset="8" />
> If present, the file creation date follows the identifier, in the format YYYYMMDDhhmm. It is always at the same address, 0x001Eh-0x002Ah, or the 31st-43rd bytes.
> If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002Bh-0x021Ch, or the 43rd-540th bytes. It is terminated with [00][01][C8].
> The goal is to extract the user-entered text. User text is marked by a prefix of 42 bytes. The newest entries are at the top of the file.
> The prefix is always marked by the presence of six [33] bytes and [E3][3F], followed by 10 variable bytes, then a byte signifying the length of the user input text + 1, and a null.
> GUIDE 
> [33][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]
> EXAMPLE
> [33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
> Any pointers on how to improve the code are appreciated.
>  
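
The GUIDE and EXAMPLE layout quoted above can be checked with a small standalone sketch. This is only an illustration under stated assumptions: the class PrtRecordSketch and its methods are hypothetical, it works on an in-memory byte array rather than the InputStream the parser uses, it follows the 10-variable-byte layout from the GUIDE rather than the skip counts in any posted parser version, and it decodes the text as US-ASCII (later comments on the issue settle on CP437 instead).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the record layout in the GUIDE; not the actual parser.
class PrtRecordSketch {
    // Marker from the GUIDE: six 0x33 bytes followed by 0xE3 0x3F.
    static final int[] MARKER = {0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0xE3, 0x3F};

    // True if the ASCII string "0M3C" sits at offset 8, as in the magic rule.
    static boolean hasMagic(byte[] data) {
        byte[] magic = "0M3C".getBytes(StandardCharsets.US_ASCII);
        if (data.length < 8 + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[8 + i] != magic[i]) return false;
        }
        return true;
    }

    // True if the full marker starts at the given offset (caller checks bounds).
    static boolean markerAt(byte[] data, int offset) {
        for (int i = 0; i < MARKER.length; i++) {
            if ((data[offset + i] & 0xFF) != MARKER[i]) return false;
        }
        return true;
    }

    // Returns the text of the first record found, or null if there is none.
    static String firstNote(byte[] data) {
        for (int i = 0; i + MARKER.length <= data.length; i++) {
            if (!markerAt(data, i)) continue;
            int lenPos = i + MARKER.length + 10;      // 10 variable bytes follow the marker
            if (lenPos + 1 >= data.length) return null;
            int len = (data[lenPos] & 0xFF) - 1;      // length byte is text length + 1
            int start = lenPos + 2;                   // skip the length byte and the [00] after it
            if (len < 0 || start + len > data.length) return null;
            return new String(data, start, len, StandardCharsets.US_ASCII);
        }
        return null;
    }

    public static void main(String[] args) {
        // The EXAMPLE bytes from the issue description.
        byte[] example = {
            0x33, 0x33, 0x33, 0x33, 0x33, 0x33, (byte) 0xE3, 0x3F,
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01, 0x1F,
            0x05, 0x00, 0x54, 0x49, 0x4B, 0x41, 0x00, 0x0B
        };
        System.out.println(firstNote(example));
    }
}
```

Masking each byte with 0xFF matters here because Java bytes are signed, so 0xE3 would otherwise compare as a negative number.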


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059640#comment-13059640 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/5/11 6:39 PM:
-------------------------------------------------------------

Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
I checked out the latest SVN (1143118). Unfortunately, I was unable to get the tika-app-1.0-SNAPSHOT GUI to extract text from prt files.
I suppose the new rules may be a little too strict. Can you test with my new file?
I have added a new, more complex TikaTest2.prt.
I also added a text file that contains all the expected text inside TikaTest2.prt.

  

      was (Author: runamok81):
    Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt



  
  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: TikaTest2.prt.txt

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 			pos++;													
> 				if(pos == prefix.length) {								
> 					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
> 					stream.skip(1);											
> 					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 					IOUtils.readFully(stream, text);						
> 					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059640#comment-13059640 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/5/11 9:06 PM:
-------------------------------------------------------------

Nick,

The new PRTParser.java works very well! 
I'm having a hard time finding example prt files in my collection, which spans various industries, that stump the parser, so I had to go outside my collection.
I did find one file that fails.  I found it through a Google search of "filetype:prt 0M3C"
[http://www.blackfeathermedia.com/etcher/Appendix_A_-_Technical_Drawings/EscherSketcher_Rev05.prt]

I think the next step is setting the metadata for the file description.
The file description is user-editable.  It's a maximum of 500 characters stored in the bytes that immediately follow the creation date.  


  

      was (Author: runamok81):
    Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
Can you test with my new file, TikaTest2.prt?
I also added a text file that contains all the expected text inside TikaTest2.prt

You can find more examples with a google search "filetype:prt 0M3C"

  
  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Decode as Cp437 (the MS-DOS code page); this does not strip the null terminator.
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  
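
The decoding step in the quoted parser (Cp437 bytes to a string, then swapping the lowercase phi for the diameter sign) can be tried in isolation. The following is only a sketch: the class and method names are hypothetical, and it assumes the JRE ships the IBM437 charset, for which "Cp437" is an alias in standard JDK builds.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika.
class Cp437DiameterSketch {

    // Decode raw note bytes as Cp437, then replace the lowercase phi
    // (U+03C6) with "O with slash" (U+00D8), as the quoted parser does.
    static String decodeNote(byte[] raw) {
        String text = new String(raw, Charset.forName("IBM437"));
        return text.replace('\u03C6', '\u00D8');
    }

    public static void main(String[] args) {
        // 0xED is the Cp437 byte for the lowercase phi.
        byte[] raw = {(byte) 0xED, '2', '5'};
        System.out.println(decodeNote(raw));
    }
}
```

The substitution is applied to every phi in the note, so a note that genuinely contains the Greek letter would also be rewritten.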


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059640#comment-13059640 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/5/11 8:08 PM:
-------------------------------------------------------------

Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
Can you test with my new file, TikaTest2.prt?
I also added a text file that contains all the expected text inside TikaTest2.prt

You can find more examples with a google search "filetype:prt 0M3C"

  

      was (Author: runamok81):
    Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
I checked out the latest SVN (1143118).  Unfortunately, I was unable to get the tika-app-1.0-SNAPSHOT gui to extract text from prt files.
I suppose the new rules may be a little strict.  Can you test with my new file, TikaTest2.prt?
I also added a text file that contains all the expected text inside TikaTest2.prt

You can find more examples with a google search "filetype:prt 0M3C"

  
  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Description: 
It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.
Any assistance further developing this code is appreciated.


{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
 * Searches for specific byte prefix, and outputs text from note entities
 * Does not support special DRAFT-PAK characters.
 */

public class PRTParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
    public static final String PRT_MIME_TYPE = "application/prt";
        	
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
        }
		
    public void parse(
		InputStream stream, ContentHandler handler,
		Metadata metadata, ParseContext context)
		throws IOException, SAXException, TikaException {
		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
		int pos = 0;											
		int read;
		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
			pos++;													
				if(pos == prefix.length) {								
					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
					stream.skip(1);											
					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
					IOUtils.readFully(stream, text);						
					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
					xhtml.startElement("p");	
					xhtml.characters(str);
					xhtml.endElement("p");
					pos--;  
				}
			} 
			else {
				//Did not find the prefix. Reset the position counter.
				pos = 0;
			}
		}
	}


		
	/**
    * @deprecated This method will be removed in Apache Tika 1.0.
    */
    public void parse(
                   InputStream stream, ContentHandler handler, Metadata metadata)
                   throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
    }
}{code}   

 







  was:
It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.

{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
 * Searches for specific byte prefix, and outputs text from note entities
 * Does not support special DRAFT-PAK characters.
 */

public class PRTParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
    public static final String PRT_MIME_TYPE = "application/prt";
        	
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
        }
		
    public void parse(
		InputStream stream, ContentHandler handler,
		Metadata metadata, ParseContext context)
		throws IOException, SAXException, TikaException {
		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
		int pos = 0;											
		int read;
		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
			pos++;													
				if(pos == prefix.length) {								
					stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
					stream.skip(1);											
					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
					IOUtils.readFully(stream, text);						
					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
					xhtml.startElement("p");	
					xhtml.characters(str);
					xhtml.endElement("p");
					pos--;  
				}
			} 
			else {
				//Did not find the prefix. Reset the position counter.
				pos = 0;
			}
		}
	}


		
	/**
    * @deprecated This method will be removed in Apache Tika 1.0.
    */
    public void parse(
                   InputStream stream, ContentHandler handler, Metadata metadata)
                   throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
    }
}{code}   

I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
Here are my findings.

The file header contains a magic identifier, the file creation date, and the file description.
The file type can be identified with the MIME magic rule <match value="0M3C" type="string" offset="8" />
If present, the file creation date follows the identifier. It is in the format YYYYMMDDhhmm and always occupies the same address range, 0x001E-0x002A (the 31st-43rd bytes).
If present, the user-entered file description IMMEDIATELY follows the date. It holds at most 498 characters and always occupies the same address range, 0x002B-0x021C (the 44th-541st bytes). It is terminated with [00][01][C8].
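
The header layout described above can be sketched as a few fixed-offset reads. This is a rough illustration under stated assumptions, not a tested reader: the class and method names are hypothetical, the text is decoded as US-ASCII for simplicity, the date is taken to be the 12 characters starting at 0x1E, and the description is cut at the first null byte rather than at the full [00][01][C8] terminator.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; offsets come from the findings above.
class PrtHeaderSketch {
    static final String MAGIC = "0M3C";
    static final int MAGIC_OFFSET = 8;      // from the <match> rule
    static final int DATE_OFFSET = 0x1E;    // YYYYMMDDhhmm starts here
    static final int DATE_LENGTH = 12;      // assumption: 12 digits
    static final int DESC_OFFSET = 0x2B;    // description follows the date
    static final int DESC_MAX = 498;        // maximum description length

    static boolean hasMagic(byte[] file) {
        if (file.length < MAGIC_OFFSET + MAGIC.length()) return false;
        String found = new String(file, MAGIC_OFFSET, MAGIC.length(), StandardCharsets.US_ASCII);
        return MAGIC.equals(found);
    }

    // Returns the date digits, or "" when the field is absent or not numeric.
    static String creationDate(byte[] file) {
        if (file.length < DATE_OFFSET + DATE_LENGTH) return "";
        String date = new String(file, DATE_OFFSET, DATE_LENGTH, StandardCharsets.US_ASCII);
        return date.matches("[0-9]{12}") ? date : "";
    }

    // Reads up to DESC_MAX bytes, stopping at the first null (a simplification).
    static String description(byte[] file) {
        int limit = Math.min(file.length, DESC_OFFSET + DESC_MAX);
        int end = DESC_OFFSET;
        while (end < limit && file[end] != 0) end++;
        if (end <= DESC_OFFSET) return "";
        return new String(file, DESC_OFFSET, end - DESC_OFFSET, StandardCharsets.US_ASCII);
    }

    // Builds a synthetic header for demonstration; not a real prt file.
    static byte[] sampleHeader() {
        byte[] file = new byte[DESC_OFFSET + DESC_MAX];
        put(file, MAGIC_OFFSET, MAGIC);
        put(file, DATE_OFFSET, "201107052106");
        put(file, DESC_OFFSET, "Bracket assembly");
        return file;
    }

    static void put(byte[] target, int offset, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(bytes, 0, target, offset, bytes.length);
    }

    public static void main(String[] args) {
        byte[] file = sampleHeader();
        System.out.println(hasMagic(file));
        System.out.println(creationDate(file));
        System.out.println(description(file));
    }
}
```

Both the date and the description are treated as optional, matching the "if present" wording: each accessor returns an empty string when the field is missing.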

The goal is to extract the user entered text.  User text is marked by a prefix of 42 bytes.  Newest entries are at the top of the file.

The prefix is always marked by the presence of six 3's and [E3][3F]. That is followed by 10 variable bytes, then a byte signifying the length of the user input text + 1, and a null.

GUIDE 
[3#][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
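
As a rough check of the GUIDE line, the following sketch scans a byte stream for [E3][3F] and decodes one note per match. It is an illustration under stated assumptions, not the parser itself: the class and method names are hypothetical, it skips exactly the 10 variable bytes shown in the GUIDE (the parser posted in this issue skips 11), and it decodes the text as US-ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper following the GUIDE layout literally.
class PrtNoteSketch {
    static final int VARIABLE_BYTES = 10; // assumption taken from the GUIDE line

    static List<String> extractNotes(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> notes = new ArrayList<>();
        int prev = -1;
        int cur;
        while ((cur = data.read()) != -1) {
            if (prev == 0xE3 && cur == 0x3F) {
                try {
                    data.readFully(new byte[VARIABLE_BYTES]); // variable bytes
                    int length = data.readUnsignedByte();     // text length + 1
                    data.readUnsignedByte();                  // null after the length
                    byte[] text = new byte[length];
                    data.readFully(text);                     // text plus its terminator
                    int end = 0;
                    while (end < text.length && text[end] != 0) end++;
                    if (end > 0) {
                        notes.add(new String(text, 0, end, StandardCharsets.US_ASCII));
                    }
                } catch (EOFException truncated) {
                    break; // record ran past the end of the data
                }
                prev = -1; // start matching the prefix from scratch
            } else {
                prev = cur;
            }
        }
        return notes;
    }

    // The EXAMPLE bytes from the findings above.
    static byte[] exampleRecord() {
        int[] raw = {
            0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0xE3, 0x3F,
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01, 0x1F,
            0x05, 0x00, 0x54, 0x49, 0x4B, 0x41, 0x00, 0x0B
        };
        byte[] bytes = new byte[raw.length];
        for (int i = 0; i < raw.length; i++) bytes[i] = (byte) raw[i];
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractNotes(new ByteArrayInputStream(exampleRecord()))); // prints [TIKA]
    }
}
```

Matching only on [E3][3F] can produce false positives in binary data, so a stricter check on the preceding bytes may be needed for real files.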

Any pointers on how to improve the code are appreciated.
 








> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059601#comment-13059601 ] 

Nick Burch commented on TIKA-679:
---------------------------------

I've added the detector part in r1142795, thanks for the file and the match line

For the special characters you mention, which text entry in the file contains them? And where?

I'll start taking a look at the parser shortly

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059640#comment-13059640 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/5/11 7:22 PM:
-------------------------------------------------------------

Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
I checked out the latest SVN (1143118).  Unfortunately, I was unable to get the tika-app-1.0-SNAPSHOT gui to extract text from prt files.
I suppose the new rules may be a little strict.  Can you test with my new file, TikaTest2.prt?
I also added a text file that contains all the expected text inside TikaTest2.prt

You can find more examples with a google search "filetype:prt 0M3C"

  

      was (Author: runamok81):
    Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt

*Update:*  
I checked out the latest SVN (1143118).  Unfortunately, I was unable to get the tika-app-1.0-SNAPSHOT gui to extract text from prt files.
I suppose the new rules may be a little strict.  Can you test with my new file?
I have added a new more complex TikaTest2.prt
I also added a text file that contains all the expected text inside TikaTest2.prt

  
  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;											
> 		int read;
> 		while ((read = stream.read()) > -1) {						// read() returns the next byte as an int (0-255), or -1 at end of stream
> 			if (read == prefix[pos]) {								// does this byte match the next expected byte of the prefix?
> 				pos++;
> 				if (pos == prefix.length) {							// the whole prefix has been matched
> 					stream.skip(11);									// skip the 11 bytes after the prefix, which can vary
> 					int length = stream.read();							// the next byte gives the length of the text in the user input field, see PRT schema
> 					stream.skip(1);
> 					byte[] text = new byte[length];						// buffer for the raw bytes of the user-entered text
> 					IOUtils.readFully(stream, text);
> 					String str = new String(text, 0, text.length, "UTF-8");	// decode as UTF-8 (assumed); the null terminator is not removed
> 					xhtml.startElement("p");	
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos--;  
> 				}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061961#comment-13061961 ] 

Nick Burch commented on TIKA-679:
---------------------------------

Thanks, updated patch committed in r1144314 with a few tweaks.

The blackfeather file does parse; it's just so large that it hits the default text handler size limit. It does include quite a few bits of junk though, so it's possible the current matching rules are too lax. If you have time, please do look at some of the "junk" text we return from it and see what header bytes come first!

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// decode as Cp437 (MS-DOS encoding, assumed); the null terminator is not removed
> 						str = str.replace("\u03C6","\u00D8");					// Note: replace CP437's lowercase "phi" with Nordic "O with slash" to represent the diameter symbol.
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058065#comment-13058065 ] 

Troy Witthoeft edited comment on TIKA-679 at 6/30/11 8:51 PM:
--------------------------------------------------------------

The biggest difficulty in this code is selecting the correct prefix to search for.
Different text-entry methods and text modifiers produce different prefixes, but there is a pattern.

PREFIX ROUGH GUIDE
[3#]{color:blue}[33][33][33][33][33][E3][3F]{color}[0#][00][00][0#][00][00][0#][0#][0#][1F]{color:blue}[LN]{color}[00]{color:red}[USERINPUT TEXT]{color}[00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA



I am looking for assistance in improving this code.  I am in the process of picking apart the prt file structure.
Here are my findings.

The file header contains a magic identifier (usable for MIME detection), the file creation date, and a file description.
The magic identifier can be matched with  <match value="0M3C" type="string" offset="8" />
If present, the file creation date follows the identifier, in the format YYYYMMDDhhmm. It is always at the same address, 0x001E-0x002A, or the 31st-43rd bytes.
If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C, or the 43rd-540th bytes, and is terminated with [00][01][C8].
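
To make the header findings above concrete, here is a rough Java sketch of reading those three fields from a file already loaded into a byte array. The class and method names are made up for illustration and are not part of Tika. It also assumes the addresses are 0-based offsets (date at 0x1E for 12 characters, description starting at 0x2B and ending at the first null byte), which has not been confirmed against real files.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical helper, not part of Tika: reads the PRT header fields described above. */
final class PrtHeaderSketch {
    static final int MAGIC_OFFSET = 8;           // "0M3C", taken from the <match> rule
    static final int DATE_OFFSET = 0x1E;         // assumption: 0-based offset
    static final int DATE_LENGTH = 12;           // YYYYMMDDhhmm
    static final int DESCRIPTION_OFFSET = 0x2B;  // assumption: 0-based offset
    static final int DESCRIPTION_MAX = 498;      // maximum description length

    /** True if the bytes "0M3C" are present at offset 8. */
    static boolean hasMagic(byte[] file) {
        byte[] magic = "0M3C".getBytes(StandardCharsets.US_ASCII);
        if (file.length < MAGIC_OFFSET + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (file[MAGIC_OFFSET + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the creation date, or null if the field is absent or not a valid date. */
    static LocalDateTime creationDate(byte[] file) {
        if (file.length < DATE_OFFSET + DATE_LENGTH) {
            return null;
        }
        String raw = new String(file, DATE_OFFSET, DATE_LENGTH, StandardCharsets.US_ASCII);
        if (!raw.matches("\\d{12}")) {
            return null;
        }
        try {
            return LocalDateTime.parse(raw, DateTimeFormatter.ofPattern("yyyyMMddHHmm"));
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    /** Returns the description up to the first null byte, capped at 498 bytes. */
    static String description(byte[] file) {
        int limit = Math.min(file.length, DESCRIPTION_OFFSET + DESCRIPTION_MAX);
        int end = DESCRIPTION_OFFSET;
        while (end < limit && file[end] != 0) {
            end++;
        }
        if (end <= DESCRIPTION_OFFSET) {
            return "";
        }
        return new String(file, DESCRIPTION_OFFSET, end - DESCRIPTION_OFFSET,
                StandardCharsets.US_ASCII);
    }
}
```

If the addresses turn out to be 1-based, only the two offset constants would need to change.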

The goal is to extract the user-entered text.  User text is marked by a prefix of 42 bytes.  The newest entries are at the top of the file.

The prefix is always marked by the presence of six 3's and [E3][3F]. That is followed by 10 variable bytes, then a byte giving the length of the user input text + 1, and a null.

GUIDE 
[3#][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA
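
As a worked check of the GUIDE and EXAMPLE above, the following Java sketch walks a byte array using exactly that layout: the [E3][3F] marker, 10 further bytes, a length byte (text length + 1), a null, then the text. The class name is hypothetical, and decoding as US-ASCII is only an assumption that is good enough for the "TIKA" example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Tika: extracts note text using the GUIDE layout. */
final class PrtNoteSketch {
    static List<String> extractNotes(byte[] data) {
        List<String> notes = new ArrayList<>();
        int i = 0;
        while (i + 1 < data.length) {
            // Look for the two marker bytes E3 3F.
            if ((data[i] & 0xFF) != 0xE3 || (data[i + 1] & 0xFF) != 0x3F) {
                i++;
                continue;
            }
            int lengthPos = i + 2 + 10;          // 10 bytes sit between the marker and [LN]
            if (lengthPos + 1 >= data.length) {
                break;                           // not enough bytes left for any further entry
            }
            int ln = data[lengthPos] & 0xFF;     // text length + 1 (includes the trailing null)
            int textStart = lengthPos + 2;       // skip the [00] that follows [LN]
            boolean plausible = ln > 0
                    && data[lengthPos + 1] == 0
                    && textStart + ln <= data.length;
            if (!plausible) {
                i += 2;                          // treat as a false hit and keep scanning
                continue;
            }
            int textLen = ln;
            if (data[textStart + ln - 1] == 0) {
                textLen--;                       // drop the null terminator
            }
            notes.add(new String(data, textStart, textLen, StandardCharsets.US_ASCII));
            i = textStart + ln;                  // continue after this entry
        }
        return notes;
    }
}
```

Run against the EXAMPLE bytes, this yields the single string "TIKA". Note that it follows the GUIDE literally; the posted parser skips 11 bytes before reading the length, so one of the two may be off by one on real files.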

Any pointers on how to improve the code are appreciated.




Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø] 


      was (Author: runamok81):
    The biggest difficulty in this code is selecting the correct prefix.
Currently, the vague prefix allows this parser to get every instance of text, but it also picks up some garbage text.
Additionally, it cannot recognize special characters [±,°,Ø] 

  

[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064170#comment-13064170 ] 

Nick Burch commented on TIKA-679:
---------------------------------

> The blackfeather extracted garbage text is a result of lax byte matching rules on handleViewName.
> Adjusting the code to byte match on E3 3F plus your last5 is33 check produces perfect output.

Quite a few of the view names in your sample files had 5*00 before them, rather than 5*33. Which rule is the blackfeather one incorrectly triggering? And what bit in the text types comment have we got wrong?
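
For reference, the kind of check being discussed (E3 3F preceded by a run of five identical bytes) could be sketched as below. The class and method names are hypothetical and this is not the committed parser code. The fill value is a parameter so that both the 5*33 and the 5*00 cases can be tried against the sample files.

```java
/** Hypothetical helper, not part of Tika: tests for a marker preceded by a run of fill bytes. */
final class PrtMarkerSketch {
    /**
     * Returns true if data[markerPos] and data[markerPos + 1] are E3 3F and the
     * `run` bytes immediately before markerPos all equal `fill` (e.g. 0x33 or 0x00).
     */
    static boolean isMarkerAfterRun(byte[] data, int markerPos, int run, int fill) {
        if (markerPos < run || markerPos + 1 >= data.length) {
            return false;                        // not enough bytes before or after
        }
        if ((data[markerPos] & 0xFF) != 0xE3 || (data[markerPos + 1] & 0xFF) != 0x3F) {
            return false;                        // marker bytes do not match
        }
        for (int i = markerPos - run; i < markerPos; i++) {
            if ((data[i] & 0xFF) != fill) {
                return false;                    // run is broken by a different byte
            }
        }
        return true;
    }
}
```

Calling it with `run = 5` and `fill = 0x33` corresponds to the stricter rule from the quoted comment, while `fill = 0x00` covers the view names that had zeros in front of them.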

For the view names vs text, maybe we should put different classes on them or something like that?

In terms of the hidden text, I suspect we'll need to understand the file format better first! The initial trick would probably be to look at your sample files, and see if we can figure out the rules for getting from the start of the file to the first view entry. Do we always need to skip a fixed distance? Can we start somewhere and read some IDs then some length, skip and find the next IDs then length etc? Is there something that tells us if the text will be "f0 3b 8*00 sz sz text" or "f0 3b sz sz text"? In your first sample file, why do most of the views have zeros before their f0 3f/bf, while the Isometric one has data right up to it?

Once we can answer at least some of those, we'll be further along the way to cracking the format's structure!


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065559#comment-13065559 ] 

Troy Witthoeft commented on TIKA-679:
-------------------------------------


>Quite a few of the view names in your sample files had 5*00 before them, rather than 5*33. Which rule is the blackfeather one incorrectly triggering? And what bit in the text types comment have we got wrong?  For the view names vs text, maybe we should put different classes on them or something like that?

I'm not suggesting any changes to the byte rules on view detection (yet). But I just don't see the payoff in detecting view names. It's tough to pin down the prefixes on the last views. I think simply detecting note text, date, and description is sufficient.
  
>I suspect we'll need to understand the file format better first!

I've compared my collection of files using www.fairdell.com.
The first 8,000 bytes of PRT files are fairly static; the bytes don't vary dramatically.
Specifically, the first 8 views are evenly spaced and always in the same location in every PRT file.

>see if we can figure out the rules for getting from the start of the file to the first view entry.

For instance, the text for the first view, "Top View," is always at byte 7031 (0x1B77).

> Once we can answer at least some of those, we'll be more along the way of cracking the format's structure!

The Top View is followed by the Front, Back, Bottom, Right, Left, Isometric, and Axonometric views, which are always in the same locations.
No need for byte pattern matching; we know exactly where they are.
Next, as we approach the 8 KB mark, System View may be present multiple times, or not at all.  That is where it starts to get dicey.
For instance, the ultra-complex blackfeather file has 240 system views with different prefixes!
But we can determine where the System View section ends, because it is always followed by what I call the "ascending byte matrix marker".
It always starts with the same few bytes:

[FF FF 10 00 00 00 11 .. .. .. 12 .. .. .. .. 13 ..]
[.. .. 14 .. .. .. 15 .. .. .. 16 .. .. .. .. 17 ..]
[.. .. 18 .. .. .. 19 .. .. .. 1A .. .. .. 1B .. ]

Immediately following the "ascending byte matrix" is a section dedicated to whichever CAD programs have touched the file.
You should always find the CADKEY identifier "CK_TTFTABLE"
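
A rough Java sketch of how those two landmarks could be located in a file held in memory. Names are hypothetical and not part of Tika. It assumes the first seven bytes of the marker shown above (FF FF 10 00 00 00 11) are enough to identify it, which would need checking against more files.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: finds the section landmarks described above. */
final class PrtSectionSketch {
    // Assumption: the leading bytes of the "ascending byte matrix marker".
    static final byte[] MATRIX_START =
            {(byte) 0xFF, (byte) 0xFF, 0x10, 0x00, 0x00, 0x00, 0x11};
    // The CADKEY identifier, as plain ASCII.
    static final byte[] CADKEY_ID = "CK_TTFTABLE".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;                        // every byte of needle matched at i
            }
        }
        return -1;
    }
}
```

Searching for `MATRIX_START` first and then for `CADKEY_ID` from that position onwards would give an upper bound for the System View section, if the layout described here holds.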

Does this help?
 


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059623#comment-13059623 ] 

Nick Burch commented on TIKA-679:
---------------------------------

I've committed a first stab at a PRT parser in r1142817, inspired by your work. It's able to get most view names, and all the text.

My hunch is that the file is record based, with e0/e2/e3/f0 followed by 3f/bf being the type marker. This then seems to be followed by one of: the size; 8 mostly-zero bytes and then the size; or 8 mostly-zero bytes plus another type and size. I can't guess enough to figure it out though, so I've gone for a largely brute-force pattern matching approach instead.
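
To illustrate the hunch only (this is not the code committed in r1142817), a candidate type marker under that reading is a byte from e0/e2/e3/f0 followed by 3f or bf. The class name below is made up, and the byte sets come straight from the paragraph above.

```java
/** Hypothetical helper, not part of Tika: tests two bytes against the guessed type markers. */
final class PrtTypeMarkerSketch {
    /** True if (first, second) looks like a type marker: e0/e2/e3/f0 followed by 3f/bf. */
    static boolean isTypeMarker(int first, int second) {
        boolean firstOk = first == 0xE0 || first == 0xE2 || first == 0xE3 || first == 0xF0;
        boolean secondOk = second == 0x3F || second == 0xBF;
        return firstOk && secondOk;
    }
}
```

When the bytes come from a `byte[]`, they need masking with `& 0xFF` before being passed in, since Java bytes are signed.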

Can you try with all your files, and report back if the matching rules are too strict / not strict enough?


[jira] [Issue Comment Edited] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059639#comment-13059639 ] 

Troy Witthoeft edited comment on TIKA-679 at 7/4/11 11:49 PM:
--------------------------------------------------------------

I've narrowed the encoding down to CP437.
CP437 correctly decodes many of the engineering symbols, such as [±] "plus minus" and [º] "degree," but fails on "diameter."
PRT files actually store the diameter symbol as three characters, with the second one always being [φ] "lowercase phi."
While not identical, the Nordic [Ø] "O with slash" is often accepted as the diameter symbol.

You may find a more elegant solution by looking at [http://en.wikipedia.org/wiki/Code_page_437].
I've simply been substituting:


String str = new String(text, 0, text.length, "Cp437");
str = str.replace("\u03C6","\u00D8");
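
As a rough illustration of the substitution above, here is a small self-contained sketch that decodes a few hand-picked CP437 bytes and swaps the lowercase phi for "O with slash." The class name Cp437NoteText and the sample bytes are assumptions made up for this example; they are not taken from a real PRT file or from the patch.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Tika.
final class Cp437NoteText {
    // "IBM437" is Java's canonical name for code page 437. Charset names are
    // case-insensitive, and "Cp437" as used above refers to the same code page.
    private static final Charset CP437 = Charset.forName("IBM437");

    // Decodes raw note bytes as CP437, then replaces lowercase phi (U+03C6)
    // with O-with-slash (U+00D8) as a stand-in for the diameter symbol.
    static String decode(byte[] raw) {
        String text = new String(raw, CP437);
        return text.replace('\u03C6', '\u00D8');
    }

    public static void main(String[] args) {
        // In CP437, 0xED is phi, 0xF1 is plus-minus and 0xF8 is the degree sign.
        byte[] raw = {(byte) 0xED, '2', '5', ' ', (byte) 0xF1, '0', '.', '1', (byte) 0xF8};
        System.out.println(decode(raw));
    }
}
```

Unicode also defines a dedicated DIAMETER SIGN (U+2300), which could be used as the replacement character instead; whether downstream consumers render it well would need checking.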


  
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: PRTParser.patch

Added support for correct encoding, special character recognition, and description metadata.  Please see patch file.
  

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
> 		int pos = 0;										//position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
> 			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
> 				pos++;													
> 					if(pos == prefix.length) {								//Are we at the last position of the prefix?
> 						stream.skip(11);										// skip the 11 bytes of the prefix which can vary.
> 						int lengthbyte = stream.read();							// Set the next byte equal to the length of text in the user input field, see PRT schema
> 						stream.skip(1);											
> 						byte[] text = new byte[lengthbyte];						// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
> 						IOUtils.readFully(stream, text);						
> 						String str = new String(text, 0, text.length, "Cp437");	// Cp437 turn it into a string, but does not remove null termination, assumes it's found to be MS-DOS Encoding
> 						str = str.replace("\u03C6","\u00D8");					// Note: Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent diameter symbol. 
> 						metadata.add("Content",str);
> 						xhtml.startElement("p");	
> 						xhtml.characters(str);
> 						xhtml.endElement("p");
> 						pos = 0; 							
> 					}
> 			} 
> 			else {
> 				//Did not find the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  


[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059640#comment-13059640 ] 

Troy Witthoeft commented on TIKA-679:
-------------------------------------

Nick,

Tomorrow, I will test your code with a large set of files, and report back.
I will also see if I can get an engineer buddy to create a more thorough TikaTest.prt.



  

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Description: 
It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.
Any assistance further developing this code is appreciated.


{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
 * Searches for specific byte prefix, and outputs text from note entities.
 * Does not support special characters.
 */
 

public class PRTParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
    public static final String PRT_MIME_TYPE = "application/prt";
        	
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
        }
		
    public void parse(
		InputStream stream, ContentHandler handler,
		Metadata metadata, ParseContext context)
		throws IOException, SAXException, TikaException {
		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
		
		int[] prefix = new int[] {227, 63};	// prefix bytes to look for: {0xE3, 0x3F}
		int pos = 0;						// position inside the prefix
		int read;
		// stream.read() returns the next byte as an int (0-255), or -1 at end of file
		while ((read = stream.read()) > -1) {
			if (read == prefix[pos]) {		// does this byte match the current byte of the prefix?
				pos++;
				if (pos == prefix.length) {	// the whole prefix has been matched
					stream.skip(11);		// skip the 11 bytes after the prefix, which can vary
					int lengthbyte = stream.read();	// the next byte is the length of the user-entered text, see PRT schema
					if (lengthbyte < 0) {
						break;				// file ended inside a record; avoid a negative array size
					}
					stream.skip(1);
					byte[] text = new byte[lengthbyte];	// raw bytes of the user-entered text
					IOUtils.readFully(stream, text);
					// Decode as Cp437 (MS-DOS encoding); null termination is not removed
					String str = new String(text, 0, text.length, "Cp437");
					// Substitute Cp437's lowercase "phi" with Nordic "O with slash" to represent the diameter symbol
					str = str.replace("\u03C6","\u00D8");
					metadata.add("Content",str);
					xhtml.startElement("p");
					xhtml.characters(str);
					xhtml.endElement("p");
					pos = 0;
				}
			}
			else {
				// Did not find the prefix. Reset the position counter.
				pos = 0;
			}
		}
	//Reached the end of file
	//System.out.println("Finished searching the file");	
	}


		
	/**
    * @deprecated This method will be removed in Apache Tika 1.0.
    */
    public void parse(
                   InputStream stream, ContentHandler handler, Metadata metadata)
                   throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
    }
}{code}   
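
The scanning approach in the description above can also be sketched without the Tika and POI dependencies, which makes it easier to try against sample bytes. The following is only a sketch under stated assumptions, not the attached patch: the class name PrtNoteScanner and its methods are hypothetical, the record layout (two prefix bytes, 11 variable bytes, one length byte, one skipped byte, then the text) is taken from the comments in the code above, and trimming at the first NUL is an assumption based on the remark that null termination is not removed. It differs from the posted code in three ways: a mismatching byte that equals the first prefix byte restarts the match (so E3 E3 3F is still found), short reads end the scan instead of being silently ignored, and text is cut at the first NUL.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or of the patch.
final class PrtNoteScanner {
    private static final int[] PREFIX = {0xE3, 0x3F};   // prefix bytes from the issue
    private static final int VARIABLE_BYTES = 11;       // bytes after the prefix that can vary
    private static final Charset CP437 = Charset.forName("IBM437");

    static List<String> extractNotes(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> notes = new ArrayList<>();
        int pos = 0;
        int read;
        while ((read = data.read()) != -1) {
            if (read != PREFIX[pos]) {
                // Mismatch: this byte may still begin a new prefix (e.g. E3 E3 3F).
                pos = (read == PREFIX[0]) ? 1 : 0;
                continue;
            }
            if (++pos < PREFIX.length) {
                continue;                                // prefix only partly matched so far
            }
            pos = 0;
            try {
                data.readFully(new byte[VARIABLE_BYTES]); // throws EOFException on a short read
                int length = data.readUnsignedByte();     // length of the text that follows
                data.readUnsignedByte();                  // one more byte skipped before the text
                byte[] text = new byte[length];
                data.readFully(text);
                String note = new String(text, CP437);
                int nul = note.indexOf('\0');             // assumption: text may be NUL-terminated
                if (nul >= 0) {
                    note = note.substring(0, nul);
                }
                note = note.replace('\u03C6', '\u00D8');  // phi stands in for the diameter symbol
                if (!note.isEmpty()) {
                    notes.add(note);
                }
            } catch (EOFException truncated) {
                break;                                    // file ended inside a record
            }
        }
        return notes;
    }

    // Convenience overload for in-memory data.
    static List<String> extractNotes(byte[] bytes) {
        try {
            return extractNotes(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = {
            (byte) 0xE3, 0x3F,                   // prefix
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,     // 11 variable bytes
            3, 0,                                // length byte, then one skipped byte
            (byte) 0xED, '1', '0'                // CP437 text: phi, '1', '0'
        };
        System.out.println(extractNotes(sample));
    }
}
```

Inside a real Tika parser, each extracted string would presumably be written out with xhtml.startElement("p"), xhtml.characters(...) and xhtml.endElement("p") as in the code above.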

 







  was:
It would be nice if Tika had support for prt CAD files.
A preliminary prt text extractor has been created.
Any assistance further developing this code is appreciated.


{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
 * Searches for specific byte prefix, and outputs text from note entities
 * Does not support special DRAFT-PAK characters.
 */

public class PRTParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
    public static final String PRT_MIME_TYPE = "application/prt";
        	
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
        }
		
    public void parse(
		InputStream stream, ContentHandler handler,
		Metadata metadata, ParseContext context)
		throws IOException, SAXException, TikaException {
		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
		int[] prefix = new int[] {227, 63};  				//Looking for a prefix set of bytes {E3, 3F} 
		int pos = 0;											
		int read;
		while( (read = stream.read()) > -1) {					// stream.read() moves to the next byte, and returns an integer value of the byte.  a value of -1 signals the EOF
			if(read == prefix[pos]) {								// is the last byte read the same as the first byte in the prefix?
			pos++;													
				if(pos == prefix.length) {								
					stream.skip(11);										// skip the 11 bytes after the prefix, which can vary.
					int length = stream.read();								// Set the next byte equal to the length of text in the user input field, see PRT schema
					stream.skip(1);											
					byte[] text = new byte[length];							// a new byte array called text is created.  It should contain an array of integer values of the user inputted text.
					IOUtils.readFully(stream, text);						
					String str = new String(text, 0, text.length, "UTF-8");	// turn it into a string, but does not remove null termination, assumes it's found to be utf-8
					xhtml.startElement("p");	
					xhtml.characters(str);
					xhtml.endElement("p");
					pos--;  
				}
			} 
			else {
				//Did not find the prefix. Reset the position counter.
				pos = 0;
			}
		}
	}


		
	/**
    * @deprecated This method will be removed in Apache Tika 1.0.
    */
    public void parse(
                   InputStream stream, ContentHandler handler, Metadata metadata)
                   throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
    }
}{code}   

 








> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


[jira] [Closed] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (Closed) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft closed TIKA-679.
-------------------------------

    Resolution: Fixed
    
> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, PRTParser.patch, PRTParserTest.patch, TikaTest.prt, TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		
> 		int[] prefix = new int[] {227, 63};	// Prefix bytes to search for: {0xE3, 0x3F}
> 		int pos = 0;	// Current position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {	// read() returns the next byte as an int (0-255), or -1 at EOF
> 			if(read == prefix[pos]) {	// Does the byte just read match the prefix byte at the current position?
> 				pos++;
> 				if(pos == prefix.length) {	// Has the whole prefix been matched?
> 					stream.skip(11);	// Skip the 11 bytes that follow the prefix; their contents can vary
> 					int lengthbyte = stream.read();	// The next byte holds the length of the text in the user input field, see PRT schema
> 					stream.skip(1);
> 					byte[] text = new byte[lengthbyte];	// Buffer for the raw bytes of the user-entered text
> 					IOUtils.readFully(stream, text);
> 					String str = new String(text, 0, text.length, "Cp437");	// Decode as Cp437 (MS-DOS encoding is assumed); a null terminator is not removed
> 					str = str.replace("\u03C6","\u00D8");	// Replace Cp437's lowercase phi with the Nordic "O with slash" to represent the diameter symbol
> 					metadata.add("Content",str);
> 					xhtml.startElement("p");
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos = 0;	// Start looking for the next prefix
> 				}
> 			}
> 			else {
> 				// The byte did not match the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	//Reached the end of file
> 	//System.out.println("Finished searching the file");	
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  
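The scanning loop in the quoted parser can be tried outside Tika with a small standalone sketch. The class and method names below (PrtNoteScanner, extract, readExactly) are hypothetical and not part of Tika or POI, and the record layout (prefix E3 3F, 11 skipped bytes, one length byte, one skipped byte, then the text) is taken only from the quoted code, not from a format specification. The sketch differs from the quoted code in three ways: it stops cleanly on truncated input instead of allocating an array of negative size, it reads skipped bytes explicitly because InputStream.skip may skip fewer bytes than requested, and it rechecks a mismatched byte against the first prefix byte so that a sequence such as E3 E3 3F is still matched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Tika: collects length-prefixed note texts. */
final class PrtNoteScanner {
    // Prefix bytes taken from the quoted parser: {0xE3, 0x3F}.
    private static final int[] PREFIX = {0xE3, 0x3F};

    /** Convenience overload for in-memory data. */
    static List<String> extract(byte[] data, Charset charset) {
        try {
            return extract(new ByteArrayInputStream(data), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> extract(InputStream in, Charset charset) throws IOException {
        List<String> notes = new ArrayList<>();
        int pos = 0;   // position inside PREFIX
        int read;
        while ((read = in.read()) != -1) {
            if (read == PREFIX[pos]) {
                pos++;
                if (pos == PREFIX.length) {
                    pos = 0;
                    // Assumed layout: 11 varying bytes, length byte, 1 byte, text.
                    if (readExactly(in, 11) == null) break;
                    int length = in.read();
                    if (length < 0) break;              // truncated record
                    if (readExactly(in, 1) == null) break;
                    byte[] text = readExactly(in, length);
                    if (text == null) break;
                    notes.add(new String(text, charset));
                }
            } else {
                // The mismatched byte may itself start a new prefix.
                pos = (read == PREFIX[0]) ? 1 : 0;
            }
        }
        return notes;
    }

    /** Reads exactly n bytes, or returns null if the stream ends first. */
    private static byte[] readExactly(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int got = in.read(buf, off, n - off);
            if (got < 0) return null;
            off += got;
        }
        return buf;
    }
}
```

The charset is a parameter so the same loop can be tried with Cp437, UTF-8 or plain ASCII test data. The retry in the else branch is only correct for this two-byte prefix; a longer prefix would need a more general matcher.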

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-679) Proposal for PRT Parser

Posted by "Troy Witthoeft (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Troy Witthoeft updated TIKA-679:
--------------------------------

    Attachment: TikaTest2.prt

This file has more advanced text entries.

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: TikaTest.prt, TikaTest2.prt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.
>  * Searches for a specific byte prefix and outputs the text of note entities.
>  * Does not support special DRAFT-PAK characters.
>  */
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>         	
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>     }
> 		
>     public void parse(
> 		InputStream stream, ContentHandler handler,
> 		Metadata metadata, ParseContext context)
> 		throws IOException, SAXException, TikaException {
> 		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
> 		int[] prefix = new int[] {227, 63};	// Prefix bytes to search for: {0xE3, 0x3F}
> 		int pos = 0;	// Current position inside the prefix
> 		int read;
> 		while( (read = stream.read()) > -1) {	// read() returns the next byte as an int (0-255), or -1 at EOF
> 			if(read == prefix[pos]) {	// Does the byte just read match the prefix byte at the current position?
> 				pos++;
> 				if(pos == prefix.length) {	// Has the whole prefix been matched?
> 					stream.skip(11);	// Skip the 11 bytes that follow the prefix; their contents can vary
> 					int length = stream.read();	// The next byte holds the length of the text in the user input field, see PRT schema
> 					stream.skip(1);
> 					byte[] text = new byte[length];	// Buffer for the raw bytes of the user-entered text
> 					IOUtils.readFully(stream, text);
> 					String str = new String(text, 0, text.length, "UTF-8");	// Decode as UTF-8 (assumed encoding); a null terminator is not removed
> 					xhtml.startElement("p");
> 					xhtml.characters(str);
> 					xhtml.endElement("p");
> 					pos = 0;	// Reset so the next prefix is matched from its first byte
> 				}
> 			}
> 			else {
> 				// The byte did not match the prefix. Reset the position counter.
> 				pos = 0;
> 			}
> 		}
> 	}
> 		
> 	/**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  
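The description quoted above decodes the note bytes as UTF-8, while another revision of this parser in the same thread decodes them as Cp437 and replaces the lowercase phi (U+03C6) with "O with slash" (U+00D8) for the diameter symbol. The sketch below shows that decoding step in isolation. The class and method names (PrtTextDecoding, decodeNote) are hypothetical, the null-terminator trimming is an addition that neither quoted revision performs, and it assumes the JDK in use ships the Cp437 (IBM437) charset.

```java
import java.nio.charset.Charset;

/** Hypothetical helper, not part of Tika: decodes one PRT note field. */
final class PrtTextDecoding {
    // Assumption: the runtime provides Cp437 (IBM437), as most full JDKs do.
    private static final Charset CP437 = Charset.forName("Cp437");

    static String decodeNote(byte[] raw) {
        String s = new String(raw, CP437);
        // Addition for illustration: cut the text at the first null byte.
        int nul = s.indexOf('\0');
        if (nul >= 0) {
            s = s.substring(0, nul);
        }
        // In Cp437, byte 0xED decodes to lowercase phi; map it to U+00D8
        // so it reads as a diameter symbol, as the Cp437 revision does.
        return s.replace('\u03C6', '\u00D8');
    }
}
```

With UTF-8, a lone 0xED byte is a malformed sequence and would not decode to a single meaningful character, which is one reason a single-byte code page is the safer assumption for these fields.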
