Posted to dev@tika.apache.org by "Keith R. Bennett (JIRA)" <ji...@apache.org> on 2007/09/13 20:19:32 UTC
[jira] Created: (TIKA-16) Issues with data files used for testing by TestParsers.
Issues with data files used for testing by TestParsers.
-------------------------------------------------------
Key: TIKA-16
URL: https://issues.apache.org/jira/browse/TIKA-16
Project: Tika
Issue Type: Bug
Reporter: Keith R. Bennett
The TestParsers class requires that the following test files be available:
testPDF.PDF
testTXT.TXT
testRTF.RTF
textXML.XML
testPPT.PPT
testWORD.doc
testEXCEL.xls
testOO2.odt
testHTML.html
Only the following are provided:
testHTML.html
testRTF.rtf
testTXT.txt
testXML.xml
Issue #1: The file names specified in the source code and the names of the files themselves should match exactly, including case. My personal preference is for lowercase extensions, but in any case they should be consistent. Therefore, where necessary, we need to rename the file or change the source code so that they match. (This is the case for *.rtf, *.txt, *.xml above.)
Issue #2: The missing files need to be added to the repository so that the test does not fail. A minimal file would suffice for the short term, but ultimately it would be nice to have files that exercise the parsers to the fullest.
Issue #3: We need to agree on a directory. The source code currently specifies that they should be in src/test/resources/testFiles, but in the repository they are in src/test/resources.
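The case and directory mismatches described in the issues above could be caught mechanically. The following is only a minimal sketch, not part of Tika: the class name TestFileCheck and the method findMissing are hypothetical. It compares the expected names against the names returned by a directory listing, rather than asking the file system whether each path exists, because on a case-insensitive file system an existence check for testRTF.RTF would succeed even when only testRTF.rtf is present.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; it is not a Tika class.
final class TestFileCheck {

    /**
     * Returns the expected file names that have no exact,
     * case-sensitive match among the entries of the given directory.
     */
    static List<String> findMissing(Path dir, List<String> expected) throws IOException {
        Set<String> present;
        // List the directory once and keep the names exactly as stored on disk.
        try (Stream<Path> entries = Files.list(dir)) {
            present = entries
                    .map(p -> p.getFileName().toString())
                    .collect(Collectors.toSet());
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            // String comparison is case-sensitive on every platform.
            if (!present.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Called with the directory named in the source code and the list of names that TestParsers requires, a non-empty result would point at the files that need to be added or renamed.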
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testPPT.PPT
This is a sample PowerPoint file that has a single slide saying:
----
Sample Powerpoint Slide
Created with Microsoft Powerpoint X for Mac Service Release 1
----
[jira] Updated: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testWORD.doc
Added a sample Word file, which contains:
----
Sample Word .doc File
Created with
Microsoft Word X for Mac, Service Release 1
----
[jira] Updated: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testPDF.PDF
Here is a sample PDF file I found in a Nutch directory.
[jira] Resolved: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bertrand Delacretaz resolved TIKA-16.
-------------------------------------
Resolution: Fixed
Fixed in revision 575896. I have moved all the test documents to src/test/resources/test-documents and added the newly supplied ones.
For testPDF.pdf, I used a different file because yours contained a person's name.
Thanks for your contribution!
[jira] Updated: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testEXCEL.xls
Excel test data file created with Microsoft Excel X for Mac Service Release 1.
Anyone should feel free to replace any of these files with better ones -- I'm just putting minimal ones here so that we have something at all.
[jira] Updated: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testOO2.odt
Here is a sample Open Office document. It contains the text:
This is a sample Open Office document, written in NeoOffice 2.2.1 for the Mac.
[jira] Assigned: (TIKA-16) Issues with data files used for testing by TestParsers.
Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bertrand Delacretaz reassigned TIKA-16:
---------------------------------------
Assignee: Bertrand Delacretaz