Posted to dev@poi.apache.org by Dave Fisher <da...@comcast.net> on 2018/10/15 01:50:43 UTC

List of Slides for my China talk this coming weekend

Hi -

I’ve come up with the plan for my POI talk next weekend. I need to finalize my slides tomorrow so that some Chinese translation can be done. I have some questions that I’ll mark as “—>”. If you can answer, you’ll save me some research.

I plan to tell the story of POI, including Tika interactions and Common Crawler. In the end I want to give people two places to contribute, along with motivation.

(1) Title
	Name of presentation
	About Dave
(2) POI
	When it started in Jakarta; the simple use case.
	End of Jakarta
(3) OOXML and the Microsoft Open Specification Promise
	The OSP
	The flame war
	OpenXML4J - http://incubator.apache.org/ip-clearance/openxml4j.html
	XSSF, XSLF, and SS
(4) Tika and OOXML lite
	ApacheCon Oakland 2009 - Jukka asked Nick, Yegor and me during BarCamp if we could do something about the 13MB ooxml jar. Yegor came up with a solution in a day.
	Unit Test and your Beans are included
	—> Anyone: anything to add? XMLBeans impacts?
(5) Graphics2D
	Discuss output techniques developed.
	—> Yegor - is there some sample code you might share?
(6) Tika Text Extraction
	—> Could use pointers to the basic tutorial.
(7) Common Crawler - 1TB of samples
	Common Crawler - commoncrawl.org
	Common Crawler Download - centic9
	Regression sets for POI, Tika and PDFBox
	—> Are there other Apache projects that use these documents?
(8) The POI Toolbox
	A table of the various formats with input, output, and remarks.
(9) XMLBeans 3
	Bringing the product out of the attic.
	—> Any reasons besides better control of Entity Expansion attacks?
(10) Contributing to POI and Tika Will Improve Your Solr Search Results
	How Solr and similar architectures depend on Tika and Tika depends on POI
	Example: the discussion of header and footer choices for Word documents on the Tika list this past week.

Thanks for your help and feedback!

Regards,
Dave



Re: List of Slides for my China talk this coming weekend

Posted by Dominik Stadler <do...@gmx.at>.
Nice overview, would be interesting to watch, especially the slides about
the "old days"!

Ad (9): there were a few other bugs that popped up regularly (Unicode
handling, duplicate classes), which forced us to unattic it.
Ad (7): results for the POI regression tests are at
http://people.apache.org/~centic/poi_regression/reports/ if you want to add
a link.

Dominik



Re: List of Slides for my China talk this coming weekend

Posted by Yegor Kozlov <ye...@dinom.ru>.
Hi Dave,

> (2) POI
>         When it started in Jakarta the simple use case.
>         End of Jakarta
>

It's worth mentioning how hard it was to develop the APIs for the binary
formats with very little or no documentation. You can say that the letter
'H' in HSSF stands for 'horrible' and that the early work involved a lot
of guessing and reverse engineering.



> (3) OOXML and the Microsoft Open Specification Promise
>         The OSP
>         The flame war
>         OpenXML4J -
> http://incubator.apache.org/ip-clearance/openxml4j.html <
> http://incubator.apache.org/ip-clearance/openxml4j.html>
>         XSSF, XSLF, and SS
> (4) Tika and OOXML lite
>         Apachecon Oakland 2009 - Jukka asked Nick, Yegor and I during
> BarCamp if we could something about the 13MB ooxml jar. Yegor came up with
> a solution in a day.
>         Unit Test and your Beans are included
>         —> Anyone: anything to add? XMLBeans impacts?
> (5) Graphics2D
>         Discuss output techniques developed.
>         —> Yegor - is there some sample code you might share.
>

We have a good collection of examples at
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi



> (6) Tika Text Extraction
>         —> Could use pointers to the basic tutorial.
>

Say that Tika is the de facto standard for extracting text in the Java world.
Every time a Java project extracts text from an MS Office file, it does so
through Tika and POI. Solr, Jackrabbit and Nutch are examples.


> (7) Common Crawler - 1TB of samples
>         Common Crawler - commoncrawl.org
>         Common Crawler Download - centic9
>         Regression sets for POI, Tika and PDFBox
>         —> Are there other Apache projects that use these documents?
>
> (8) The POI Toolbox
>         A table of the various formats with input, output, and remarks.
>

Give a quick overview of the supported features. Excel, PowerPoint and Word
are the "big three" and the most mature.
To manipulate the formats we provide DOM-like APIs that construct a tree
of objects in memory.
To extract data we provide single-pass, SAX-like parsers that have a lower
memory footprint.
Show the how-to code snippets from the POI site.
Mention that POI can evaluate Excel formulas.
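
The DOM-versus-SAX trade-off can be illustrated with the JDK's own XML
parsers. This is only a rough sketch under stated assumptions: it uses
javax.xml rather than POI's APIs, and the class name DomVsSaxSketch and the
simplified <sheet>/<row>/<c> markup are made up for illustration (real
spreadsheet XML is more involved).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class DomVsSaxSketch {
    // Hypothetical, simplified markup used only for this illustration.
    static final String SHEET =
        "<sheet><row><c>1</c><c>2</c></row><row><c>3</c></row></sheet>";

    private static ByteArrayInputStream input() {
        return new ByteArrayInputStream(SHEET.getBytes(StandardCharsets.UTF_8));
    }

    // DOM style: the whole tree lives in memory, so cells can be edited in place.
    static int sumWithDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input());
        NodeList cells = doc.getElementsByTagName("c");
        cells.item(0).setTextContent("10"); // change the first cell from 1 to 10
        int sum = 0;
        for (int i = 0; i < cells.getLength(); i++) {
            sum += Integer.parseInt(cells.item(i).getTextContent());
        }
        return sum;
    }

    // SAX style: a single pass over parser events; no tree is built,
    // so memory use stays low, but the data is read-only.
    static int sumWithSax() throws Exception {
        final int[] sum = {0};
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if ("c".equals(qName)) {
                    sum[0] += Integer.parseInt(text.toString());
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(input(), handler);
        return sum[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM sum after edit: " + sumWithDom());
        System.out.println("SAX sum (read-only): " + sumWithSax());
    }
}
```

The DOM variant can modify a cell because it holds every node at once; the
SAX variant keeps only the text of the current element, which is why the
streaming style suits text extraction from large files.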

> (9) XMLBeans 3
>         Bringing the product out of the attic.
>         —> Any reasons besides better control of Entity Expansion attacks?
> (10) Contributing to POI and Tika Will Improve Your Solr Search Results
>         How Solr and similar architectures depend on Tika and Tika depends
> on POI
>         Example is Headers and Footers choices on Word documents on the
> Tika List this past week.
>
>
It might be worth mentioning the Panama Papers story, in which the information
from the leaked documents was extracted using Tika. If Tika and POI didn't
exist, it would have taken years to process these files. With Tika it was a
matter of hours.

Yegor
