Posted to dev@tika.apache.org by Dave Meikle <lo...@gmail.com> on 2008/11/24 22:24:13 UTC

Tika 0.2 Release

Dear All,
The Tika project is looking to release Tika 0.2 now that we have graduated
from Incubation. As such I prepared a release candidate and called a
vote, but I believe I should have CC'd the private list here - apologies
for not doing this.

The thread following my original email on Tika Dev can be found here:
http://markmail.org/message/arqiobnduu6bwine

Regards,
Dave Meikle

Re: Tika 0.2 Release

Posted by Dave Meikle <lo...@gmail.com>.
Hi Grant,
Thanks for the feedback.

2008/11/28 Grant Ingersoll <gs...@apache.org>

> +0.5.  It could use a little more getting started documentation.  Perhaps
> the README could simply point to a getting started page on the Wiki or
> http://lucene.apache.org/tika/gettingstarted.html or something like that.
>  Or, maybe an example of the "mvn copy-dependencies" (I think) command, that
> shows people how to at least get the dependencies and put it into a useable
> place (for non Maven users) , plus Tika.


This can all be added.


>
> Also, what about binary releases?


We have binary releases. They can be found in the repository included as
part of the release. You can skip to them using the URL below:
http://people.apache.org/~dmeikle/tika-0.2-rc1/repos/org/apache/tika/tika/0.2/

Cheers,
Dave

Re: Tika 0.2 Release

Posted by Grant Ingersoll <gs...@apache.org>.
+0.5.  It could use a little more getting started documentation.   
Perhaps the README could simply point to a getting started page on the  
Wiki or http://lucene.apache.org/tika/gettingstarted.html or something  
like that.  Or, maybe an example of the "mvn copy-dependencies" (I
think) command, that shows people how to at least get the dependencies
and put them into a usable place (for non-Maven users), plus Tika.
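
For reference, the Maven goal that matches this description is most likely
dependency:copy-dependencies from the Maven Dependency Plugin. What follows is
a minimal sketch rather than anything taken from the Tika README: the LIB_DIR
name and its value are assumptions chosen for illustration, and the guard only
exists so the snippet does nothing harmful outside a Maven project.

```shell
# LIB_DIR is a hypothetical output location, not something the Tika build defines.
LIB_DIR="target/lib"

# Run only where Maven and a pom.xml are present, so the snippet is safe to paste anywhere.
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
    # Copy every dependency jar declared in the pom into LIB_DIR.
    mvn dependency:copy-dependencies -DoutputDirectory="$LIB_DIR"
else
    echo "mvn or pom.xml not found; nothing copied"
fi
```

A non-Maven user could then put the jars from that directory, plus the Tika
jar itself, on the classpath.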

Also, what about binary releases?

I don't know if any of these are a show stopper or not.  I tend to  
think not, except maybe the need for some more docs in the actual  
package.

Cheers,
Grant



On Nov 24, 2008, at 4:24 PM, Dave Meikle wrote:

> Dear All,
>
> The Tika project is looking to release Tika 0.2 now that we have
> graduated from Incubation. As such I prepared a release
> candidate and called a vote, but I believe I should have CC'd the
> private list here - apologies for not doing this.
>
> The thread following my original email on Tika Dev can be found here:
> http://markmail.org/message/arqiobnduu6bwine
>
> Regards,
> Dave Meikle



Re: Tika 0.2 Release

Posted by Dave Meikle <lo...@gmail.com>.
Hi,

2008/12/2 Mattmann, Chris A <ch...@jpl.nasa.gov>
>
> > On Sun, Nov 30, 2008 at 11:07 PM, Dave Meikle <lo...@gmail.com> wrote:
> >> I think we should probably release the current trunk given the improvement
> >> to provide Java 1.4 support. I would be inclined to release this as 0.2
> >> given we haven't released anything so far.
> >
> > OK, that's fine by me.
>
> +1.


On that basis I will roll a release from trunk tomorrow evening.

In the branch I will make the updates for the web site to include the
documentation we would like to publish - i.e. re-introduction of some of the
information Jukka has removed to avoid confusion.

Cheers,
Dave

Re: Tika 0.2 Release

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Guys,


On 12/2/08 1:41 PM, "Jukka Zitting" <ju...@gmail.com> wrote:

> Hi,
>
> On Sun, Nov 30, 2008 at 11:07 PM, Dave Meikle <lo...@gmail.com> wrote:
>> I think we should probably release the current trunk given the improvement
>> to provide Java 1.4 support. I would be inclined to release this as 0.2
>> given we haven't released anything so far.
>
> OK, that's fine by me.

+1.

>
>> One thing about the naming of the release artefacts - whilst we can simply
>> use the apache-tika name for the source JAR or tarball, given that the
>> binary file is the result of a maven build the name will be tika-0.2.jar.
>
> I think that's fine. Having "apache" in the package name is nice but
> not really necessary. I'd just leave the binary file names as is.
>

+1. I think that it's important to stay consistent with the naming, and
since we released the original incubating src as
"apache-tika-0.1-incubating-src.tar.gz", I think we should do something
similar, e.g., apache-tika-0.2-src.tar.gz. However, the jar file for
0.1-incubating was named tika-0.1-incubating.jar, so let's stay the same
with 0.2, e.g., tika-0.2.jar. So, long story short, +1 from me too.

Cheers,
Chris




> BR,
>
> Jukka Zitting
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Tika 0.2 Release

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Nov 30, 2008 at 11:07 PM, Dave Meikle <lo...@gmail.com> wrote:
> I think we should probably release the current trunk given the improvement
> to provide Java 1.4 support. I would be inclined to release this as 0.2
> given we haven't released anything so far.

OK, that's fine by me.

> One thing about the naming of the release artefacts - whilst we can simply
> use the apache-tika name for the source JAR or tarball, given that the
> binary file is the result of a maven build the name will be tika-0.2.jar.

I think that's fine. Having "apache" in the package name is nice but
not really necessary. I'd just leave the binary file names as is.

BR,

Jukka Zitting

Re: Tika 0.2 Release

Posted by Dave Meikle <lo...@gmail.com>.
Hi,

2008/11/30 Jukka Zitting <ju...@gmail.com>
>
> > Based on your email though, it sounds like a few s/incubator/lucene/
> > changes and another paragraph in the README file is really all that needs
> > to be "fixed" to make 0.2 releasable (in my opinion anyway)
>
> We could do that as well. Dave, what's your preference?


I think we should probably release the current trunk given the improvement
to provide Java 1.4 support. I would be inclined to release this as 0.2
given we haven't released anything so far.

One thing about the naming of the release artefacts - whilst we can simply
use the apache-tika name for the source JAR or tarball, given that the
binary file is the result of a maven build the name will be tika-0.2.jar.

Normally adding the apache- prefix doesn't matter, as other projects bundle
the generated jar within a tarball or zip generated by the assembly plugin,
so the name is simply on the outer packaging. In our case we would have to
add it in the pom, which I don't particularly like and which is non-standard.
Otherwise, it would be a rename which would result in a mismatch between the
file generated from source and our release, as well as manual work on
repository publishing.

In my eyes we could either a) stick with the name tika for source and
binary files or b) package them up as separate source and binary files using
assembly with the name apache-tika. Option b) has the added bonus of allowing
us to package the documentation with the release - I also have a patch on ice.

Any thoughts?

Cheers,
Dave

Re: Tika 0.2 Release

Posted by Chris Hostetter <ho...@fucit.org>.
: Agreed. The reason why we still had those in the source was that the
: migration wasn't yet complete when the release was cut and the vote
: started. Now we're done with the migration, but the old references
: still work so it's IMHO not that big a deal. It's not like we have the
: Incubation disclaimers etc. in there.

I disagree .... the main URL for the project redirects so that's not 
really a problem, but all of the other incubator references are things you 
want to avoid pointing people at, particularly in a release...

1) the svn URLs no longer work

2) the mail archive URLs are all stale as of the moment of graduation, no 
new messages will show up.

3) the ezmlm addresses (tika-dev-help@incubator, etc...) trigger a 
MAILER-DAEMON failure notice -- the body of the message informs people 
that the list has moved to lucene.apache.org, but most people probably 
don't read the body of failure messages from MAILER-DAEMON anymore.


I attached a patch with some changes for the 0.2 branch to TIKA-178 -- if 
you guys think the changes all make sense, then i think that covers all 
the concerns i had.


-Hoss


Re: Tika 0.2 Release

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, Nov 29, 2008 at 5:19 AM, Chris Hostetter
<ho...@fucit.org> wrote:
> I think the incubator ship has sailed: Tika has graduated, long live Tika!
> Any release after graduation shouldn't include incubator references.

Agreed. The reason why we still had those in the source was that the
migration wasn't yet complete when the release was cut and the vote
started. Now we're done with the migration, but the old references
still work so it's IMHO not that big a deal. It's not like we have the
Incubation disclaimers etc. in there.

> : now updated all Incubator references, so any new release will have
> : this issue fixed. Given the PMC pushback; perhaps we should just scrap
> : the 0.2 release and go directly to 0.3 based on the current trunk?
>
> I don't know enough about how the codelines have progressed to have an
> opinion on that, but I'm sure the next release can be *named* 0.2 even if
> you guys decide to abandon the current branch.

There are some nice post-0.2 improvements already in the trunk
(especially TIKA-175), so if we re-roll the release I'd start again
with the latest trunk. Whether we call the result "0.2" or "0.3" is
not that big a deal. I prefer "0.3" to avoid any confusion about
what's actually included in the release.

> Based on your email though, it sounds like a few s/incubator/lucene/
> changes and another paragraph in the README file is really all that needs
> to be "fixed" to make 0.2 releasable (in my opinion anyway)

We could do that as well. Dave, what's your preference?

> Ahhh..... see, my whole confusion was in thinking the docs have been
> applicable since 0.1 (I'm starting to really understand why Doug has
> argued so strongly in the past that the "main" set of docs a project has
> on the site should always be the last official release)

Yeah, that's a good point. Currently our documentation is still in
quite a flux, and I hope that the situation stabilizes at least by
1.0.

> By all means, ship with the documentation you have -- don't hold up a
> release waiting to write new docs -- but *something* should be in the
> README about how to "use" tika.  I've got a suggested path below.

OK, thanks! I committed the changes to trunk (see TIKA-177).

BR,

Jukka Zitting

Re: Tika 0.2 Release

Posted by Chris Hostetter <ho...@fucit.org>.
: I think it's fair to say that with the 0.2 release we're still pretty
: much in the transition from the Incubator to Lucene (and from a
: developer-only product to a general end user product). The main drive
	...
: made clearly either as an Incubator release or as a Lucene release
: once all the project migration is done. I guess I was the main
: proponent in pushing for the 0.2 release already while the Lucene
: migration was still incomplete.

I think the incubator ship has sailed: Tika has graduated, long live Tika! 
Any release after graduation shouldn't include incubator references.

: At least I was pretty vocal about switching to the jar format for our
	...
: tarball, at least I would rather fix the documentation than change the
: packaging format.

cool, as long as it was a conscious choice.

: now updated all Incubator references, so any new release will have
: this issue fixed. Given the PMC pushback; perhaps we should just scrap
: the 0.2 release and go directly to 0.3 based on the current trunk?

I don't know enough about how the codelines have progressed to have an 
opinion on that, but I'm sure the next release can be *named* 0.2 even if 
you guys decide to abandon the current branch.

Based on your email though, it sounds like a few s/incubator/lucene/ 
changes and another paragraph in the README file is really all that needs 
to be "fixed" to make 0.2 releasable (in my opinion anyway)
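
As a rough illustration of what such an s/incubator/lucene/ pass could look
like, here is a sketch that runs against a scratch directory rather than the
real release. The sample file and the URL written into it are made-up test
data, and the substitution pattern is an assumption; the actual changes are
the ones in the patch on the Tika issue tracker and would need review by hand.

```shell
# Demo in a scratch directory; the file content below is made-up sample data.
workdir=$(mktemp -d)
printf 'Site: http://incubator.apache.org/tika/\n' > "$workdir/README.txt"

# List files that still mention the incubator (case-insensitive, recursive),
# then rewrite the host name in each one, keeping a .bak copy for review.
grep -lir incubator "$workdir" | while IFS= read -r f; do
    sed -i.bak 's/incubator\.apache\.org/lucene.apache.org/g' "$f"
done

# Show the rewritten file.
cat "$workdir/README.txt"
```

Restricting the pattern to the host name, instead of the bare word
"incubator", avoids touching historical mentions such as the 0.1-incubating
release name.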

: documentation isn't complete (e.g. the Getting Started guide didn't
: yet exist in 0.2 release candidate) shouldn't IMHO be a blocker for a
: release (especially for a 0.x one). In any case it's an area where we
: are clearly getting better during the 0.x release cycle.
	...
: Currently the guide contains some forward-looking statements about the
: potentially upcoming 0.3 release; mostly that the "standalone" and

Ahhh..... see, my whole confusion was in thinking the docs have been 
applicable since 0.1 (I'm starting to really understand why Doug has 
argued so strongly in the past that the "main" set of docs a project has 
on the site should always be the last official release)

By all means, ship with the documentation you have -- don't hold up a 
release waiting to write new docs -- but *something* should be in the 
README about how to "use" tika.  I've got a suggested path below.

: Apache license header), so at least I prefer to not include the
: license header in those test files. See also
: http://markmail.org/message/m7jmgl3qncsffygb for related discussion on
: legal-discuss@.

Ah! ... that's awesome, i'm glad to see that the legal consensus seems to 
have changed -- ignore my comment.

: On the other hand, I don't see documentation as being a valid blocker
: for any 0.x release.

Hmmmm... holding up a release until docs get written is silly, but making 
sure people know how to find the docs that do exist (particularly when the 
ones they'll find if they go searching around online are significantly 
different) seems important.


Suggested PATCH....

diff orig/README.txt hoss/README.txt
67a68,70
> This will create a ./target/ directory containing the Apache Tika
> binary JAR file.
> 
70a74,86
> Documentation
> =============
> 
> You can build a local copy of the Tika documentation including
> JavaDocs using the following Maven 2 command in the Tika source
> directory:
> 
>     mvn site
> 
> You can then open the Tika Documentation in a web browser:
> 
>     ./target/site/documentation.html
> 


Re: Tika 0.2 Release

Posted by Dave Meikle <lo...@gmail.com>.
Hi,

Thanks, Grant and Chris, for the feedback, and Jukka for your comments.

2008/11/29 Jukka Zitting <ju...@gmail.com>

>
> > 1) release naming: should probably be apache-tika-0.2-src.jar  i seem to
> > recall someone somewhere saying that was important for apache releases
> > (and it's more consistent with the 0.1 release)
>
> Good point, we probably should do that. Dave, can you take care of this?
>

I can sort this out.


> > 2) release file format: the 0.1 release seems to have been a tar.gz ...
> > was a conscious choice made by the community to switch to distributing as a
> > src jar? otherwise you may want to publish both, or stick with tar.gz for
> > consistency (the docs on the website refer to the tarball when giving
> > examples of downloading and verifying)
>
> At least I was pretty vocal about switching to the jar format for our
> source releases, see most notably
> http://markmail.org/message/mwi4w2odztsxlcgi and
> http://markmail.org/message/jnthn2q4pghqxjlc. Unless the PMC prefers a
> tarball, at least I would rather fix the documentation than change the
> packaging format.
>

I agree with Jukka, but I am happy to add a tarball if required.


> > 3) incubator refs: as mentioned before, there are a lot of references to
> > the incubator that should be switched to point to lucene...
> >
> > hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ grep -lir incubator .
> > ./pom.xml
> > ./src/site/apt/download.apt
> > ./src/site/apt/index.apt
> > ./README.txt
>
> Fair point, and it goes with my statement above about getting the
> release out as soon as possible after graduation. In Tika trunk we've
> now updated all Incubator references, so any new release will have
> this issue fixed. Given the PMC pushback; perhaps we should just scrap
> the 0.2 release and go directly to 0.3 based on the current trunk?
>

If we were happy to release with 0.3 trunk, which I certainly am, I think
this would be best. If not, I can just update the 0.2 branch in line with
trunk.


> > 4) user docs: (I think grant may have already mentioned this) The
> > README.txt file talks about building Tika, but there doesn't seem to be
> > anything in the release that describes how to use Tika ... has any thought
> > been given to including more docs in the release itself? --
> > gettingstarted.html perhaps? ... at the very least a paragraph should be
> > added to the README referring to the gettingstarted.html page.
> >
> > Personally, i think including documentation.html and formats.html in the
> > release are also important -- they're going to change between releases,
> > probably more than the "getting started" type info, and should be
> > "versioned" so moving forward people with older versions won't get
> > misled by the docs on the site.
>
> The available documentation is already included in the source release
> in src/site and can be generated with "mvn site". The fact that the
> documentation isn't complete (e.g. the Getting Started guide didn't
> yet exist in 0.2 release candidate) shouldn't IMHO be a blocker for a
> release (especially for a 0.x one). In any case it's an area where we
> are clearly getting better during the 0.x release cycle.
>
> The README could mention "mvn site" as the command to generate the
> official documentation for that release and we could include a static
> snapshot of that in http://lucene.apache.org/tika/ for reference. This
> is something we should look at.
>

In the future we could update our maven build to produce and add this
information in the binary and source releases, but for just now I think this
is a good approach.


> > 5) artifacts missing: i tried following along with the gettingstarted.html
> > (my first time using maven BTW so i may have messed something up) and ran
> > into a snag... "mvn install" downloaded a bunch of dependencies (i think
> > they were maven's own dependencies since i'd never used it before), ran
> > some tests (these definitely had tika in the name) then downloaded some
> > more things, then told me it was installing tika-0.2.jar in my ~/.m2
> > directory.  When i looked at the next section "Build artifacts" it referred
> > to 3 jars in my target directory -- but i only have one...
> >
> > hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ find target -name \*jar
> > target/tika-0.2.jar
> >
> > ...is the gettingstarted.html wrong, or did the build not run correctly?
>
>
As Jukka states, the 0.2 release was only meant to contain the single
release artifact.


> > 6) RAT: Apache RAT noticed the following files missing license info...
> >
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
> >  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml
> >
> > ...I don't know if i've ever heard an opinion on needing to include the
> > ASL header in *.svg files (they are xml, but they are also clearly
> > generated by inkscape), but I do remember someone pointing out that test
> > data files in formats that are capable of containing comments in them (ie:
> > xml, html, etc...) should include the ASL header, such as...
> >
> > http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml
>
> I think that having the license header in such test files disrupts the
> main purpose of the test cases (i.e. you want to check whether the
> extracted text contains some specific test phrase, not necessarily the
> Apache license header), so at least I prefer to not include the
> license header in those test files. See also
> http://markmail.org/message/m7jmgl3qncsffygb for related discussion on
> legal-discuss@.
>
> However, if the PMC so wishes, I don't see any big problem in us
> adding the license headers in these test files. Note that in some
> future test files this might be troublesome, but for existing tests I
> don't see problems with this.
>

I have added this already - was maybe a bit quick given the issues Jukka is
raising, but as he points out the existing test-cases are fine. We probably
want to clear this one up for the future.


> > 7) javadocs: maybe this is something that is obvious to maven users, and
> > as a non-maven user i just don't know the magic incantation, but i
> > couldn't find any generated javadocs in the release (or in the "target"
> > directory after running "mvn install") ... since Tika is primarily a
> > library people will use in java apps, this seems kind of important.  If
> > there is a magic maven incantation to build these, let's include the
> > instructions somewhere (since the gettingstarted guide suggests that maven
> > is necessary to build tika, but not to use it (per the Artifacts and Ant
> > sections)
>
> Good point. The README could point out "mvn site" as the way to
> produce a browseable version of all documentation associated with the
> release, and as an added service we could (should?) publish specific
> per-version documentation also on the Tika web site.
>
> On the other hand, I don't see documentation as being a valid blocker
> for any 0.x release.


As with the documentation, we can improve our maven build to generate a
javadoc jar as part of the build. In the meantime, anyone who wants a
javadoc jar can use the following maven command, or the approach mentioned
by Jukka:

mvn javadoc:jar

Cheers,
Dave

Re: Tika 0.2 Release

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 28, 2008, at 7:43 PM, Jukka Zitting wrote:
>
>> 6) RAT: Apache RAT noticed the following files missing license info...
>>
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
>> !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml
>>
>> ...I don't know if i've ever heard an opinion on needing to include the
>> ASL header in *.svg files (they are xml, but they are also clearly
>> generated by inkscape), but I do remember someone pointing out that test
>> data files in formats that are capable of containing comments in them (ie:
>> xml, html, etc...) should include the ASL header, such as...
>>
>> http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml
>
> I think that having the license header in such test files disrupts the
> main purpose of the test cases (i.e. you want to check whether the
> extracted text contains some specific test phrase, not necessarily the
> Apache license header), so at least I prefer to not include the
> license header in those test files. See also
> http://markmail.org/message/m7jmgl3qncsffygb for related discussion on
> legal-discuss@.
>
> However, if the PMC so wishes, I don't see any big problem in us
> adding the license headers in these test files. Note that in some
> future test files this might be troublesome, but for existing tests I
> don't see problems with this.

I agree w/ Jukka here, I don't think those kinds of files need to have  
headers

-Grant

Re: Tika 0.2 Release

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, Nov 29, 2008 at 12:02 AM, Chris Hostetter
<ho...@fucit.org> wrote:
> My comments on RC1 are below.  i don't feel comfortable voting for it in
> its current state...

Thanks for the review, much appreciated!

I think it's fair to say that with the 0.2 release we're still pretty
much in the transition from the Incubator to Lucene (and from a
developer-only product to a general end user product). The main drive
(at least from my side) for the 0.2 release was just to get whatever
we had at the moment released as soon as possible for interested users
(release early, release often), and then focus in 0.3 to get all the
extra stuff like documentation and extra build artifacts in place.

I should also note that Chris Mattmann did call (see
http://markmail.org/message/ux3uc72zlwarow5i) for the release to be
made clearly either as an Incubator release or as a Lucene release
once all the project migration is done. I guess I was the main
proponent in pushing for the 0.2 release already while the Lucene
migration was still incomplete.

> 1) release naming: should probably be apache-tika-0.2-src.jar  i seem to
> recall someone somewhere saying that was important for apache releases
> (and it's more consistent with the 0.1 release)

Good point, we probably should do that. Dave, can you take care of this?

> 2) release file format: the 0.1 release seems to have been a tar.gz ...
> was a conscious choice made by the community to switch to distributing as a
> src jar? otherwise you may want to publish both, or stick with tar.gz for
> consistency (the docs on the website refer to the tarball when giving
> examples of downloading and verifying)

At least I was pretty vocal about switching to the jar format for our
source releases, see most notably
http://markmail.org/message/mwi4w2odztsxlcgi and
http://markmail.org/message/jnthn2q4pghqxjlc. Unless the PMC prefers a
tarball, at least I would rather fix the documentation than change the
packaging format.

> 3) incubator refs: as mentioned before, there are a lot of references to
> the incubator that should be switched to point to lucene...
>
> hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ grep -lir incubator .
> ./pom.xml
> ./src/site/apt/download.apt
> ./src/site/apt/index.apt
> ./README.txt

Fair point, and it goes with my statement above about getting the
release out as soon as possible after graduation. In Tika trunk we've
now updated all Incubator references, so any new release will have
this issue fixed. Given the PMC pushback; perhaps we should just scrap
the 0.2 release and go directly to 0.3 based on the current trunk?

> 4) user docs: (I think grant may have already mentioned this) The
> README.txt file talks about building Tika, but there doesn't seem to be
> anything in the release that describes how to use Tika ... has any thought
> been given to including more docs in the release itself? --
> gettingstarted.html perhaps? ... at the very least a paragraph should be
> added to the README referring to the gettingstarted.html page.
>
> Personally, i think including documentation.html and formats.html in the
> release are also important -- they're going to change between releases,
> probably more than the "getting started" type info, and should be
> "versioned" so moving forward people with older versions won't get
> misled by the docs on the site.

The available documentation is already included in the source release
in src/site and can be generated with "mvn site". The fact that the
documentation isn't complete (e.g. the Getting Started guide didn't
yet exist in 0.2 release candidate) shouldn't IMHO be a blocker for a
release (especially for a 0.x one). In any case it's an area where we
are clearly getting better during the 0.x release cycle.

The README could mention "mvn site" as the command to generate the
official documentation for that release and we could include a static
snapshot of that in http://lucene.apache.org/tika/ for reference. This
is something we should look at.

> 5) artifacts missing: i tried following along with the gettingstarted.html
> (my first time using maven BTW so i may have messed something up) and ran
> into a snag... "mvn install" downloaded a bunch of dependencies (i think
> they were maven's own dependencies since i'd never used it before), ran
> some tests (these definitely had tika in the name) then downloaded some
> more things, then told me it was installing tika-0.2.jar in my ~/.m2
> directory.  When i looked at the next section "Build artifacts" it referred
> to 3 jars in my target directory -- but i only have one...
>
> hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ find target -name \*jar
> target/tika-0.2.jar
>
> ...is the gettingstarted.html wrong, or did the build not run correctly?

The Getting Started guide is wrong in claiming that the standalone jar
should be available in a 0.2 build. I've fixed this in revision
721589. Only the tika-0.2.jar is produced by the 0.2 build.

Currently the guide contains some forward-looking statements about the
potentially upcoming 0.3 release; mostly that the "standalone" and
"jdk14" artifacts are included in 0.3 (they are available in current
trunk and the related Jira issues are targeted for release in 0.3). In
general I think it's not a good idea to publish documents with such
forward-looking statements, but in this case I think there is a pretty
good consensus about the contents of Tika 0.3, and when writing the
documentation I opted to publish forward-looking information rather
than hold it back and have to revise the document later on.

> 6) RAT: Apache RAT noticed the following files missing license info...
>
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml
>
> ...I don't know if i've ever heard an opinion on needing to include the
> ASL header in *.svg files (they are xml, but they are also clearly
> generated by inkscape), but I do remember someone pointing out that test
> data files in formats that are capable of containing comments in them (ie:
> xml, html, etc...) should include the ASL header, such as...
>
> http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml

I think that having the license header in such test files disrupts the
main purpose of the test cases (i.e. you want to check whether the
extracted text contains some specific test phrase, not necessarily the
Apache license header), so at least I prefer to not include the
license header in those test files. See also
http://markmail.org/message/m7jmgl3qncsffygb for related discussion on
legal-discuss@.

However, if the PMC so wishes, I don't see any big problem in us
adding the license headers in these test files. Note that in some
future test files this might be troublesome, but for existing tests I
don't see problems with this.
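
For anyone who wants a quick check without running RAT, the following is a
minimal sketch of listing files that lack the license header text. It is not
what RAT itself does; it only greps for one phrase from the standard header.
The temporary directory and the two file names (withHeader.xml, noHeader.txt)
are made up for illustration and are not files from the release.

```shell
# Sketch only: list files that do NOT contain the ASL header phrase.
# The directory layout and file names are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/test-documents"

# One file carrying the header inside an XML comment, one file without it.
cat > "$workdir/test-documents/withHeader.xml" <<'EOF'
<?xml version="1.0"?>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more
     contributor license agreements. -->
<doc>Test phrase</doc>
EOF
printf 'Test phrase\n' > "$workdir/test-documents/noHeader.txt"

# grep -L prints the names of files with no matching line; -r recurses.
missing=$(cd "$workdir" && grep -rL "Licensed to the Apache Software Foundation" . | sort)
echo "$missing"

# Clean up the scratch directory.
rm -rf "$workdir"
```

Against a real checkout the same grep could presumably be pointed at src/
instead of the scratch directory. Formats without a comment syntax (plain
text, for instance) would still show up in the list whatever is decided
about the headers.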

> 7) javadocs: maybe this is something that is obvious to maven users, and
> as a non-maven user i just don't know the magic incantation, but i
> couldn't find any generated javadocs in the release (or in the "target"
> directory after running "mvn install") ... since Tika is primarily a
> library people will use in java apps, this seems kind of important.  If
> there is a magic maven incantation to build these, let's include the
> instructions somewhere (since the gettingstarted guide suggests that maven
> is necessary to build tika, but not to use it, per the Artifacts and Ant
> sections)

Good point. The README could point out "mvn site" as the way to
produce a browseable version of all documentation associated with the
release, and as an added service we could (should?) publish specific
per-version documentation also on the Tika web site.

On the other hand, I don't see documentation as being a valid blocker
for any 0.x release.

BR,

Jukka Zitting

Re: Tika 0.2 Release

Posted by Chris Hostetter <ho...@fucit.org>.
My comments on RC1 are below.  i don't feel comfortable voting for it in
its current state...


1) release naming: should probably be apache-tika-0.2-src.jar; i seem to
recall someone somewhere saying that was important for apache releases
(and it's more consistent with the 0.1 release)

2) release file format: the 0.1 release seems to have been a tar.gz ...
was a conscious choice made by the community to switch to distributing as a
src jar? otherwise you may want to publish both, or stick with tar.gz for
consistency (the docs on the website refer to the tarball when giving
examples of downloading and verifying)

3) incubator refs: as mentioned before, there are a lot of references to
the incubator that should be switched to point to lucene...

hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ grep -lir incubator .
./pom.xml
./src/site/apt/download.apt
./src/site/apt/index.apt
./README.txt

4) user docs: (I think grant may have already mentioned this) The
README.txt file talks about building Tika, but there doesn't seem to be
anything in the release that describes how to use Tika ... has any thought
been given to including more docs in the release itself? --
gettingstarted.html perhaps? ... at the very least a paragraph should be
added to the README referring to the gettingstarted.html page.

Personally, i think including documentation.html and formats.html in the
release is also important -- they're going to change between releases,
probably more than the "getting started" type info, and should be
"versioned" so moving forward people with older versions won't get
misled by the docs on the site.

5) artifacts missing: i tried following along with the gettingstarted.html
(my first time using maven BTW so i may have messed something up) and ran
into a snag... "mvn install" downloaded a bunch of dependencies (i think
they were maven's own dependencies since i'd never used it before), ran
some tests (these definitely had tika in the name) then downloaded some
more things, then told me it was installing tika-0.2.jar in my ~/.m2
directory.  When i looked at the next section "Build artifacts" it referred
to 3 jars in my target directory -- but i only have one...

hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ find target -name \*jar
target/tika-0.2.jar

...is the gettingstarted.html wrong, or did the build not run correctly?

6) RAT: Apache RAT noticed the following files missing license info...

 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
 !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml

...I don't know if i've ever heard an opinion on needing to include the 
ASL header in *.svg files (they are xml, but they are also clearly 
generated by inkscape), but I do remember someone pointing out that test 
data files in formats that are capable of containing comments in them (ie: 
xml, html, etc...) should include the ASL header, such as...

http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml

7) javadocs: maybe this is something that is obvious to maven users, and
as a non-maven user i just don't know the magic incantation, but i
couldn't find any generated javadocs in the release (or in the "target"
directory after running "mvn install") ... since Tika is primarily a
library people will use in java apps, this seems kind of important.  If
there is a magic maven incantation to build these, let's include the
instructions somewhere (since the gettingstarted guide suggests that maven
is necessary to build tika, but not to use it, per the Artifacts and Ant
sections)

FWIW: browsing the nightly snapshot javadocs online i really wasn't even 
sure where i should start.  My suggestion: documentation.html would be 
damn near perfect as an overview.html javadoc file.


-Hoss


Re: Tika 0.2 Release

Posted by Chris Hostetter <ho...@fucit.org>.
: The thread following my original email on Tika Dev can be found here:
: http://markmail.org/message/arqiobnduu6bwine

I'm not very familiar with Tika, but I will try to find some time to take 
a look at the RC this weekend.  In response to comments made in the 
previous thread however...

1) Thilo's comment about updating the URLs is important.  we should make
sure any references to the incubator have been removed so there is no
confusion about whether this is an Incubator release, or an "official"
release.

2) Re Jukka's "How does this work in Lucene" question -- All Apache
releases must be approved by at least three +1 PMC votes (at least:
that's the way it worked the last time i checked).





-Hoss