Posted to user@poi.apache.org by Ylva Degerfeldt <yl...@gmail.com> on 2008/03/14 16:45:19 UTC
Is POI-HWPF really the best way to extract text from Word files?
Hi everyone,
Maybe I shouldn't ask this on this mailing list, but I'm about to start
a project in which I will extract keywords from Word files in the most
common formats (such as Word 97-2003), and I'd like to know before I
start whether POI-HWPF really is the best way to do that.
I think I have found an alternative: Oracle's Clean Content SDK. Has
anyone tried it? I'm wondering whether it's worth the time and effort to
dig deeper into it, or whether I should simply settle on POI-HWPF and
forget about the other one. (I'm on a tight schedule, which is why I'm
asking.)
Thanks in advance,
Ylva
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: Is POI-HWPF really the best way to extract text from Word files?
Posted by Raghu Kaippully <ra...@gmail.com>.
All the HWPF code is in the scratchpad section of the Subversion repository:
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/
For a simple example, have a look at
testcases/org/apache/poi/hwpf/usermodel/TestProblems.java under the
scratchpad directory. Its testRangeDelete() method scans the text pieces.
-Raghu
On Fri, Mar 14, 2008 at 9:46 PM, Ylva Degerfeldt <yl...@gmail.com>
wrote:
> Yes, I'm only interested in extracting the text (more specifically,
> searching for different keywords in CVs in Word format).
>
> Where can I find those JUnit test cases? (I'm new to this whole thing.)
>
> /Ylva
Re: Is POI-HWPF really the best way to extract text from Word files?
Posted by Ylva Degerfeldt <yl...@gmail.com>.
Yes, I'm only interested in extracting the text (more specifically,
searching for different keywords in CVs in Word format).
Where can I find those JUnit test cases? (I'm new to this whole thing.)
/Ylva
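The keyword search described above can be sketched independently of how the text is obtained. The following is a minimal sketch in Java, assuming the document text has already been extracted into a String (for example with HWPF's WordExtractor, if your POI version includes it). The class name KeywordScanner and the method countKeywords are hypothetical names used for illustration only and are not part of POI.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: counts keyword hits in already-extracted text.
class KeywordScanner {

    /**
     * Counts case-insensitive, whole-word occurrences of each keyword in the text.
     * The text is assumed to come from a text extractor such as HWPF.
     */
    static Map<String, Integer> countKeywords(String text, Collection<String> keywords) {
        // LinkedHashMap keeps the keywords in the order they were supplied.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // Pattern.quote keeps characters such as '+' in "C++" literal.
            // The lookarounds reject matches embedded in a longer word,
            // so "Java" does not match inside "JavaScript".
            Pattern pattern = Pattern.compile(
                    "(?<![\\p{L}\\p{N}])" + Pattern.quote(keyword) + "(?![\\p{L}\\p{N}])",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher matcher = pattern.matcher(text);
            int hits = 0;
            while (matcher.find()) {
                hits++;
            }
            counts.put(keyword, hits);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a .doc file.
        String cvText = "Experienced Java developer. Skills: Java, SQL, JavaScript.";
        System.out.println(countKeywords(cvText, List.of("Java", "SQL", "C++")));
    }
}
```

Matching on whole words is a design choice that avoids counting "Java" inside "JavaScript"; plain substring matching would be simpler but would over-count in CVs that list related technologies.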
On Fri, Mar 14, 2008 at 4:59 PM, Raghu Kaippully <ra...@gmail.com> wrote:
> Are you just looking to extract text from Word documents? Then HWPF will
> probably do the trick. I am not familiar with the Clean Content SDK, so I
> can't comment on that. Why don't you give HWPF a try? Some of the JUnit
> test cases already extract text; maybe you can have a look at them.
>
> -Raghu
Re: Is POI-HWPF really the best way to extract text from Word files?
Posted by Raghu Kaippully <ra...@gmail.com>.
Are you just looking to extract text from Word documents? Then HWPF will
probably do the trick. I am not familiar with the Clean Content SDK, so I
can't comment on that. Why don't you give HWPF a try? Some of the JUnit
test cases already extract text; maybe you can have a look at them.
-Raghu