Posted to user@poi.apache.org by Ylva Degerfeldt <yl...@gmail.com> on 2008/03/14 16:45:19 UTC

Is POI-HWPF really the best way to extract text from Word files?

Hi everyone,

Maybe I shouldn't ask this on this mailing list but I'm about to start
on a project where I'm going to extract different keywords from Word
files in the most common formats (like 97 - 2003) and I'd like to know
before I start if using POI-HWPF really is the best way to do that.

The thing is, I think I have found another way to do it: Oracle's
Clean Content SDK. Has anyone tried this? I was just wondering if it's
worth the time and effort to dig deeper into that or if I should
simply decide that POI-HWPF is the best solution and forget about the
other one. (I have a bit of a tight schedule so that's why I'm
asking.)

Thanks in advance,

Ylva

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Is POI-HWPF really the best way to extract text from Word files?

Posted by Raghu Kaippully <ra...@gmail.com>.
All the HWPF code is in the scratchpad section of the Subversion repository:
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/

For a simple example, have a look at
testcases/org/apache/poi/hwpf/usermodel/TestProblems.java under the
scratchpad. It has a method, testRangeDelete(), that scans the text pieces.
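
Once the text is out of the document, the keyword search itself needs
nothing from POI. Below is a minimal sketch, assuming the extracted text
is already available as a String. The class and method names
(KeywordFinder, findKeywords) are made up for illustration and are not
part of HWPF.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: finds keywords in already-extracted text.
final class KeywordFinder {

    /**
     * Returns the keywords that occur in the text as whole words,
     * ignoring case. The result keeps the iteration order of the
     * keywords argument.
     */
    static Set<String> findKeywords(String text, Collection<String> keywords) {
        Set<String> found = new LinkedHashSet<>();
        for (String keyword : keywords) {
            if (keyword == null || keyword.isEmpty()) {
                continue; // nothing sensible to search for
            }
            // Pattern.quote treats the keyword literally, so "C++" is safe.
            // The lookarounds reject matches that touch a letter or digit,
            // which keeps "java" from matching inside "JavaScript".
            Pattern p = Pattern.compile(
                    "(?<![\\p{L}\\p{N}])" + Pattern.quote(keyword) + "(?![\\p{L}\\p{N}])",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            if (p.matcher(text).find()) {
                found.add(keyword);
            }
        }
        return found;
    }
}
```

For instance, calling findKeywords with the text "Experienced Java
developer; knows SQL." and the keywords java, sql and python returns
java and sql. Whether whole-word matching is the right rule for CVs is
a design choice; a plain substring search would be simpler but would
also report "java" for a CV that only mentions JavaScript.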

-Raghu

On Fri, Mar 14, 2008 at 9:46 PM, Ylva Degerfeldt <yl...@gmail.com>
wrote:

> Yes, I'm only interested in extracting the text (more specifically,
> searching for different keywords in CVs in Word format).
>
> Where can I find those JUnit test cases? (I'm new to this whole thing.)
>
> /Ylva

Re: Is POI-HWPF really the best way to extract text from Word files?

Posted by Ylva Degerfeldt <yl...@gmail.com>.
Yes, I'm only interested in extracting the text (more specifically,
searching for different keywords in CVs in Word format).

Where can I find those JUnit test cases? (I'm new to this whole thing.)

/Ylva

On Fri, Mar 14, 2008 at 4:59 PM, Raghu Kaippully <ra...@gmail.com> wrote:
> Are you just looking to extract text from Word documents? Then HWPF probably
> will do the trick. I am not familiar with Clean Content SDK, so I can't comment
> on that. Why don't you give HWPF a try? Some of the JUnit test cases already
> extract text; maybe you can have a look at them.
>
>  -Raghu



Re: Is POI-HWPF really the best way to extract text from Word files?

Posted by Raghu Kaippully <ra...@gmail.com>.
Are you just looking to extract text from Word documents? Then HWPF probably
will do the trick. I am not familiar with Clean Content SDK, so I can't comment
on that. Why don't you give HWPF a try? Some of the JUnit test cases already
extract text; maybe you can have a look at them.

-Raghu

On Fri, Mar 14, 2008 at 9:15 PM, Ylva Degerfeldt <yl...@gmail.com>
wrote:

> Hi everyone,
>
> Maybe I shouldn't ask this on this mailing list but I'm about to start
> on a project where I'm going to extract different keywords from Word
> files in the most common formats (like 97 - 2003) and I'd like to know
> before I start if using POI-HWPF really is the best way to do that.
>
> The thing is, I think I have found another way to do it: Oracle's
> Clean Content SDK. Has anyone tried this? I was just wondering if it's
> worth the time and effort to dig deeper into that or if I should
> simply decide that POI-HWPF is the best solution and forget about the
> other one. (I have a bit of a tight schedule so that's why I'm
> asking.)
>
> Thanks in advance,
>
> Ylva