Posted to doxia-users@maven.apache.org by krycho fandino <cr...@gmail.com> on 2008/03/01 21:13:57 UTC

Migrating documentation from HTML files

I'm a newbie using Doxia. I have a lot of documentation in HTML format and I'd
like to convert these files to apt format. Is there an easy way to transform
them? I want to create a Maven site for my project and, right now, I only
have this documentation in HTML format, without CSS styles or a menu.

Could you help me? Many thanks
Cristóbal

Re: Migrating documentation from HTML files

Posted by Vincent Siveton <vi...@gmail.com>.
2008/3/4, Lukas Theussl <lt...@apache.org>:
> Ehm, yes, sorry, I spoke before I thought. Of course, the parser
>  is an XML parser, so it will choke on any tags that are not properly
>  closed. So the input has to be XHTML. You can use tools like htmltidy [1] to
>  convert HTML to XHTML.
>
>  Btw, Vincent just added a simple tool to do document translations with
>  doxia: http://svn.apache.org/viewvc?view=rev&revision=633328
>  Feel free to test and comment! :)

You need to use the entire trunk for this.

I guess it will be easy to patch the converter with jtidy to support
html as an input format. Patches are welcome :)

Cheers,

Vincent

>  Cheers,
>  -Lukas
>
>  [1] http://tidy.sourceforge.net/

Re: Migrating documentation from HTML files

Posted by Lukas Theussl <lt...@apache.org>.
Ehm, yes, sorry, I spoke before I thought. Of course, the parser
is an XML parser, so it will choke on any tags that are not properly
closed. So the input has to be XHTML. You can use tools like htmltidy [1] to
convert HTML to XHTML.

Btw, Vincent just added a simple tool to do document translations with 
doxia: http://svn.apache.org/viewvc?view=rev&revision=633328
Feel free to test and comment! :)

Cheers,
-Lukas

[1] http://tidy.sourceforge.net/
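
For a large set of pages, the htmltidy step suggested above could be scripted. The following is only a sketch under stated assumptions: it assumes the tidy command-line tool is installed and on the PATH, and the class and method names (TidyBatch, tidyCommand, convert) are hypothetical and not part of Doxia or Tidy. The options used are tidy's -q (quiet), -asxhtml (write XHTML) and -o (output file).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TidyBatch {

    /** Builds the HTML Tidy command line that rewrites one HTML file as XHTML. */
    public static List<String> tidyCommand(Path html, Path xhtml) {
        return Arrays.asList(
                "tidy",
                "-q",        // suppress non-essential output
                "-asxhtml",  // emit well-formed XHTML
                "-o", xhtml.toString(),
                html.toString());
    }

    /** Runs tidy on one file; returns true unless tidy reported errors. */
    public static boolean convert(Path html, Path xhtml)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(tidyCommand(html, xhtml)).inheritIO().start();
        // Tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors.
        return p.waitFor() < 2;
    }

    public static void main(String[] args) throws Exception {
        // Each argument is an HTML file; the XHTML copy is written next to it.
        for (String name : args) {
            Path in = Paths.get(name);
            Path out = Paths.get(name.replaceFirst("\\.html?$", "") + ".xhtml");
            System.out.println(name + (convert(in, out) ? " -> " + out : " FAILED"));
        }
    }
}
```

Warnings are treated as success here because generated HTML such as latex2html output is likely to trigger many of them while still being converted; that threshold is a design choice and may need tightening.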



Re: Migrating documentation from HTML files

Posted by Cristóbal Fandiño <cr...@gmail.com>.
latex2html does not produce XHTML output. For example:

HTML
==========
<LINK REL="STYLESHEET" HREF="embebidos.css">

XhtmlParser
==========
org.apache.maven.doxia.parser.ParseException: Error parsing the model: end
tag name </HEAD> must be the same as start tag <LINK> from line 19
(position: TEXT seen ...<LINK REL="STYLESHEET"
HREF="embebidos.css">\n\n</HEAD>...
@21:8)
    at org.apache.maven.doxia.parser.AbstractXmlParser.parse(
AbstractXmlParser.java:57)


HTML
==========
<H2><A NAME="SECTION00221000000000000000"></A>
<A NAME="74"></A>
<BR>
Grupos de usuarios
</H2>

XhtmlParser
==========
org.apache.maven.doxia.parser.ParseException: Error parsing the model: end
tag name </H2> must be the same as start tag <BR> from line 119 (position:
TEXT seen ...<BR>\nGrupos de usuarios\n</H2>... @121:6)
    at org.apache.maven.doxia.parser.AbstractXmlParser.parse(
AbstractXmlParser.java:57)


XhtmlParser
==========
org.apache.maven.doxia.parser.ParseException: Error parsing the model:
attribute value must start with quotation or apostrophe not 3 (position:
TEXT seen ...<A NAME="91"></A>\n<TABLE CELLPADDING=3... @171:21)
    at org.apache.maven.doxia.parser.AbstractXmlParser.parse(
AbstractXmlParser.java:57)

... and many more
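
All three failures above are XML well-formedness errors (an unclosed LINK, an unclosed BR, and an unquoted attribute value) rather than anything specific to Doxia. As a minimal sketch, the check below uses the JDK's built-in JAXP parser to show that the same fragments are rejected until they are rewritten as XHTML. The JAXP parser is an assumption made for illustration (it is not the parser Doxia uses), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    /** Returns true if the fragment parses as well-formed XML. */
    public static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(fragment)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML as produced by latex2html: unclosed <BR>, unquoted attribute value.
        System.out.println(isWellFormed("<H2><BR>Grupos de usuarios</H2>"));
        System.out.println(isWellFormed("<TABLE CELLPADDING=3></TABLE>"));
        // The same fragments rewritten as XHTML.
        System.out.println(isWellFormed("<h2><br/>Grupos de usuarios</h2>"));
        System.out.println(isWellFormed("<table cellpadding=\"3\"></table>"));
    }
}
```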


2008/3/3, Lukas Theussl <lt...@apache.org>:
>
> Doxia doesn't have a LaTeX parser (I'd like to have one too!), so
> latex2html is the only solution I can think of (other LaTeX translators
> exist, but that's the only one I know). I am not sure what kind of
> output latex2html produces; however, the difference between HTML and XHTML
> shouldn't matter here. What kind of exceptions do you get? Maybe you
> could attach an example file in JIRA [1] with a snippet of your code so
> we can try to reproduce the problem?
>
> -Lukas
>
> [1] http://jira.codehaus.org/browse/DOXIA

Re: Migrating documentation from HTML files

Posted by Lukas Theussl <lt...@apache.org>.
Doxia doesn't have a LaTeX parser (I'd like to have one too!), so
latex2html is the only solution I can think of (other LaTeX translators
exist, but that's the only one I know). I am not sure what kind of
output latex2html produces; however, the difference between HTML and XHTML
shouldn't matter here. What kind of exceptions do you get? Maybe you
could attach an example file in JIRA [1] with a snippet of your code so
we can try to reproduce the problem?

-Lukas

[1] http://jira.codehaus.org/browse/DOXIA


Re: Migrating documentation from HTML files

Posted by krycho fandino <cr...@gmail.com>.
Thanks for your help. However, my HTML files aren't XHTML and XhtmlParser
throws a lot of exceptions. Perhaps I should convert these HTML files to
XHTML format, but I have a lot of pages and that would be a hard task.

Actually, I generated these HTML files using the latex2html conversion tool. I
don't know how I could transform LaTeX files into one of the markup languages
supported by Doxia (apt or xdoc). Could you give me some advice?


2008/3/2, Lukas Theussl <lt...@apache.org>:
>
> If you use the current development branch of Doxia (beta-1-SNAPSHOT),
> this should work rather well for simple HTML files. However, you
> will probably lose a lot of information if you have anything fancy
> (e.g. special layouts, tables and figures are not well supported), so don't
> expect it to be perfect. In particular, if you have figures, you might try
> translating to xdoc instead of apt (use XdocSink); that should work better.
>
> Cheers,
>
> -Lukas

Re: Migrating documentation from HTML files

Posted by Lukas Theussl <lt...@apache.org>.
If you use the current development branch of Doxia (beta-1-SNAPSHOT),
this should work rather well for simple HTML files. However, you
will probably lose a lot of information if you have anything fancy
(e.g. special layouts, tables and figures are not well supported), so don't
expect it to be perfect. In particular, if you have figures, you might try
translating to xdoc instead of apt (use XdocSink); that should work better.

Cheers,
-Lukas



Re: Migrating documentation from HTML files

Posted by Vincent Siveton <vi...@gmail.com>.
Hi,

Frankly, I have never tested your use case.

But I guess that you need an XHTML file as input with no
header, footer or navbar, something like the div bodyColumn in [1].

The snippet should be something like the following:

File f = new File( "blabla.html" );
XhtmlParser parser = new XhtmlParser();
StringWriter output = new StringWriter();
Sink sink = new AptSink( output );
// parse() takes a Reader and a Sink, so pass the sink rather than the writer
parser.parse( new FileReader( f ), sink );

The output writer will then contain the APT markup.

HTH,

Vincent

[1] http://maven.apache.org/doxia/
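
If the pages do carry a header, footer or navbar, one way to get at just the content block mentioned above is to cut it out of the XHTML before handing it to the parser. This is a sketch only, using the JDK's own XML APIs; it assumes the input is already well-formed XHTML and that the content sits in an element with id bodyColumn, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BodyColumnExtractor {

    /** Returns the serialized element whose id is bodyColumn, or null if there is none. */
    public static String extractBodyColumn(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        // Match on the id attribute only, whatever the element name is.
        Node body = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@id='bodyColumn']", doc, XPathConstants.NODE);
        if (body == null) {
            return null;
        }
        // An identity transform serializes the selected subtree back to text.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(body), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"banner\">Site header</div>"
                + "<div id=\"bodyColumn\"><p>Hello</p></div>"
                + "</body></html>";
        System.out.println(extractBodyColumn(page));
    }
}
```

Real pages that declare a DOCTYPE may make the JDK parser try to load the referenced DTD, so this sketch is best tried on fragments without one first.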
