Posted to dev@openoffice.apache.org by Andrea Pescetti <pe...@apache.org> on 2012/12/16 20:47:17 UTC

[WEBSITE] Problem with .htm files

I had a long discussion with Infra today trying to find out why a change 
I had applied was not appearing. Analyzing it, it turns out that we have 
a problem already visible on over 400 pages and related to .htm files 
(as opposed to .html files).

Reproducing is easy:
1) Edit a .htm file, e.g., do this:
http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/content/pt/about/newsletter.htm?r1=1413471&r2=1422592&diff_format=h

2) Publish the changes and you get file duplication:

http://www.openoffice.org/pt/about/newsletter.htm
(the existing URL, ending in .htm, not updated)

http://www.openoffice.org/pt/about/newsletter.html
(a new URL, containing the fix)

This silent change of URLs is quite scary and we already have 401 
"duplicate" pages. For other examples see

http://www.openoffice.org/fr/Documentation/liens.htm
http://www.openoffice.org/fr/Documentation/liens.html

or

http://www.openoffice.org/ui/proposals/Readonly_mode.htm
http://www.openoffice.org/ui/proposals/Readonly_mode.html

Daniel Shahaf, who investigated the problem, suggests that we take a 
look at our path.pm.

Looking at it, I think the place to start investigating is line 14 of
http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/lib/path.pm?revision=1413471&view=markup
which seems to actually turn .htm files into .html files, but it's 
probably best that someone familiar with the CMS does the change, since 
I definitely don't want to break the website.

Regards,
   Andrea.
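
A minimal sketch of how the duplication described above could arise, assuming the build replaces a source file's extension with the one reported by the matching view. The pattern table, the %view_ext map and published_name are hypothetical simplifications for illustration, not the actual contents of path.pm or the CMS build code.

```perl
use strict;
use warnings;

# Hypothetical, simplified pattern table in the style of path.pm:
# [ regex, view name ].
my @patterns = (
    [ qr!\.html$!, 'html_page' ],
    [ qr!\.htm$!,  'html_page' ],
);

# Hypothetical map: the extension each view reports for its output.
my %view_ext = ( html_page => 'html' );

# Assumed build step: the source extension is replaced by the one the
# matching view reports, so newsletter.htm is published as newsletter.html.
sub published_name {
    my ($source) = @_;
    for my $p (@patterns) {
        my ($re, $view) = @$p;
        next unless $source =~ $re;
        (my $out = $source) =~ s/\.[^.\/]+$/.$view_ext{$view}/;
        return $out;
    }
    return $source;    # no pattern matched: name is left unchanged
}

print published_name('pt/about/newsletter.htm'), "\n";    # prints "pt/about/newsletter.html"
```

Under these assumptions the old newsletter.htm on the server is never overwritten, which would match the stale .htm page and the fresh .html page seen above.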

Re: [WEBSITE] Problem with .htm files

Posted by Kay Schenk <ka...@gmail.com>.

On 12/17/2012 04:38 PM, Dave Fisher wrote:
> Hi All,
>
> All done. .htm files remain .htm files on the server.
>
> The last step was to make sure that the SSI happened via the .htaccess.
>
> Later we will need to purge the duplicates.
>
> Regards,
> Dave

Good news! Much less confusion. I knew we had a goodly number of ".htm" 
pages, but I hadn't checked what was happening with them until Andrea 
brought this up.



-- 
------------------------------------------------------------------------
MzK

"No act of kindness, no matter how small, is ever wasted."
                                  -- Aesop

Re: [WEBSITE] Problem with .htm files

Posted by Andrea Pescetti <pe...@apache.org>.
On 18/12/2012 Dave Fisher wrote:
> All done. .htm files remain .htm files on the server.

Thanks! I confirm that
http://www.openoffice.org/pt/about/about.htm
and the related pages, where I found the problem, now work correctly.

> Later we will need to purge the duplicates.

That would be better, but it is not a high priority, since we don't link 
to those pages anywhere (or, at least, we shouldn't).

Regards,
   Andrea.

Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
Hi All,

All done. .htm files remain .htm files on the server.

The last step was to make sure that the SSI happened via the .htaccess.

Later we will need to purge the duplicates.

Regards,
Dave
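
A minimal sketch of how the duplicates to purge could be identified, assuming a list of published paths is available. The helper name htm_html_pairs and the sample list are hypothetical; the real list would have to come from the web tree itself.

```perl
use strict;
use warnings;

# Given a list of published paths, return the .htm paths that also exist
# as .html (hypothetical helper, for illustration only).
sub htm_html_pairs {
    my @paths = @_;
    my %seen  = map { $_ => 1 } @paths;
    my @pairs = grep { /\.htm$/ && $seen{ $_ . 'l' } } @paths;
    return sort @pairs;
}

my @dupes = htm_html_pairs(
    'fr/Documentation/liens.htm',
    'fr/Documentation/liens.html',
    'ui/proposals/Readonly_mode.htm',
    'index.html',
);
print "$_\n" for @dupes;
```

With this sample list only fr/Documentation/liens.htm is reported, since it is the only .htm path with an .html sibling. Which copy of each pair is the stale one still has to be decided separately.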




Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
On Dec 17, 2012, at 12:44 PM, Dave Fisher wrote:

> 
> On Dec 17, 2012, at 10:56 AM, Daniel Shahaf wrote:
> 
>> I think you could define:
>> 
>> sub htm_page {
>> my (@r) = html_page @_;
>> $r[1] = 'html' if $r[1] eq 'htm';
>> @r
>> }
>> 
>> and then use that in path.pm.
> 
> Thank you. I'll give that a try in a few hours when I finish my work day.
> 
> Meanwhile it is likely that you will delay the JIRA issue. I'll keep you posted both here and there.

Actually, the $r[1] line needed to be reversed: the wrapper has to change 'html' back to 'htm', not the other way round.

Here is the patch about to be applied:

Index: view.pm
===================================================================
--- view.pm	(revision 1423170)
+++ view.pm	(working copy)
@@ -101,6 +101,12 @@
     return Template($template)->render(\%args), html => \%args;
 }
 
+sub htm_page {
+ my (@r) = html_page @_;
+ $r[1] = 'htm' if $r[1] eq 'html';
+ @r
+}
+
 sub sitemap {
     my %args = @_;
     my $template = "content$args{path}";
Index: path.pm
===================================================================
--- path.pm	(revision 1423170)
+++ path.pm	(working copy)
@@ -11,7 +11,7 @@
 	[qr!rightnav.mdtext$!, single_narrative => { template => "navigator.html" }],
 	[qr!\.mdtext$!, single_narrative => { template => "single_narrative.html" }],
 	[qr!\.html$!, html_page => { template => "html_page.html" }],
-	[qr!\.htm$!, html_page => { template => "html_page.html" }],
+	[qr!\.htm$!, htm_page => { template => "html_page.html" }],
 ) ;
 
 # for specifying interdependencies between the files

We can discuss the cleanup of the old *.html files later.

Regards,
Dave
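
A minimal, self-contained sketch of the wrapper from the patch above, so that it can be run outside the CMS. The html_page here is a hypothetical stand-in that only mimics the shape of the return value seen in view.pm (content, extension, arguments); the assumption that the second element decides the published extension is inferred from this thread, not from the CMS documentation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real html_page in view.pm: returns the
# rendered content, the output extension, and the template arguments.
sub html_page {
    my %args = @_;
    return "<html>$args{path}</html>", html => \%args;
}

# Same shape as the wrapper in the patch: reuse html_page, then switch the
# reported extension back to 'htm' so the published file keeps its name.
sub htm_page {
    my @r = html_page(@_);
    $r[1] = 'htm' if $r[1] eq 'html';
    return @r;
}

my ($content, $ext, $args) = htm_page(path => '/pt/about/newsletter.htm');
print "$ext\n";    # prints "htm"
```

Wrapping html_page rather than copying it keeps the template handling in one place; only the reported extension differs between the two views.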



Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
On Dec 17, 2012, at 10:56 AM, Daniel Shahaf wrote:

> I think you could define:
> 
> sub htm_page {
>  my (@r) = html_page @_;
>  $r[1] = 'html' if $r[1] eq 'htm';
>  @r
> }
> 
> and then use that in path.pm.

Thank you. I'll give that a try in a few hours when I finish my work day.

Meanwhile it is likely that you will delay the JIRA issue. I'll keep you posted both here and there.

Regards,
Dave




Re: [WEBSITE] Problem with .htm files

Posted by Daniel Shahaf <da...@apache.org>.
Dave Fisher wrote on Mon, Dec 17, 2012 at 10:20:05 -0800:
> It can be tried on a local copy. The prospective changes are required in lib/view.pm, but exactly what these changes are I am "guessing" at this point.
> 
> It will be something about determining which type of page htm vs. html and then make the appropriate call here:
> 
> I think, but do not know. If someone wants to experiment on a local build then I'll give pointers, but I may not have time to check for a day or two.

I think you could define:

sub htm_page {
  my (@r) = html_page @_;
  $r[1] = 'html' if $r[1] eq 'htm';
  @r
}

and then use that in path.pm.

Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
On Dec 17, 2012, at 9:29 AM, Kay Schenk wrote:

> Dave, Andrea --
> 
> Only ONE copy is in source, the "htm" file. The duplicate gets
> generated from CMS -- but the new "html" is the most recent copy (on
> the actual web tree) -- generated from "htm".
> 
> Could we fix our templating to just continue to allow for "htm"
> instead of combining them as we're doing now?

It can be tried on a local copy. The prospective changes are required in lib/view.pm, but exactly what these changes are I am "guessing" at this point.

It will be something about determining which type of page htm vs. html and then make the appropriate call here:

I think, but do not know. If someone wants to experiment on a local build then I'll give pointers, but I may not have time to check for a day or two.

> Maybe that would work. The
> web server seems happy enough to serve up "htm" in addition to "html".
> 
> I don't know what this would do to the "html" file now on the web
> server. Maybe a re-publish for, say, /pt/about would make this new
> "html" file go away once we fixed the templating.

Either way we need to republish those directories somehow to either remove the extra htm or the extra html files.

The redirect that has been requested will force all *.htm into *.html which I find to be simpler, but may be confusing to volunteers.

Regards,
Dave




Re: [WEBSITE] Problem with .htm files

Posted by Kay Schenk <ka...@gmail.com>.
On Sun, Dec 16, 2012 at 3:54 PM, Dave Fisher <da...@comcast.net> wrote:
> Hi Andrea,
>
> On Dec 16, 2012, at 2:44 PM, Andrea Pescetti wrote:
>
>> Dave Fisher wrote:
>>> I think that we can purge these *.htm duplicates, but if we do it
>>> will be a "sledgehammer" build.
>>
>> It will also be a problem, unless we accompany it with other changes: for example, http://www.openoffice.org/pt/ would completely break, and all external sites that now link to some of our .htm files would break too.
>
> Got it.
>
>>> It was intentional. Before doing so we would need to make a group
>>> decision about how to treat the two types of files.
>>
>> Regardless of what templates we apply, the best solution should:
>> 1) Allow a .htaccess redirect/rewrite from .htm to .html (to preserve existing internal and external links)
>> 2) Have the SVN file names match the URLs: editing a file named "news.htm" in SVN should not result in a change in a page with URL ".../news.html". The current handling confuses the CMS too (for example, no diff is reported). So either we mass-rename files from .htm to .html and rely on 1) above, or we don't change .htm to .html but publish .htm URLs.
>
> We need only do (1), and I would do it within the httpd config like our existing redirects. Regardless, if there are both file1.htm and file1.html in the source, one of these must be removed from the source svn.

Dave, Andrea --

Only ONE copy is in source, the "htm" file. The duplicate gets
generated from CMS -- but the new "html" is the most recent copy (on
the actual web tree) -- generated from "htm".

Could we fix our templating to just continue to allow for "htm"
instead of combining them as we're doing now? Maybe that would work. The
web server seems happy enough to serve up "htm" in addition to "html".

I don't know what this would do to the "html" file now on the web
server. Maybe a re-publish for, say, /pt/about would make this new
"html" file go away once we fixed the templating.

Thoughts?





Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
Hi Andrea,

On Dec 16, 2012, at 2:44 PM, Andrea Pescetti wrote:

> Dave Fisher wrote:
>> I think that we can purge these *.htm duplicates, but if we do it
>> will be a "sledgehammer" build.
> 
> It will also be a problem, unless we accompany it with other changes: for example, http://www.openoffice.org/pt/ would completely break, and all external sites that now link to some of our .htm files would break too.

Got it.

>> It was intentional. Before doing so we would need to make a group
>> decision about how to treat the two types of files.
> 
> Regardless of what templates we apply, the best solution should:
> 1) Allow a .htaccess redirect/rewrite from .htm to .html (to preserve existing internal and external links)
> 2) Have the SVN file names match the URLs: editing a file named "news.htm" in SVN should not result in a change in a page with URL ".../news.html". The current handling confuses the CMS too (for example, no diff is reported). So either we mass-rename files from .htm to .html and rely on 1) above, or we don't change .htm to .html but publish .htm URLs.

We need only do (1), and I would do it within the httpd config like our existing redirects. Regardless, if there are both file1.htm and file1.html in the source, one of these must be removed from the source svn.

See https://issues.apache.org/jira/browse/INFRA-5668 for this request along with a set to avoid an incubator redirect for certain links.

We do not need to do (2) because we already are making this change in the staging and publish. You see no diffs for the old htm files because they are not changed. I do see diffs in the html versions of the pages. Even with the redirect in place, it still makes sense to edit the pages to use *.html and not *.htm in links.

> 
>> There are two different procedures from view.pm used:  ...
>> There are several templates used from templates/.
> 
> To me, .htm and .html are not different file types and were never used as such: I mean, volunteers historically committed .htm or .html according to their habits, but it doesn't make sense to have different ways of handling them now. So I would tend to rename all .htm to .html and put the .htaccess redirect in place, and have only one "type" of HTML files to handle.

I think it is ok to force the pages to be *.html. We should have some consistency. Maybe soon it will be time to start switching openoffice.org to mdtext.

Regards,
Dave

> 
> Regards,
>  Andrea.


Re: [WEBSITE] Problem with .htm files

Posted by Andrea Pescetti <pe...@apache.org>.
Dave Fisher wrote:
> I think that we can purge these *.htm duplicates, but if we do it
> will be a "sledgehammer" build.

It will also be a problem, unless we accompany it with other changes: 
for example, http://www.openoffice.org/pt/ would completely break, and 
all external sites that now link to some of our .htm files would break too.

> It was intentional. Before doing so we would need to make a group
> decision about how to treat the two types of files.

Regardless of what templates we apply, the best solution should:
1) Allow a .htaccess redirect/rewrite from .htm to .html (to preserve 
existing internal and external links)
2) Have the SVN file names match the URLs: editing a file named 
"news.htm" in SVN should not result in a change in a page with URL 
".../news.html". The current handling confuses the CMS too (for example, 
no diff is reported). So either we mass-rename files from .htm to .html 
and rely on 1) above, or we don't change .htm to .html but publish .htm 
URLs.
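
To make point 1) concrete, here is a minimal Python sketch of the mapping such a redirect would implement. It is not the actual .htaccess or httpd rule; the name redirect_target is hypothetical, and it is assumed that only the URL path (no query string) is passed in.

```python
import re

# Matches a path whose final extension is exactly ".htm".
HTM_SUFFIX = re.compile(r"\.htm$")


def redirect_target(url_path):
    """Return the .html path a .htm request should be redirected to,
    or None when the path does not end in .htm."""
    if HTM_SUFFIX.search(url_path):
        return HTM_SUFFIX.sub(".html", url_path)
    return None
```

With a rule of this shape, old internal and external links to .htm pages would keep working after a mass-rename, while .html requests would be served directly.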

> There are two different procedures from view.pm used:  ...
> There are several templates used from templates/.

To me, .htm and .html are not different file types and were never used 
as such: I mean, volunteers historically committed .htm or .html 
according to their habits, but it doesn't make sense to have different 
ways of handling them now. So I would tend to rename all .htm to .html 
and put the .htaccess redirect in place, and have only one "type" of 
HTML files to handle.

Regards,
   Andrea.

Re: [WEBSITE] Problem with .htm files

Posted by Dave Fisher <da...@comcast.net>.
On Dec 16, 2012, at 12:41 PM, Rob Weir wrote:

> On Sun, Dec 16, 2012 at 2:47 PM, Andrea Pescetti <pe...@apache.org> wrote:
>> I had a long discussion with Infra today trying to find out why a change I
>> had applied was not appearing. Analyzing it, it turns out that we have a
>> problem already visible on over 400 pages and related to .htm files (as
>> opposed to .html files).
>> 
>> Reproducing is easy:
>> 1) Edit a .htm file, e.g., do this:
>> http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/content/pt/about/newsletter.htm?r1=1413471&r2=1422592&diff_format=h
>> 
>> 2) Publish the changes and you get file duplication:
>> 
>> http://www.openoffice.org/pt/about/newsletter.htm
>> (the existing URL, ending in .htm, not updated)
>> 
>> http://www.openoffice.org/pt/about/newsletter.html
>> (a new URL, containing the fix)
>> 
>> This silent change of URLs is quite scary and we already have 401
>> "duplicate" pages. For other examples see
>> 
>> http://www.openoffice.org/fr/Documentation/liens.htm
>> http://www.openoffice.org/fr/Documentation/liens.html
>> 
> 
> 
> When I build locally I see that input htm files are published as html
> files.  But I don't see any duplications.  Maybe the duplicates are
> just left over from earlier?

Exactly.

From path.pm
        [qr!\.html$!, html_page => { template => "html_page.html" }],
        [qr!\.htm$!, html_page => { template => "html_page.html" }],

r1221295 | wave | 2011-12-20 06:52:47 -0800 (Tue, 20 Dec 2011) | 1 line

Wrap .htm files like .html. Comment a couple of PayPal references. The page "donate-thanks.html" states that it is landing page after PayPal donations to TOO - changed to request donations to the ASF. (Not sure if it is still used.)

Daniel's example was from three hours prior, on Dec. 20, 2011.

I think that we can purge these *.htm duplicates, but if we do it will be a "sledgehammer" build.

> 
> 
>> or
>> 
>> http://www.openoffice.org/ui/proposals/Readonly_mode.htm
>> http://www.openoffice.org/ui/proposals/Readonly_mode.html
>> 
>> Daniel Shahaf, who investigated the problem, suggests that we take a look at
>> our path.pm.
>> 
>> Looking at it, I think the place to start investigating is line 14 of
>> http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/lib/path.pm?revision=1413471&view=markup
>> which seems to actually turn .htm files into .html files, but it's probably
>> best that someone familiar with the CMS does the change, since I definitely
>> don't want to break the website.

It was intentional. Before doing so we would need to make a group decision about how to treat the two types of files.

        [qr!\.html$!, html_page => { template => "html_page.html" }],
        [qr!\.htm$!, html_page => { template => "html_page.html" }],

Note that this will change htm to html just like the following mdtext files are changed into html:

        [qr!doctype.mdtext$!, single_narrative => { template => "doctype.html" }],
        [qr!brand.mdtext$!, single_narrative => { template => "brand.html" }],
        [qr!footer.mdtext$!, single_narrative => { template => "footer.html" }],
        [qr!topnav.mdtext$!, single_narrative => { template => "navigator.html" }],
        [qr!leftnav.mdtext$!, single_narrative => { template => "navigator.html" }],
        [qr!rightnav.mdtext$!, single_narrative => { template => "navigator.html" }],
        [qr!\.mdtext$!, single_narrative => { template => "single_narrative.html" }],

There are two different procedures from view.pm used:

single_narrative and html_page.

There are several templates used from templates/.

html_page.html
single_narrative.html
navigator.html
doctype.html
brand.html
footer.html
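
The path.pm entries quoted above can be read as an ordered table from source-file pattern to view procedure and template. The Python sketch below mirrors a few of them to show why a .htm source ends up published under a .html name. It is an illustration only: the names PATTERNS and route are hypothetical, and both the first-match-wins ordering and the "always write .html" output rule are assumptions about how the CMS consumes the table, not code taken from it.

```python
import re

# Hypothetical mirror of a few path.pm entries: (pattern, view, template).
PATTERNS = [
    (re.compile(r"\.html$"), "html_page", "html_page.html"),
    (re.compile(r"\.htm$"), "html_page", "html_page.html"),
    (re.compile(r"topnav.mdtext$"), "single_narrative", "navigator.html"),
    (re.compile(r"\.mdtext$"), "single_narrative", "single_narrative.html"),
]


def route(source_path):
    """Return (view, template, output_path) for the first matching
    pattern, or None when no pattern matches."""
    for pattern, view, template in PATTERNS:
        if pattern.search(source_path):
            # Assumption: the published file always gets a .html extension.
            output_path = re.sub(r"\.[^./]+$", ".html", source_path)
            return view, template, output_path
    return None
```

Under these assumptions, newsletter.htm and newsletter.html in the same directory would both map to newsletter.html, which matches the behaviour discussed in this thread.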

Regards,
Dave

>> 
>> Regards,
>>  Andrea.


Re: [WEBSITE] Problem with .htm files

Posted by Rob Weir <ro...@apache.org>.
On Sun, Dec 16, 2012 at 2:47 PM, Andrea Pescetti <pe...@apache.org> wrote:
> I had a long discussion with Infra today trying to find out why a change I
> had applied was not appearing. Analyzing it, it turns out that we have a
> problem already visible on over 400 pages and related to .htm files (as
> opposed to .html files).
>
> Reproducing is easy:
> 1) Edit a .htm file, e.g., do this:
> http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/content/pt/about/newsletter.htm?r1=1413471&r2=1422592&diff_format=h
>
> 2) Publish the changes and you get file duplication:
>
> http://www.openoffice.org/pt/about/newsletter.htm
> (the existing URL, ending in .htm, not updated)
>
> http://www.openoffice.org/pt/about/newsletter.html
> (a new URL, containing the fix)
>
> This silent change of URLs is quite scary and we already have 401
> "duplicate" pages. For other examples see
>
> http://www.openoffice.org/fr/Documentation/liens.htm
> http://www.openoffice.org/fr/Documentation/liens.html
>


When I build locally I see that input htm files are published as html
files.  But I don't see any duplications.  Maybe the duplicates are
just left over from earlier?


> or
>
> http://www.openoffice.org/ui/proposals/Readonly_mode.htm
> http://www.openoffice.org/ui/proposals/Readonly_mode.html
>
> Daniel Shahaf, who investigated the problem, suggests that we take a look at
> our path.pm.
>
> Looking at it, I think the place to start investigating is line 14 of
> http://svn.apache.org/viewvc/openoffice/ooo-site/trunk/lib/path.pm?revision=1413471&view=markup
> which seems to actually turn .htm files into .html files, but it's probably
> best that someone familiar with the CMS does the change, since I definitely
> don't want to break the website.
>
> Regards,
>   Andrea.