Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/05/09 17:42:11 UTC

[GitHub] [solr] janhoy opened a new pull request, #846: Make 'latest' remain in URL instead of `9_0`

janhoy opened a new pull request, #846:
URL: https://github.com/apache/solr/pull/846

   Spinoff from #77 - do not rewrite `latest` to `9_0`, but the opposite, so that people are encouraged to share `latest` links.
   Note that it will still be possible to share an explicit link to the 9.0 version, which will be sure to route to the 9.0 guide. The only change is that the 'default' will be `latest` when working with the latest version.
   
   This may perhaps also help in boosting the PageRank of `latest` links in search engines?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] uschindler commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1123319707

   > Antora supports canonical URL header: https://docs.antora.org/antora/latest/playbook/site-url/#canonical-url which is good news, so if we redirect to "latest", but canonical remains "9_0" then we could be good?
   
   If the canonical URL is always "latest" (as described in the documentation), then Google would forget all old versions and only show links to latest. That's actually a good thing and would solve our problems. Unfortunately, we should maybe think of patching all old pages with a canonical link. Or, much better than patching, we could add an HTTP "Link:" header (see Google Docs above) to the `.htaccess`, where we could link all 8.x pages to the latest 8.11 refguide on the HTTP level. Same for 7.x and 6.x. This would at least remove all variants from Google except the latest version of each major release.
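
A minimal sketch of the mapping described above, so it can be checked on paper before any server configuration is touched. The helper name `canonical_link_header`, the `/guide/<major>_<minor>/<page>` URL layout, the default site root, and the choice of 7_7 and 6_6 as the last 7.x and 6.x guides are assumptions for illustration; only the 8.x to 8.11 mapping comes from the comment itself.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP Link header that would declare the
# canonical URL for an old ref-guide page.
# $1 = request path, $2 = site root (optional, assumed default below)
canonical_link_header() {
    path=$1
    base=${2:-https://solr.apache.org}   # assumed site root
    case $path in
        /guide/8_*) target=8_11 ;;       # from the thread: all 8.x -> 8.11
        /guide/7_*) target=7_7 ;;        # assumption: last 7.x guide
        /guide/6_*) target=6_6 ;;        # assumption: last 6.x guide
        *) return 1 ;;                   # not an old guide page
    esac
    rest=${path#/guide/}                 # e.g. 8_4/foo.html
    case $rest in
        */*) page=${rest#*/} ;;          # e.g. foo.html
        *)   page= ;;                    # version root without a page
    esac
    printf 'Link: <%s/guide/%s/%s>; rel="canonical"\n' "$base" "$target" "$page"
}
```

For example, `canonical_link_header /guide/8_4/foo.html` prints a header pointing at the same page under `8_11`, and the function returns non-zero for paths outside the 6.x to 8.x guides.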




[GitHub] [solr] magibney commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
magibney commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1125007411

   I realize this PR is merged (and thanks!), but I have a couple of questions that follow logically on the conversation here, so:
   1. Noticing that we don't have (and haven't historically had, I think) an old-style `robots.txt`, I wonder: is old-style `robots.txt` completely obviated? i.e., should we not bother having one?
   2. Sitemap.xml is being generated I think, and is [present in nightlies](https://nightlies.apache.org/solr/draft-guides/solr-reference-guide-nightly/sitemap.xml). But it doesn't appear to be accessible on the main site. I think it was previously present. Is there still a purpose served by sitemap.xml? Currently the nightlies version looks like it points to all (antora) versions -- I'm not sure whether we'd want to pare down the referenced pages to make sitemap.xml a proper complement to the "canonical, no-index/no-follow/no-archive" approach taken by this PR?
   3. If sitemap.xml is still relevant and we want to make it accessible, I think the sitemap spec calls for sitemap.xml to be referenced from an old-style robots.txt ... I'm not aware of other/newer approaches to referencing sitemap.xml.




[GitHub] [solr] janhoy commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
janhoy commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1123949073

   I have a list of pages that once existed but no longer do in 9.0:
   https://github.com/apache/solr/pull/596/files#diff-ebf3a521b24b4139995e9e70b7aeffc202df3152e84f4dd46d17d6649f343834R97




[GitHub] [solr] uschindler commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1124192880

   I think you should be able to do some tests with `curl -I testurl` on the staging site.
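
One way such a check could be scripted is sketched below. The helper name `has_noindex` is hypothetical; it only inspects response headers fed to it on standard input, so the staging URL is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and succeed
# (exit status 0) only if an X-Robots-Tag header containing "noindex" is present.
has_noindex() {
    # Strip carriage returns, keep X-Robots-Tag lines, then look for noindex.
    tr -d '\r' | grep -i '^x-robots-tag:' | grep -qi 'noindex'
}
```

Against a real site this would be used as `curl -sI "$url" | has_noindex && echo "noindex is set"`, where `$url` is whatever staging page is being tested.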




[GitHub] [solr] janhoy commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
janhoy commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1122102781

   Ideally I'd also want both '9_0' and 'latest' to work, but I cannot see that as a choice at https://docs.antora.org/antora/latest/playbook/urls-latest-version-segment-strategy/#key?
   
   Antora supports canonical URL header: https://docs.antora.org/antora/latest/playbook/site-url/#canonical-url which is good news, so if we redirect to "latest", but canonical remains "9_0" then we could be good?




[GitHub] [solr] HoustonPutman commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1124140943

   > Instead of an old-style robots.txt we may also use a `<locationMatch ^/guide/(6|7|8)_>` to the htaccess with `addHeader "X-Robots-Tag: noindex,nofollow,noarchive"` (noindex should be enough, we can still allow Google to follow links or archive).
   
   Yes I was actually about to start implementing this. I think it's the way to go.




[GitHub] [solr] HoustonPutman commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1124170974

   Ok I have this: https://github.com/apache/solr/pull/596/commits/78ecec9d1b03b9f4cd2b958cf326899733d5a7b8
   
   It'll be an absolute pain to test, and I'm sure it doesn't work out of the box. But it's not required for the 9.0 release, so we can tinker with it.




[GitHub] [solr] uschindler commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1124003233

   A robots.txt to hide the old releases looks like a good idea. We can just list all URL prefixes and we're done. Explicitly allowing some older pages could also be done.
   
   Instead of an old-style robots.txt we may also use a `<locationMatch ^/guide/(6|7|8)_>` to the htaccess with `addHeader "X-Robots-Tag: noindex,nofollow,noarchive"` (noindex should be enough, we can still allow Google to follow links or archive).
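
The match rule above can be written down as a small expectation table before any server change is attempted. This is only a sketch: the helper name `expected_robots_header` is hypothetical, it mirrors the `^/guide/(6|7|8)_` pattern with a shell `case`, and it emits `noindex` alone, following the remark that noindex should be enough. It is not Apache configuration; in Apache itself the response header would be set with the `Header` directive from mod_headers.

```shell
#!/bin/sh
# Hypothetical helper mirroring the proposed rule ^/guide/(6|7|8)_ :
# print the robots header an old ref-guide path is expected to carry,
# and print nothing for every other path.
expected_robots_header() {
    case $1 in
        /guide/6_*|/guide/7_*|/guide/8_*)
            printf 'X-Robots-Tag: noindex\n' ;;
    esac
}
```

`expected_robots_header /guide/8_4/foo.html` prints the header line, while a `latest` or `9_0` path prints nothing, which gives a quick list of what each URL should return once the rule is live.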




[GitHub] [solr] uschindler commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1121641300

   I can't really give a review on this PR as I don't know Antora.
   
   If you want my comment: I don't think this will change Google's ranking. There are pros and cons:
   - Having a stable "latest" version link is good because longer-living pages may appear at the top of search results, so people will use them. But there may also be the problem of older links disappearing from Google because they are no longer referenced: while they are live, they are redirected to "latest". Once they are no longer live, they are invisible unless we link them explicitly.
   - Always redirecting to "latest" seems bad to me, as it makes it impossible to add permalinks. Or is there the possibility to get some "permalink" button on each page, so that somebody citing a specific page can make a persistent link?
   
   So I have mixed feelings. Coming from "science", where persistent URLs for each version are important, I tend to think that redirecting to "latest" is not the best idea. From a business person's point of view, of course, linking to latest is fine.
   
   If both work (no redirect, and both pages visible next to each other with separate URLs), I would be happy. But the page should have a "canonical URL" meta header to inform Google about the duplicates and which version is the one to "bookmark" (the versioned one).




[GitHub] [solr] HoustonPutman commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1123943089

   > Maybe do it the other way round: If the canonical URL is always "latest" (as described in the documentation), then Google would forget all old versions and only show links to latest. That's actually a good thing and would solve our problems.
   
   +1, we want `latest` to be in the google results.
   
   I have tested this, and the current logic will set the canonical link to `latest`, not `9_0`. The version selection tool also links to the correct name of the page in each version, [if the page has been renamed](https://docs.antora.org/antora/latest/page/page-aliases/#page-aliases-attribute). So this is exactly the logic that we want.
   
   > * always redirecting to "latest" seems bad to me, as it makes it impossible to add permalinks. Or is there the possibility to get some "permalink" button on each page? So somebody citing a specific page can make a persistent link?
   
   We should certainly add this and it shouldn't be hard to do. I am often annoyed with the AWS docs, trying to link to the specific latest version.
   
   > We should maybe think of patching all old pages with a canonical link, too. Or much better instead of patching, we could add an HTTP "Link:" header (see Google Docs above) to the `.htaccess` where we maybe link all 8.x pages to latest 8.11 refguide on the HTTP level. Same for 7.x and 6.x. This would at least remove all variants from Google except the latest version of each major release.
   
   We definitely need to do something about the Solr 6-8 releases. Generally only the pages that don't exist in `latest` (after redirects) should be indexed in Google. I think that will be hard to do in general, though we can go back and do it. In my opinion, we can probably do a blanket `robots` file that says don't index the old ref-guide pages. It will make some information not searchable, but there should be very few pages that have been removed before 9.0.
   
   Somewhat sane suggestion: We make a `robots.txt` that disallows scraping all 6-8 ref guides. Then we create exceptions for the pages that have been removed in a certain version. So if a page (like autoscaling) was removed in 9.0, we create an exception that allows scraping `old-guide/8_11/autoscaling.html`; if a page was removed in 8.5, we allow `old-guide/8_4/page.html`. There shouldn't be too many of these pages, so we can go through and add the exceptions manually.
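
A rough sketch of what generating such a file could look like. The function name `make_robots_txt` and the `/guide` URL prefix are assumptions (the comment above writes the paths as `old-guide/...`, so the prefix would need adjusting to the real layout). The exception pages are passed in by hand, matching the idea of maintaining them manually. It relies on crawlers preferring the longest matching rule, which is how Google documents its handling of overlapping `Allow` and `Disallow` lines.

```shell
#!/bin/sh
# Hypothetical generator: block the 6.x-8.x ref guides by prefix, then
# re-allow individually listed pages that no longer exist in later guides.
# Arguments: absolute paths of the pages to keep crawlable.
make_robots_txt() {
    printf 'User-agent: *\n'
    for major in 6 7 8; do
        # "/guide" is an assumed prefix; each rule is a plain path-prefix match.
        printf 'Disallow: /guide/%s_\n' "$major"
    done
    for page in "$@"; do
        printf 'Allow: %s\n' "$page"
    done
}
```

Calling `make_robots_txt /guide/8_11/autoscaling.html > robots.txt` would produce three `Disallow` lines followed by a single `Allow` line for the autoscaling page.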




[GitHub] [solr] uschindler commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1121642960

   See: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls?hl=en#define-canonical




[GitHub] [solr] janhoy commented on pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
janhoy commented on PR #846:
URL: https://github.com/apache/solr/pull/846#issuecomment-1124196500

   > I think you should be able to do some tests with `curl -I testurl` on the staging site and check for (non-)existence of header.
   
   I was able to test `.htaccess` locally during my PR effort with docker like this:
   ```bash
   docker run --rm --name httpd -p 8000:80 \
     -v /Users/janhoy/git/solr-site/output:/usr/local/apache2/htdocs/ \
     -v "$(pwd)/my-httpd.conf:/usr/local/apache2/conf/httpd.conf" \
     httpd
   ```




[GitHub] [solr] janhoy merged pull request #846: Make 'latest' remain in URL instead of `9_0`

Posted by GitBox <gi...@apache.org>.
janhoy merged PR #846:
URL: https://github.com/apache/solr/pull/846

