Add rel=canonical support #74

@briansmith

Description

Please add the correct <link rel=canonical> link to each generated documentation page, when the crate provides a canonical documentation URL. This will help search engines disambiguate the docs.rs copies of the documentation and avoid docs.rs looking like a content farm or spam blog to search engines.

Activity

onur (Member) commented on Oct 20, 2016

Thanks for this awesome idea. I didn't know about rel=canonical.

I'll definitely add this, but I need to save original documentation links first. That is the only thing I forgot to save when building crates.

added a commit that references this issue on Jan 24, 2017
jyn514 (Member) commented on Nov 27, 2019

@GuillaumeGomez is this something that should be done by docs.rs? It seems pretty easy to implement but I think it would make more sense to have a rel=canonical link in every page generated with rustdoc, not just the ones on docs.rs.

GuillaumeGomez (Member) commented on Nov 27, 2019

If I understand correctly, it's supposed to be used in the <head> part. However, I'm not sure if this'll be really useful here considering that all URLs are unique in our case... Also, the URL needs to include the domain name, so if we want to add it, I think it should be done on docs.rs side.

jyn514 (Member) commented on Nov 27, 2019

This is (correct me if I'm wrong) for documentation URLs that are explicitly given in Cargo.toml, not for different versions of the docs. See for example https://github.com/serde-rs/serde/blob/master/serde/Cargo.toml#L9. The domain name is given here, it's not necessarily the same site that's currently hosting the docs.

GuillaumeGomez (Member) commented on Nov 27, 2019

Oh in this case it makes sense to have canonical then I guess. But still on docs.rs side. :)

self-assigned this
on Nov 27, 2019
added and removed on Jun 27, 2020
workingjubilee (Member) commented on Aug 27, 2020

Priority: higher than medium, maybe, but not existentially critical, so not high (I guess that's the important label?). The existence of https://docs.rs/tracing/0.1.19/tracing punishes e.g. https://tracing.rs/tracing. This needs to be resolved sooner rather than later.

14 remaining items

removed their assignment
on Feb 15, 2022
jsha (Contributor) commented on Jul 14, 2022

As noted in #1438, lack of rel="canonical" seems to be hurting search results within docs.rs too. Given a plethora of identical content across various releases of a crate, Google chooses one effectively at random, which can cause a page from a newer version of a crate to be excluded in favor of that same page from an older version of a crate. I have a two-part proposal:

Whichever tag we emit, we would do it in docs.rs at page load time, not in rustdoc and not at doc build time. This allows us to include the tags even for old releases, and to remove them easily if they turn out not to have the effect we want.

jsha (Contributor) commented on Jul 20, 2022

> • For crates with a package.documentation that does not start with https://docs.rs, set no rel="canonical", and set <meta name="robots" content="noindex"> (https://developers.google.com/search/docs/advanced/crawling/block-indexing). This should prevent the docs.rs documentation from competing with self-hosted documentation for canonical status, while leaving the docs.rs pages available for users who navigate there directly.

An update on this: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls says:

> Don't use noindex as a means to prevent selection of a canonical page. This directive is intended to exclude the page from the index, not to manage the choice of a canonical page.

Also, checking one of the examples above, https://tracing.rs has <meta name="robots" content="noindex"> in its source. Presumably the intent is that people should find the docs.rs page on search rather than the prerelease docs on https://tracing.rs. If the tracing crate set its documentation URL to https://tracing.rs (it doesn't), we would wind up in the tricky situation where neither https://tracing.rs nor https://docs.rs/tracing showed up in search indexes.

Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding noindex on docs.rs any time there is a documentation URL may be too aggressive.
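A survey like the one mentioned above could be partly automated. The following is a rough sketch, using only Python's standard `html.parser`, of detecting a robots `noindex` directive and a `rel="canonical"` link in a page's HTML. The class name, the `scan` helper, and the sample page are hypothetical; fetching the pages is left out.

```python
from html.parser import HTMLParser


class HeadTagScanner(HTMLParser):
    """Records whether a page sets robots noindex and which canonical URL it names."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "noindex" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")


def scan(html_text):
    """Return (noindex, canonical_url) for the given HTML text."""
    scanner = HeadTagScanner()
    scanner.feed(html_text)
    return scanner.noindex, scanner.canonical


# Hypothetical page source, for illustration only.
PAGE = (
    '<html><head><meta name="robots" content="noindex">'
    '<link rel="canonical" href="https://docs.rs/example/latest/example/">'
    "</head></html>"
)
print(scan(PAGE))
```

Running `scan` over each self-hosted documentation URL would show how many of them are themselves excluded from search indexes.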

syphar (Member) commented on Jul 20, 2022

> Also, checking one of the examples above, https://tracing.rs has <meta name="robots" content="noindex"> in its source. Presumably the intent is that people should find the docs.rs page on search rather than the prerelease docs on https://tracing.rs. If the tracing crate set its documentation URL to https://tracing.rs (it doesn't), we would wind up in the tricky situation where neither https://tracing.rs nor https://docs.rs/tracing showed up in search indexes.
>
> Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding noindex on docs.rs any time there is a documentation URL may be too aggressive.

If we don't want to exclude the docs.rs pages from Google even when a crate has a documentation URL, then we should also return a normal canonical URL pointing to our latest version, right?

This would only be problematic if the documentation URL also points to rustdoc content, wouldn't it?

syphar (Member) commented on Jul 20, 2022

So Google can choose between the versioned pages on docs.rs?

jsha (Contributor) commented on Aug 9, 2022

An update from #1438 (comment):

There are a fairly large number of pages that are not getting the /latest/ treatment in Google's index because they have a documentation URL that points somewhere other than docs.rs, which means they don't have <link rel="canonical"> (as I proposed above, and implemented). One particularly notable effect is that an older version of a crate can have a non-docs.rs URL, which will mean the older version doesn't get <link rel="canonical">, and may itself get incorrectly selected as canonical.

I think the simplest solution is to apply <link rel="canonical"> to all versions of all crates, and not make an exception for crates that have their own doc URL. Here's my reasoning:

For crates that have a self-hosted doc URL, we can't just point rel="canonical" at that doc URL. The doc URL could contain versioned URLs (like docs.rs); it could contain unversioned URLs (if only one version is hosted at a time); or it could be generic high-level documentation that doesn't match URLs one-for-one. Without knowing which, we'll get it wrong a good chunk of the time.

Given that, there's nothing we can do unilaterally to boost the ranking of that doc URL. Instead, we should make a mechanism available for crate authors to say "I would prefer my self-hosted documentation to show up on Google instead of docs.rs' documentation." For instance, we could provide a package.metadata.docs.rs field, something like noindex = true. When that field is present on the latest version of a crate, docs.rs would render all versions of that crate with <meta name="robots" content="noindex">. Or we could use the package.metadata.docs.rs.canonical-url field that @pietroalbini proposed in 2020. Either way I think we need to treat the latest version of this metadata field as affecting all versions of the crate. And I think we should skip the banner.

Assuming folks here agree with that conclusion, we can uncouple the two issues: fixing canonicalization within docs.rs, and offering a noindex option so crates can choose to boost their off-docs.rs documentation.
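As a rough illustration of the two uncoupled pieces, the sketch below builds the extra `<head>` tags for one rendered doc page. The function name, its parameters, and the `latest_noindex` flag are hypothetical stand-ins for whatever metadata field is eventually chosen. Whether a noindex page should also carry the canonical link is not settled in this thread; emitting only noindex in that case is an assumption.

```python
import html


def head_tags(crate, page_path, latest_noindex=False):
    """Return extra <head> tags for one rendered doc page.

    crate: crate name as it appears in the URL.
    page_path: path below the version segment, e.g. "axum/body/index.html".
    latest_noindex: hypothetical opt-in read from the crate's latest release.
    """
    if latest_noindex:
        # Assumption: the opt-in applies to every version and replaces the canonical link.
        return ['<meta name="robots" content="noindex">']
    canonical = f"https://docs.rs/{crate}/latest/{page_path}"
    return [f'<link rel="canonical" href="{html.escape(canonical, quote=True)}">']


print(head_tags("axum", "axum/body/index.html"))
print(head_tags("axum", "axum/body/index.html", latest_noindex=True))
```

Because the tags are computed when the page is served rather than when the docs are built, the same function covers old releases too.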

jsha (Contributor) commented on Aug 31, 2022

With #1792 released, instances of "Duplicate without user-selected canonical" have decreased, and nearly all of them are of the form https://docs.rs/crate/cargo-bump/1.1.0. In other words crate pages. We should give the same <link rel="canonical"> treatment to crate pages, though it's not as crucial as for doc pages, since almost no-one goes to crate pages from search (4 out of 1000 top pages visited from search, according to Google Search Console).

syphar (Member) commented on Sep 1, 2022

This is awesome! thank you for driving this forward.

My feeling would be that we should add the canonical URL to these crate pages too, for the sake of completeness, and then close this issue.

jsha (Contributor) commented on Sep 16, 2022

> nearly all of them are of the form https://docs.rs/crate/cargo-bump/1.1.0. In other words crate pages.

In #1829, with some help, I realized the common factor for these remaining duplicates is not just that they are crate pages. They are crate pages for binary crates, which have no docs. And Google is considering them duplicates of, e.g. https://docs.rs/cargo-bump, since there's a 302 (temporary) redirect from https://docs.rs/cargo-bump/ to https://docs.rs/crate/cargo-bump/1.1.0. We don't want that to change, since that redirect could change in the future if the binary crate adds docs. So this subset of URLs will just trigger duplicate detection forever, which is fine.

Spot-checking, about 50% of recently crawled duplicate URLs fall in that category, while about 45% fall in the category that will be fixed by #1829.

Meanwhile, for the overall problem, this graph is encouraging:

[image: graph of "Duplicate without user-selected canonical" URLs over time]

It shows "Duplicate without user-selected canonical" going from 1,158,432 to 862,124 over the course of about 85 days, for a drop of 296,308 URLs, or about 3,485 per day. At this rate it will take about 247 days to go to zero, although in reality it will flatten out at some non-zero level eventually.
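The arithmetic behind those figures can be reproduced as follows. This is only a linear extrapolation from the numbers quoted above, with the per-day rate rounded down to match the comment.

```python
start = 1_158_432  # duplicates at the start of the window
end = 862_124      # duplicates at the end of the window
days = 85          # approximate length of the window

drop = start - end            # total decrease over the window
per_day = drop // days        # whole URLs per day, rounded down
days_to_zero = end / per_day  # linear projection from the current level

print(drop, per_day, round(days_to_zero))
# prints: 296308 3485 247
```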

According to another page on the Search Console we get about 85k crawls per day, of which 75% is refresh and 25% is discovery.

There's another report on the Search Console for "Indexed pages" - those that made it through duplicate detection and will show up in search results. Here's a sample of recently crawled indexed pages:

https://docs.rs/harfbuzz-sys/0.1.15/harfbuzz_sys/fn.hb_buffer_create.html
https://docs.rs/ux_serde/0.2.0/ux_serde/struct.i103.html
https://docs.rs/nom/5.1.2/nom/macro.flat_map.html
https://docs.rs/opentelemetry/0.16.0/src/opentelemetry/metrics/value_recorder.rs.html
https://docs.rs/druid/0.6.0/druid/struct.FileDialogOptions.html
https://docs.rs/cookie/0.13.1/src/cookie/draft.rs.html
https://docs.rs/medea/latest/medea/
https://docs.rs/crate/gtk/0.4.0
https://docs.rs/seahorse/0.7.1/seahorse/struct.Command.html
https://docs.rs/winapi/0.3.7/winapi/um/wincrypt/constant.CERT_NOT_BEFORE_FILETIME_PROP_ID.html
https://docs.rs/ibm_db/0.1.6/?search=_IMAGE_THUNK_DATA64
https://docs.rs/axum/latest/axum/body/index.html

As you can see, a surprising number of versioned URLs are still getting indexed instead of dup'ed out by the canonical tag (659 out of 1000). Inspecting these URLs in the search console shows that Google is aware of the canonical tag but disregarded it and considers the versioned URL canonical:

[image: Search Console URL inspection result]

The common factor in these is that they had a referring link from a versioned page. I suspect Google is weighting the existing links more heavily than the canonical tag when making the decision. Probably this effect will lessen with time as more of the existing pages are recrawled and canonicalized.

We might be able to speed up the process by bumping the <lastmod> tag in our sitemaps. Right now that date reflects the most recent build for the crate, and a lot of crates are very infrequently built, which leads Google not to crawl their docs. We could bring all the <lastmod> tags up to the date we added canonical tags.
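A minimal sketch of that idea, assuming each sitemap entry is built from a crate's last build date: take the later of the build date and the date canonical tags were added. The function name, the example URL, and both dates are hypothetical, and this is not the docs.rs sitemap code.

```python
import xml.etree.ElementTree as ET
from datetime import date


def sitemap_url_entry(loc, last_build, canonical_rollout):
    """Render one sitemap <url> element with a bumped <lastmod>."""
    # Never report a date earlier than the canonical-tag rollout.
    lastmod = max(last_build, canonical_rollout)
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(url, encoding="unicode")


# Hypothetical values: a crate last built in 2019, rollout date chosen for illustration.
entry = sitemap_url_entry(
    "https://docs.rs/example/latest/example/",
    last_build=date(2019, 5, 4),
    canonical_rollout=date(2022, 8, 1),
)
print(entry)
```

Crates built after the rollout keep their real build date, so only stale entries are bumped.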

jsha (Contributor) commented on Mar 14, 2023

I've closed out #1438. For duplicates within docs.rs, we solved the problem by setting noindex on outdated versions.

I'm also closing out this issue. Based on the discussion above, automatically setting a <link rel="canonical"> based on a crate's documentation field won't work, because we don't know what format the documentation is in and how to map individual pages to it. If we want to solve the original problem (docs.rs causes self-hosted docs to not appear in search), we'll do it by adding an explicit mechanism for crates to indicate noindex for their pages on docs.rs.
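For readers who want a picture of the "noindex on outdated versions" rule, here is a simplified sketch. It assumes plain numeric MAJOR.MINOR.PATCH version strings and ignores pre-releases, yanked releases, and build metadata; the helper names are hypothetical and this is not the docs.rs implementation.

```python
def parse_version(version):
    """Parse a plain numeric version like "0.10.1" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def noindex_versions(versions):
    """Map each version to True if its pages should carry noindex (i.e. it is outdated)."""
    latest = max(versions, key=parse_version)
    return {version: version != latest for version in versions}


print(noindex_versions(["0.1.15", "0.2.0", "0.10.1"]))
# prints: {'0.1.15': True, '0.2.0': True, '0.10.1': False}
```

Comparing parsed tuples rather than raw strings matters here, since "0.10.1" sorts before "0.2.0" as text.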

andrewtj commented on Mar 17, 2023

> If we want to solve the original problem (docs.rs causes self-hosted docs to not appear in search), we'll do it by adding an explicit mechanism for crates to indicate noindex for their pages on docs.rs.

Would including a snippet via rustdoc's --html-in-header feature work?

Cargo.toml:

[package.metadata.docs.rs]
rustdoc-args = ["--html-in-header", "noindex.html"]

noindex.html:

<meta name="robots" content="noindex">
Participants

@briansmith @andrewtj @jsha @onur @syphar

Add `rel=canonical` support · Issue #74 · rust-lang/docs.rs