Closed
Description
Please add the correct <link rel=canonical> link to each generated documentation page, when the crate provides a canonical documentation URL. This will help search engines disambiguate the docs.rs copies of the documentation and avoid docs.rs looking like a content farm or spam blog to search engines.
onur commented on Oct 20, 2016
Thanks for this awesome idea. I didn't know about rel=canonical. I'll definitely add this, but I need to save original documentation links first. That is the only thing I forgot to save when building crates.
Save original documentation url
jyn514 commented on Nov 27, 2019
@GuillaumeGomez is this something that should be done by docs.rs? It seems pretty easy to implement but I think it would make more sense to have a rel=canonical link in every page generated with rustdoc, not just the ones on docs.rs.
GuillaumeGomez commented on Nov 27, 2019
If I understand correctly, it's supposed to be used in the <head> part. However, I'm not sure if this'll be really useful here considering that all URLs are unique in our case... Also, the URL needs to include the domain name, so if we want to add it, I think it should be done on docs.rs side.
jyn514 commented on Nov 27, 2019
This is (correct me if I'm wrong) for documentation URLs that are explicitly given in Cargo.toml, not for different versions of the docs. See for example https://github.com/serde-rs/serde/blob/master/serde/Cargo.toml#L9. The domain name is given here, it's not necessarily the same site that's currently hosting the docs.
GuillaumeGomez commented on Nov 27, 2019
Oh in this case it makes sense to have canonical then I guess. But still on docs.rs side. :)
workingjubilee commented on Aug 27, 2020
P-higher than medium maybe but not existentially critical so not high (I mean I guess that's the important label?), https://docs.rs/tracing/0.1.19/tracing existing punishes e.g. https://tracing.rs/tracing. This needs to be resolved sooner rather than later.
jsha commented on Jul 14, 2022
As noted in #1438, lack of rel="canonical" seems to be hurting search results within docs.rs too. Given a plethora of identical content across various releases of a crate, Google chooses one effectively at random, which can cause a page from a newer version of a crate to be excluded in favor of that same page from an older version of a crate. I have a two-part proposal:
1. For crates with no package.documentation, or where package.documentation starts with https://docs.rs/, set rel="canonical" to the equivalent page on docs.rs with /latest/ in the URL. For example https://docs.rs/regex/1.5.5/regex/struct.CaptureLocations.html would point to https://docs.rs/regex/latest/regex/struct.CaptureLocations.html.
2. For crates with a package.documentation that does not start with https://docs.rs, set no rel="canonical", and set <meta name="robots" content="noindex"> (https://developers.google.com/search/docs/advanced/crawling/block-indexing). This should prevent the docs.rs documentation from competing with self-hosted documentation for canonical status, while leaving the docs.rs pages available for users who navigate there directly.
Whichever tag we emit, we would do it in docs.rs at page load time, not in rustdoc and not at doc build time. This allows us to include the tags even for old releases, and to remove them easily if they turn out not to have the effect we want.
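A minimal sketch of the two-part rule above, written as standalone Rust rather than actual docs.rs code. The names HeadTag, head_tag and render are hypothetical, and the bare "https://docs.rs" prefix check is a simplification of the condition described in the proposal.

```rust
/// Tag that would be emitted in <head> under the two-part proposal.
/// (Hypothetical type, not taken from the docs.rs codebase.)
#[derive(Debug, PartialEq)]
enum HeadTag {
    /// <link rel="canonical"> pointing at the /latest/ page on docs.rs.
    Canonical(String),
    /// <meta name="robots" content="noindex">, with no canonical link.
    NoIndex,
}

/// `documentation` is the crate's package.documentation value, if any.
/// `inner_path` is the part of the URL after the version segment,
/// e.g. "regex/struct.CaptureLocations.html".
fn head_tag(documentation: Option<&str>, krate: &str, inner_path: &str) -> HeadTag {
    // Simplified check: anything that does not start with this prefix
    // is treated as self-hosted documentation.
    let self_hosted = match documentation {
        Some(url) => !url.starts_with("https://docs.rs"),
        None => false,
    };
    if self_hosted {
        HeadTag::NoIndex
    } else {
        HeadTag::Canonical(format!("https://docs.rs/{}/latest/{}", krate, inner_path))
    }
}

/// Renders the tag as the HTML that would be added at page load time.
fn render(tag: &HeadTag) -> String {
    match tag {
        HeadTag::Canonical(url) => format!(r#"<link rel="canonical" href="{}">"#, url),
        HeadTag::NoIndex => r#"<meta name="robots" content="noindex">"#.to_string(),
    }
}
```

Because the decision depends only on the request and the stored documentation URL, it can be made per page load, which is what lets the tags apply to old releases and be removed later.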
jsha commented on Jul 20, 2022
An update on this: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls says:
Also, checking one of the examples above, https://tracing.rs has <meta name="robots" content="noindex"> in its source. Presumably the intent is that people should find the docs.rs page on search rather than the prerelease docs on https://tracing.rs. If the tracing crate set its documentation URL to https://tracing.rs (it doesn't), we would wind up in the tricky situation where neither https://tracing.rs nor https://docs.rs/tracing showed up in search indexes.
Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding noindex on docs.rs any time there is a documentation URL may be too aggressive.
syphar commented on Jul 20, 2022
When we don't want to exclude the docs.rs pages from google, even when we have a documentation-url, then we should also return a normal canonical URL to our latest version, right?
This would only be problematic if documentation-url also points to rustdoc content, would it?
syphar commented on Jul 20, 2022
so google can choose between the versioned pages on docs.rs?
jsha commented on Aug 9, 2022
An update from #1438 (comment):
There are a fairly large number of pages that are not getting the /latest/ treatment in Google's index because they have a documentation URL that points somewhere other than docs.rs, which means they don't have <link rel="canonical"> (as I proposed above, and implemented). One particularly notable effect is that an older version of a crate can have a non-docs.rs URL, which will mean the older version doesn't get <link rel="canonical">, and may itself get incorrectly selected as canonical.
I think the simplest solution is to apply <link rel="canonical"> to all versions of all crates, and not make an exception for crates that have their own doc URL. Here's my reasoning:
For crates that have a self-hosted doc URL, we can't just point rel="canonical" at that doc URL. The doc URL could contain versioned URLs (like docs.rs); it could contain unversioned URLs (if only one version is hosted at a time); or it could be generic high-level documentation that doesn't match URLs one-for-one. Without knowing which, we'll get it wrong a good chunk of the time.
Given that, there's nothing we can do unilaterally to boost the ranking of that doc URL. Instead, we should make a mechanism available for crate authors to say "I would prefer my self-hosted documentation to show up on Google instead of docs.rs' documentation." For instance, we could provide a package.metadata.docs.rs field, something like noindex = true. When that field is present on the latest version of a crate, docs.rs would render all versions of that crate with <meta name="robots" content="noindex">. Or we could use the package.metadata.docs.rs.canonical-url field that @pietroalbini proposed in 2020. Either way I think we need to treat the latest version of this metadata field as affecting all versions of the crate. And I think we should skip the banner.
Assuming folks here agree with that conclusion, we can uncouple the two issues: fixing canonicalization within docs.rs, and offering a noindex option so crates can choose to boost their off-docs.rs documentation.
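To illustrate "the latest version of this metadata field affects all versions", here is a rough sketch under stated assumptions: the Release struct and head_tags function are hypothetical, the noindex flag stands in for the proposed (not existing) noindex = true key, and releases are assumed to be ordered oldest to newest.

```rust
/// Hypothetical view of one published release. `noindex` models the
/// proposed `noindex = true` key under package.metadata.docs.rs.
struct Release {
    version: &'static str,
    noindex: bool,
}

/// Returns (page URL, head tag) for the same page in every release.
/// `releases` is assumed to be ordered oldest to newest.
fn head_tags(releases: &[Release], krate: &str, inner_path: &str) -> Vec<(String, String)> {
    // Only the newest release's flag is consulted; older releases'
    // own flags are ignored on purpose.
    let latest_noindex = releases.last().map_or(false, |r| r.noindex);

    let tag = if latest_noindex {
        r#"<meta name="robots" content="noindex">"#.to_string()
    } else {
        // No exception for crates with their own doc URL: every
        // version points at the /latest/ page on docs.rs.
        format!(
            r#"<link rel="canonical" href="https://docs.rs/{}/latest/{}">"#,
            krate, inner_path
        )
    };

    releases
        .iter()
        .map(|r| {
            let page = format!("https://docs.rs/{}/{}/{}", krate, r.version, inner_path);
            (page, tag.clone())
        })
        .collect()
}
```

With this shape, an author who opts in on a new release changes how every older release is rendered, without any rebuilds.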
jsha commented on Aug 31, 2022
With #1792 released, instances of "Duplicate without user-selected canonical" have decreased, and nearly all of them are of the form https://docs.rs/crate/cargo-bump/1.1.0. In other words, crate pages. We should give the same <link rel="canonical"> treatment to crate pages, though it's not as crucial as for doc pages, since almost no-one goes to crate pages from search (4 out of 1000 top pages visited from search, according to Google Search Console).
syphar commented on Sep 1, 2022
This is awesome! thank you for driving this forward.
My feeling would be that we should add the canonical url to these crate-pages too, for the sake of completeness, and then close this issue.
jsha commented on Sep 16, 2022
In #1829, with some help, I realized the common factor for these remaining duplicates is not just that they are crate pages. They are crate pages for binary crates, which have no docs. And Google is considering them duplicates of, e.g. https://docs.rs/cargo-bump, since there's a 302 (temporary) redirect from https://docs.rs/cargo-bump/ to https://docs.rs/crate/cargo-bump/1.1.0. We don't want that to change, since that redirect could change in the future if the binary crate adds docs. So this subset of URLs will just trigger duplicate detection forever, which is fine.
Spot-checking, about 50% of recently crawled duplicate URLs fall in that category, while about 45% fall in the category that will be fixed by #1829.
Meanwhile, for the overall problem, this graph is encouraging:
It shows "Duplicate without user-selected canonical" going from 1,158,432 to 862,124 over the course of about 85 days, for a drop of 296,308 URLs or about 3,485 per day. At this rate it will take about 247 days to go to zero, although in reality it will flatten out at some non-zero level eventually.
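The estimate above can be reproduced with a few lines of integer arithmetic. This is only a sketch of the linear extrapolation; the function name days_to_zero is made up for illustration, and integer division truncates, which matches the 3,485 and 247 figures quoted.

```rust
/// Linear extrapolation of the duplicate count.
/// Returns (total drop, drop per day, days until zero at that rate).
/// Assumes start > end and that at least one URL drops per day.
fn days_to_zero(start: u64, end: u64, elapsed_days: u64) -> (u64, u64, u64) {
    let drop = start - end;
    // Integer division truncates toward zero.
    let per_day = drop / elapsed_days;
    let remaining_days = end / per_day;
    (drop, per_day, remaining_days)
}
```

Calling days_to_zero(1_158_432, 862_124, 85) yields (296_308, 3_485, 247).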
According to another page on the Search Console we get about 85k crawls per day, of which 75% is refresh and 25% is discovery.
There's another report on the Search Console for "Indexed pages" - those that made it through duplicate detection and will show up in search results. Here's a sample of recently crawled indexed pages:
https://docs.rs/harfbuzz-sys/0.1.15/harfbuzz_sys/fn.hb_buffer_create.html
https://docs.rs/ux_serde/0.2.0/ux_serde/struct.i103.html
https://docs.rs/nom/5.1.2/nom/macro.flat_map.html
https://docs.rs/opentelemetry/0.16.0/src/opentelemetry/metrics/value_recorder.rs.html
https://docs.rs/druid/0.6.0/druid/struct.FileDialogOptions.html
https://docs.rs/cookie/0.13.1/src/cookie/draft.rs.html
https://docs.rs/medea/latest/medea/
https://docs.rs/crate/gtk/0.4.0
https://docs.rs/seahorse/0.7.1/seahorse/struct.Command.html
https://docs.rs/winapi/0.3.7/winapi/um/wincrypt/constant.CERT_NOT_BEFORE_FILETIME_PROP_ID.html
https://docs.rs/ibm_db/0.1.6/?search=_IMAGE_THUNK_DATA64
https://docs.rs/axum/latest/axum/body/index.html
As you can see, a surprising number of versioned URLs are still getting indexed instead of dup'ed out by the canonical tag (659 out of 1000). Inspecting these URLs in the search console shows that Google is aware of the canonical tag but disregarded it and considers the versioned URL canonical:
The common factor in these is that they had a referring link from a versioned page. I suspect Google is weighting the existing links more heavily than the canonical tag when making the decision. Probably this effect will lessen with time as more of the existing pages are recrawled and canonicalized.
We might be able to speed up the process by bumping the <lastmod> tag in our sitemaps. Right now that date reflects the most recent build for the crate, and a lot of crates are very infrequently built, which leads Google not to crawl their docs. We could bring all the <lastmod> tags up to the date we added canonical tags.
jsha commented on Mar 14, 2023
I've closed out #1438. For duplicates within docs.rs, we solved the problem by setting noindex on outdated versions.
I'm also closing out this issue. Based on the discussion above, automatically setting a <link rel="canonical"> based on a crate's documentation field won't work, because we don't know what format the documentation is in and how to map individual pages to it. If we want to solve the original problem (docs.rs causes self-hosted docs to not appear in search), we'll do it by adding an explicit mechanism for crates to indicate noindex for their pages on docs.rs.
andrewtj commented on Mar 17, 2023
Would including a snippet via rustdoc's --html-in-header feature work?
Cargo.toml:
noindex.html: