Set user agent to avoid errors in link_checker and load_data #950

@lukehsiao

Bug Report

zola check currently reports errors for links whose server returns an error status (e.g., 403, 400) when the request headers contain no user agent. This follows from the current implementation, since link_checker does not set one. Can we allow the link checker to set a user agent, and/or provide a default zola user agent?

Environment

Ubuntu 18.04.4

Zola version:
v0.10.0

Expected Behavior

Some servers return errors when the user agent header is missing. For example, when running the link_checker on a URL such as https://arxiv.org/abs/1906.01113, the link_checker will report a 403 and declare this as a dead link. This can be seen using an example test case:

components/link_checker/src/lib.rs

    #[test]
    fn user_agent_test() {
        let url = "https://arxiv.org/abs/1906.01113";

        let res = check_url(url, &LinkChecker::default());
        assert!(res.is_valid());
        assert!(res.code.is_some());
        assert!(res.error.is_none());
    }

This same test case will pass if a user agent is included, e.g.:

pub fn check_url(url: &str, config: &LinkChecker) -> LinkResult {
    {
        let guard = LINKS.read().unwrap();
        if let Some(res) = guard.get(url) {
            return res.clone();
        }
    }

    let mut headers = HeaderMap::new();
    headers.insert(ACCEPT, "text/html".parse().unwrap());
    headers.append(ACCEPT, "*/*".parse().unwrap());
    headers.append(USER_AGENT, "zola/0.10.0 link_checker".parse().unwrap());
...

Without a USER_AGENT, the test will fail.

We could mitigate this issue by:

  • Setting a default user agent (such as the zola-specific user agent shown above)
  • Allowing users to specify a user-agent via some configuration
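
The two mitigations above could be combined: a user-supplied value takes precedence, and a built-in default applies otherwise. The following is only a minimal sketch under stated assumptions. `LinkCheckerConfig`, its `user_agent` field, and both function names are hypothetical and do not exist in zola. It uses only the standard library, so it is independent of reqwest.

```rust
/// Hypothetical configuration for the link checker. The `user_agent`
/// option is an assumption for illustration, not an existing zola setting.
#[derive(Debug, Default, Clone)]
pub struct LinkCheckerConfig {
    /// User agent supplied by the site author, if any.
    pub user_agent: Option<String>,
}

/// Builds a default user agent of the form "<name>/<version>".
/// `option_env!` is used so the sketch also compiles outside of Cargo;
/// the fallback values "zola" and "unknown" are placeholders.
pub fn default_user_agent() -> String {
    let name = option_env!("CARGO_PKG_NAME").unwrap_or("zola");
    let version = option_env!("CARGO_PKG_VERSION").unwrap_or("unknown");
    format!("{}/{}", name, version)
}

impl LinkCheckerConfig {
    /// Returns the configured user agent (trimmed), or the default when
    /// the option is unset or contains only whitespace.
    pub fn effective_user_agent(&self) -> String {
        match self.user_agent.as_deref().map(str::trim) {
            Some(ua) if !ua.is_empty() => ua.to_string(),
            _ => default_user_agent(),
        }
    }
}
```

The string returned by `effective_user_agent` could then be parsed into the `USER_AGENT` header value in `check_url`, in place of the literal shown earlier.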

For a default user agent, a hard-coded string is probably undesirable. Instead, we could follow the reqwest example, which builds the string from the package name and version at compile time:

https://docs.rs/reqwest/0.10.1/reqwest/struct.ClientBuilder.html#method.user_agent

// Name your user agent after your app?
static APP_USER_AGENT: &str = concat!(
    env!("CARGO_PKG_NAME"),
    "/",
    env!("CARGO_PKG_VERSION"),
);

let client = reqwest::Client::builder()
    .user_agent(APP_USER_AGENT)
    .build()?;

Some other example URLs which return 400/403s without a user agent:

Labels: done in pr (Already done in a PR)
