Bug report
The urllib.robotparser module implements an unofficial standard originally specified in http://www.robotstxt.org/orig.html, with some additions: it supports "allow" rules in addition to "disallow", and the additional fields "crawl-delay", "request-rate" and "sitemap". Real-world use of robots.txt files differs significantly from the original specification. The new standard, RFC 9309, was published in 2022, but its drafts had been used as a de facto standard for many years before that. There are several open issues about the module's inconsistency with current practice. These can be addressed separately, but to finally resolve the problem, we need to implement support for RFC 9309. I consider this a bug fix rather than a feature request, because incorrect handling of robots.txt files can make Python code that uses robotparser behave maliciously.
See also https://discuss.python.org/t/about-robotparser/103683
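As one illustration of the kind of inconsistency meant here, the sketch below compares the answer from urllib.robotparser with a simplified version of the RFC 9309 precedence rule (the longest matching rule wins, and a tie goes to "allow"). The helper `longest_match_allowed`, the `ExampleBot` user agent, and the example.com URL are hypothetical names used only for this illustration. The helper ignores wildcards and percent-encoding, so it is not a full RFC 9309 implementation. The robotparser result may depend on the Python version, so it is printed rather than assumed.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /folder
Allow: /folder/public
"""

# The same rules as (is_allow, path_prefix) pairs for the sketch below.
RULES = [(False, "/folder"), (True, "/folder/public")]


def longest_match_allowed(rules, path):
    """Simplified RFC 9309 precedence (hypothetical helper).

    The longest matching prefix wins; on equal length, "allow" wins.
    No wildcard ("*", "$") or percent-encoding handling.
    """
    best = None
    for is_allow, prefix in rules:
        if prefix and path.startswith(prefix):
            key = (len(prefix), is_allow)
            if best is None or key > best:
                best = key
    # No matching rule means the path is allowed.
    return True if best is None else best[1]


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parses in memory, no network access

stdlib_answer = parser.can_fetch(
    "ExampleBot", "https://example.com/folder/public/page.html"
)
rfc_answer = longest_match_allowed(RULES, "/folder/public/page.html")

print("urllib.robotparser:", stdlib_answer)
print("longest-match sketch:", rfc_answer)
```

If the two printed values differ, a crawler relying on the module would treat this URL differently from one that follows the longest-match rule.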
Linked PRs