Date: Wed, 3 Apr 2024 16:58:11 -0400
From: Konstantin Ryabitsev
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: sample robots.txt to reduce WWW load
Message-ID: <20240403-able-meticulous-narwhal-aeea54@lemur>
References: <20240401132145.M567778@dcvr>
In-Reply-To: <20240401132145.M567778@dcvr>

On Mon, Apr 01, 2024 at 01:21:45PM +0000, Eric Wong wrote:
> Performance is still slow, and crawler traffic patterns tend to
> do bad things with caches at all levels, so I've regretfully had
> to experiment with robots.txt to mitigate performance problems.

This has been a source of grief for us, because aggressive bots don't
appear to pay any attention to robots.txt, and they fudge their
user-agent strings to pretend to be regular browsers. I am dealing with
one that is hammering us from China Mobile IP ranges and is currently
trying to download every possible snapshot of torvalds/linux, while
pretending to be various versions of Chrome.

So, while I welcome having a robots.txt recommendation, it kinda assumes
that robots will actually play nice and won't try to suck down as much
as possible as quickly as possible for training some LLM-du-jour.

/end rant

-K
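
[Editorial note: a minimal robots.txt sketch along the lines Eric
describes might look like the following. The paths and delay value are
illustrative assumptions for a generic archive frontend, not the actual
rules from the referenced message, and as Konstantin points out above,
this only helps with crawlers that actually fetch and honor robots.txt.]

    # Illustrative sketch only -- endpoint paths are assumptions,
    # not the rules from the referenced message.
    User-agent: *
    # Crawl-delay is non-standard but honored by some crawlers.
    Crawl-delay: 10
    # Keep well-behaved crawlers away from hypothetical expensive
    # dynamic endpoints (search queries, bulk downloads):
    Disallow: /*?q=
    Disallow: /*/t.mbox.gz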