From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.0 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,T_DKIM_INVALID, T_RP_MATCHES_RCVD shortcircuit=no autolearn=no autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 9602B1F404 for ; Sun, 4 Feb 2018 04:43:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750798AbeBDEnC (ORCPT ); Sat, 3 Feb 2018 23:43:02 -0500 Received: from mail-qt0-f175.google.com ([209.85.216.175]:33403 "EHLO mail-qt0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750744AbeBDEnA (ORCPT ); Sat, 3 Feb 2018 23:43:00 -0500 Received: by mail-qt0-f175.google.com with SMTP id d8so35812833qtm.0 for ; Sat, 03 Feb 2018 20:43:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-transfer-encoding; bh=k2ILD/ohfil8uitDGnjMEdBPu3fRO16g2WEDnHG4Uzo=; b=FZiRq5++cC043V1/ML+zdolUKGqUcn+W/hTiGslkyCuMSj2LrbPIMFd/KgpgO+OEtn GW+jfLXh5VV9dscw83rQgVYpqCagnVIGadK/bXIGvtzLddnSgfLJMX1mgo/Y+P2Cajn3 uURVW7ZtIsv3y8BcCsDmri3V5E4ncJjE7Y6VCOA7gdwg/lTE1jsecg9VFwyQARt+qJwF X9XFYS+umLBM4COlfJvA0JavuXXMazxrQCaoLehDE2S5/H6pzZzpa+P965Lex8VlLWKW KczH1brqrrgIOBgMyHWQkfq/qrFAaQBw+6klZ0JBYLLwVcObO/84K0yksmBE9rLhlLkl ukMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-transfer-encoding; bh=k2ILD/ohfil8uitDGnjMEdBPu3fRO16g2WEDnHG4Uzo=; b=bCagQtNSa6lP+coM7kZP2Xl224rGpjRFGEfhLDDKf+KBQfvxGifHrFCHZ8lrx9u8b1 fZfJWW6wLicDR58t7mjUzrhTAJkgExPz0P3UVCyqGP42GqqRtqEtTr6ywJIO1WdU4XgU P/pLA2qd/i8vLN1/4+8UpDnHogHvyuP7K7CwLoMFQQbky/TIU5wD4HTKBJSDuuTPQc20 9svFWI7gwAK4VaZWwKtfERYwbGbM4HuCAzSrE2ju0tYY+9d+n9WVG7FgaWVdsf2uCtnc FPO8CMFCLTQjAr3zsW8Of1S6lkOrn44XSXH/nrRU6AgB5krOPr4lLEOUSXFn7YA6fmgC POXg== X-Gm-Message-State: APf1xPAG6v3Imu8qDNATgxudmzdNAn7YEIiRBPNRd1u0rLUvWD5ZBetU R/CcmDqoIhCITlh6BSH5hSJxp+He6jqyGUFK328= X-Google-Smtp-Source: AH8x225O00B4pu2p6lY5BJF0rUsKSC96QhndDMj1apKMui7Yh4PMssiB0cZPhY78c/KcLDjkybszjT/Dc5IKb0Bwjnw= X-Received: by 10.200.15.194 with SMTP id f2mr5004628qtk.237.1517719379566; Sat, 03 Feb 2018 20:42:59 -0800 (PST) MIME-Version: 1.0 Received: by 10.12.175.239 with HTTP; Sat, 3 Feb 2018 20:42:58 -0800 (PST) In-Reply-To: References: <20180130232533.25846-1-newren@gmail.com> <20180130232533.25846-20-newren@gmail.com> From: Eric Sunshine Date: Sat, 3 Feb 2018 23:42:58 -0500 X-Google-Sender-Auth: 9nzSZonL3evnclNYSg5WODIhstk Message-ID: Subject: Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames() To: Elijah Newren Cc: Stefan Beller , Junio C Hamano , git , =?UTF-8?Q?SZEDER_G=C3=A1bor?= , Jonathan Nieder , Jeff King Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren wrote: > On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren wrote: >> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller wrote= : >>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren wrote= : >>>> + while (*--end_of_new =3D=3D *--end_of_old && >>>> + end_of_old !=3D old_path && >>>> + end_of_new !=3D new_path) >>>> + ; /* Do nothing; all in the while loop */ >>> >>> Assuming many repos are UTF8 (including in their paths), >>> how does this work with display characters longer than one char? >>> It should be fine as we cut at the slash? >> >> Can UTF-8 characters, other than '/', have a byte whose value matches >> (unsigned char)('/')? If so, then I'll need to figure out how to do >> utf-8 character parsing. Anyone have pointers? > > Well, after digging around for a while, I found this claim on the > Wikipedia page for UTF-8: > > Since ASCII bytes do not occur when encoding non-ASCII code points > into UTF-8, UTF-8 is safe to use within most programming and document > languages that interpret certain ASCII characters in a special way, > such as "/" in filenames, "\" in escape sequences, and "%" in printf. > > So, unless I'm reading something wrong here, I think that means this > code is just fine as it is. You're reading it correctly. Unicode values greater than \U007f encoded with UTF-8 will never contain bytes which can be confused with any 7-bit ASCII character. It's possible that Stefan was thinking of "combining characters"[1] which may be "precomposed" and "decomposed"[2], but which appear the same when rendered. For instance, "=C3=B6" might be a single Unicode codepoint or two codepoints, such as "o" combined with a diaeresis. It's a potential problem if you're comparing, byte by byte, two filenames which look the same. However, Git takes pains[3] to avoid this problem by ensuring (if possible) that filenames are precomposed within Git even if they happen to be decomposed on the actual filesystem. So, most likely, your code is okay as-is. [1]: https://en.wikipedia.org/wiki/Combining_character [2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic) [3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c