From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.5 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,T_RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 60B0E20C11 for ; Sun, 3 Dec 2017 18:48:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751823AbdLCSsH (ORCPT ); Sun, 3 Dec 2017 13:48:07 -0500 Received: from mail-wr0-f175.google.com ([209.85.128.175]:46454 "EHLO mail-wr0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751390AbdLCSsF (ORCPT ); Sun, 3 Dec 2017 13:48:05 -0500 Received: by mail-wr0-f175.google.com with SMTP id x49so14948083wrb.13 for ; Sun, 03 Dec 2017 10:48:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=e9z62o+fXzX5jpdmazJEo4P1hOMu1m9lrr/8iVxvE4s=; b=Bj5Q+pWy7cU4Np0bl6YmxlqN+oniaNoVoqFTDBS0KRQp+tlpFUTQ+evm0tcV4Oddkr vnVfG7VdmZIstDEd1QdviZ4nxE6fObHru8NOy81+wXILcWD7Q8xq/rNFbK/CRLfFH6yG P5fzn2kQZgPptNzPZHikXcsaSWb1j7qGtvjlklWXLg7dqgZwlbpa5F2LXOCu9jXTHjTt H4u6KjzPXQlMCHktUPMG+ABubx5Wv04HI9YBmN5qVyzNtRwkbaY4d2q9jLYeCxN8mJyN Cn5dvlh2N/9ykMmzCWYabSsaMQfNQ2N2WTdSkB56n65m35EIJrLDysstHOR1cNqWmjAS zKLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=e9z62o+fXzX5jpdmazJEo4P1hOMu1m9lrr/8iVxvE4s=; b=OcIrf9OXqu8OkaY60RhaS8jldGUjQyaQX4p1WUKom8Ds9jKN2ebM7IHGcLhsSeqexk GDFtOpEQbVThZEfxCyDgDZNG6eQtoZ0aP9XSrSQhhwFWiHnIr+DAUO8cK/UUA4eZmN4r XmXT7gmg75+1pm+TIp7EpDiTcVqsMlZPw2PUmz4M2QAtXkRdBARpUUOpS+nR9lBF9TyE sP/JwKh1r1psvLJBgVJnEsAbh1esXDSl95ToTMcOsSJCEzG9TjaMPbZluD+eJKN8yh5K 5xv8I71qU/7co4s3Ehj1cbmh3fnMqvDK8ZyNzSBtXQrsunpD5FiKdrvw3mXox6z2qbEo Ymcg== X-Gm-Message-State: AJaThX5Pk694i+x9jKRnkkOzqDGR6GEtjlGkEdiARflWMxZ7gvlKbHhv QFY8ONhMjImXBCZyhhbSco4= X-Google-Smtp-Source: AGs4zMahq32E63uXAEACeCc9w95XTXhYb9vzhfR99EdySliRHhI6z68AwwJosYFXRPEAg/l8J/7QnA== X-Received: by 10.223.189.5 with SMTP id j5mr10382527wrh.172.1512326884243; Sun, 03 Dec 2017 10:48:04 -0800 (PST) Received: from slxbook4.fritz.box (p5DDB6E6A.dip0.t-ipconnect.de. [93.219.110.106]) by smtp.gmail.com with ESMTPSA id l142sm6301207wmb.43.2017.12.03.10.48.02 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 03 Dec 2017 10:48:03 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: RFC: Native clean/smudge filter for UTF-16 files From: Lars Schneider In-Reply-To: <20171124180401.GB29190@sigill> Date: Sun, 3 Dec 2017 19:48:01 +0100 Cc: Git List , Junio C Hamano , Johannes Schindelin , Shawn Pearce , Jonathan Nieder , Jakub Narebski Content-Transfer-Encoding: quoted-printable Message-Id: <759F0C3A-8C46-4685-BB27-6D508B26BB49@gmail.com> References: <20171124180401.GB29190@sigill> To: Jeff King X-Mailer: Apple Mail (2.3124) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org > On 24 Nov 2017, at 19:04, Jeff King wrote: >=20 > On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote: >=20 >> Alternatively, I could add a native attribute to Git that translates = UTF-16=20 >> to UTF-8 and back. A conversion function is already available in = "mingw.h" [3] >> on Windows. Limiting this feature to Windows wouldn't be a problem = from my >> point of view as UTF-16 is only relevant on Windows anyways. The = attribute=20 >> could look like this: >>=20 >> *.txt text encoding=3Dutf-16 >>=20 >> There was a previous discussion on the topic and Jonathan already = suggested >> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" = attribute >> is already present but, as far as I can tell, is only used by the git = gui >> for viewing [5]. >=20 > I would not want to see a proliferation of built-in filters, but it > really seems like text-encoding conversion is a broad and practical = one > that many people might benefit from. So just like line-ending > conversion, which _could_ be done by generic filters, it makes sense = to > me to support it natively for speed and simplicity. >=20 >> Do you think a patch that converts UTF-16 files to UTF-8 via an = attribute >> "encoding=3Dutf-16" on Windows would have a chance to get accepted? >=20 > You haven't fully specified the semantics here, so let me sketch out > what I think it ought to look like: Thank you :-) > - declare utf8 the "canonical" in-repo representation, just as we have > declared LF for line endings (alternatively this could be > configurable, but if we can get away with declaring utf8 the one = true > encoding, that cuts out a lot of corner cases). 100% agree on UTF-8 as canonical representation and I wouldn't make=20 that configurable as it would open a can of worms. > - if core.convertEncoding is true, then for any file with an > encoding=3Dfoo attribute, internally run iconv(foo, utf8) in > convert_to_git(), and likewise iconv(utf8, foo) in > convert_to_working_tree. >=20 > - I'm not sure if core.convertEncoding should be enabled by default. = If > it's a noop as long as there's no encoding attribute, then it's > probably fine. But I would not want accidental conversion or any > slowdown for the common case that the user wants no conversion. I think we should mimic the behavior of "eol=3Dcrlf/lf" attribute. AFAIK whenever I set "*.ext text eol=3Dcrlf", then I can be sure the=20 file is checked out with CRLF independent of any of my local config settings. Isn't that correct? I would expect a similar behavior if "*.ext text encoding=3Dutf16" is set. Wouldn't that mean that we do not need a "core.convertEncoding" config? I do 100% agree that it must be a noop if there is no encoding=20 attribute present. > - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". = I > don't think people consistently have all utf-16 files (the way they > might have all CRLF files) rather a few files that must be utf-16. Agreed! > - I have actually seen two types of utf-16 in git repos in the wild: > files which really must be utf-16 (because some tool demands it) and > files which happen to be utf-16, but could just as easily be utf-8 > (and the user simply does not notice and commits utf-16, but doesn't > realize it until much later when their diffs are unreadable). >=20 > For the first case, the "encoding" thing above would work fine. For > the second case, in theory we could have an option that takes any > file with a "text" attribute and no "encoding" attribute, and > converts it to utf-8. TBH I would label that a "non-goal". Because AFAIK we cannot reliability detect the encoding automatically. > I suspect that's opening a can of worms for false positives similar > to core.autocrlf. And performance drops as we try to guess the > encoding and convert all incoming data. >=20 > So I mention it mostly as a direction I think we probably _don't_ > want to go. Anybody with the "this could have been utf-8 all along" > type of file can remedy it by converting and committing the result. 100 % agree. > Omitting all of the "we shouldn't do this" bullet points, it seems > pretty simple and sane to me. >=20 > There is one other approach, which is to really store utf-16 in the > repository and better teach the diff tools to handle it (which are > really the main thing in git that cares about looking into the blob > contents). You can do this already with a textconv filter, but: >=20 > 1. It's slow (though cacheable). >=20 > 2. It doesn't work unless each repo configures the filter (so not on > sites like GitHub, unless we define a micro-format that diff=3Dutf16= > should be textconv'd on display, and get all implementations to > respect that). Actually, rendering diffs on Git hosting sites such as GitHub is one of my goals. Therefore, storing content as UTF-16 wouldn't be a solution for me. > 3. Textconv patches look good, but can't be applied. This = occasionally > makes things awkward, depending on your workflow. TBH I dont't understand what you mean here. What do you mean with "textconv patches"? > 4. You have to actually mark each file with an attribute, which is > slightly annoying and more thing to remember to do (I see this = from > the server side, since people often commit utf-16 without any > attributes, and then get annoyed when they see the file marked as > binary). >=20 > We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and > doing an "auto-textconv" to utf-8 (for our human-readable diffs only, = of > course). That solves (1), (2), and (4), but leaves (3). I actually > looked into using libicu to do it not just for UTF-16, but to detect = any > encoding. It turned out to be really slow, though. :) >=20 > So anyway, that is an alternate strategy, but I think I like = "canonical > in-repo text is utf-8" approach a lot more, since then git operations > work consistently. There are still a few rough edges (e.g., I'm not = sure > if you could apply a utf-8 patch directly to a utf-16 working tree = file. > Certainly not using "patch", but I'm not sure how well "git apply" = would > handle that case either). But I think it would mostly Just Work as = long > as people were willing to set their encoding attributes. Agreed! Thanks for your thoughts! I'll try to come up with a v1 of this idea. - Lars=