From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.9 required=3.0 tests=AWL,BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id B33D01F453 for ; Tue, 22 Jan 2019 20:13:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726130AbfAVUNw (ORCPT ); Tue, 22 Jan 2019 15:13:52 -0500 Received: from mail-wr1-f66.google.com ([209.85.221.66]:38372 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726095AbfAVUNw (ORCPT ); Tue, 22 Jan 2019 15:13:52 -0500 Received: by mail-wr1-f66.google.com with SMTP id v13so28892647wrw.5 for ; Tue, 22 Jan 2019 12:13:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version; bh=UrDEz4DzqnMcw9VHCeYpPFSdEu84yU4qyRRHHreUvLM=; b=oPrDWpNO/v5PsIa/V85R9QQQkDnnGYbpfg6JTfJ4bRb/lDi0XqtTAh6OWTtjSreBoz sO64D5M8cuZIaY4VrC0AvfmJBf0MECPeCotSehm8Q663/FHTlSJW8dffqejig0fYdGBe /NNB2syNHTRmYicsVZTQN29TA/J36B5J0O7Ek3oPUV7Ul5j5MWC+o4Eh5Tn+Tk50g9vd SNx1fJ6vw/cBKNsY3fMauBxQgPRoZiA/OMMUjIrwL8173AI+pADUuP3QC3hTaqTQi+H8 TBCcrBmLjtKyyXtSC59b6I5OuhwHCdljiTx8TvkImIDddHnVEaGE64UySDzGx8REdABa Ra+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:from:to:cc:subject:references:date :in-reply-to:message-id:user-agent:mime-version; bh=UrDEz4DzqnMcw9VHCeYpPFSdEu84yU4qyRRHHreUvLM=; b=GFicxTJt6liZ+oBQOIbLPT7DGKb9h9EejnzDSCXwGh1A5hef+3psjbhTFSOTEfOcCW 6LBfLFGatATQQAc+0ln15xFquqlpspZ3V8httrsq4FAgTGi91ckKAmv9glribcb9yngp l+F76iWPCoWovxGpO51qw2ZZla7pbm+GAcm3zPwm40JhUpqfLtNH8+gsfKHBQSSoEe34 G30cZZTOOQVDCYTObANxSK7rRU0IouFilY1Gri780M0VmTtucLja1Hiy2jnt4UUUga6j BzG+rZppVGGR8eNsoOpeDmFwdZgWzZtenzWRYZPbGBoqNQlj/npYO1IuRANRLotexZmA TYhg== X-Gm-Message-State: AJcUukdZIW997dXbav5WXW13/iy0hqOs4WgsfBBNF+qmj82fZlgIBiWQ 4KdL42QAse9OqihncubvK2g= X-Google-Smtp-Source: ALg8bN4mslSSRvUGla37wsALmUwxQTZ+tSmstqiqjvenzmvRk8mko3sJ739iB+wZDOeANp/wuQ18qg== X-Received: by 2002:adf:9d4c:: with SMTP id o12mr32447093wre.94.1548188029736; Tue, 22 Jan 2019 12:13:49 -0800 (PST) Received: from localhost (112.68.155.104.bc.googleusercontent.com. [104.155.68.112]) by smtp.gmail.com with ESMTPSA id 198sm98048501wmt.36.2019.01.22.12.13.48 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 22 Jan 2019 12:13:49 -0800 (PST) From: Junio C Hamano To: tboegi@web.de Cc: git@vger.kernel.org, adrigibal@gmail.com Subject: Re: [PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM" References: <20190120164327.3234-1-tboegi@web.de> Date: Tue, 22 Jan 2019 12:13:48 -0800 In-Reply-To: <20190120164327.3234-1-tboegi@web.de> (tboegi's message of "Sun, 20 Jan 2019 17:43:27 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org tboegi@web.de writes: > The unicode standard itself defines 3 possible ways how to encode UTF-16. > a) UTF-16, without BOM, big endian: > b) UTF-16, with BOM, little endian: > c) UTF-16, with BOM, big endian: Is it OK to interpret "possible" as "allowed" above? > iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE: > > d) UTF-16 > $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c > 0000000 376 377 \0 g \0 i \0 t So among three, encoder can only do "big endian with BOM" (c). Lack of (a) "big endian without BOM" in the encoder is not a problem in practice, as you can ask UTF-16BE to produce the stream, tell the decoder that you have UTF-16 and the lack of the BOM would make the decoder take it as (a). But lack of (b) "little endian with BOM" is a problem. So the proposal is to invent UTF-16-[BL]E-BOM that prepends BOM in front of UTF-16-[BL]E output to allow those who want (b). Which makes sense, I guess. I do find it a bit ugly in the sense that it is something iconv should learn to do, as the issue is shared with all applications that want to use libiconv and convert into UTF-16. Do you add UTF-16-BE-BOM for consistency? It would be identical to telling iconv to encode to UTF-16, if I understood your problem description correctly. > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index b8392fc330..4a88ab8be7 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -343,13 +343,13 @@ automatic line ending conversion based on your platform. > ------------------------ > > Use the following attributes if your '*.ps1' files are UTF-16 little > -endian encoded without BOM and you want Git to use Windows line endings > +endian encoded with BOM and you want Git to use Windows line endings > in the working directory. Please note, it is highly recommended to > explicitly define the line endings with `eol` if the `working-tree-encoding` > attribute is used to avoid ambiguity. > > ------------------------ > -*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF > +*.ps1 text working-tree-encoding=UTF-16LE-BOM eol=CRLF > ------------------------ This change is robbing from those who do want a file without BOM to give to those who do want a file with BOM. Are the latter class of people the majority of the intended readers (read: Windows folks)? I wonder if the following, instead of the above hunk, would work better: endian encoded without BOM and you want Git to use Windows line endings -in the working directory. Please note, it is highly recommended to +in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if +you want UTF-16 little endian with BOM). +Please note, it is highly recommended to explicitly define the line endings with `eol` if the `working-tree-encoding` > @@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz, > { > iconv_t conv; > char *out; > + const char *bom_str = NULL; > + size_t bom_len = 0; > > if (!in_encoding) > return NULL; > > + /* UTF-16LE-BOM is the same as UTF-16 for reading */ > + if (same_utf_encoding("UTF-16LE-BOM", in_encoding)) > + in_encoding = "UTF-16"; > + > + /* > + * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM" > + * Some users under Windows want the little endian version > + */ > + if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) { > + bom_str = utf16_le_bom; > + bom_len = sizeof(utf16_le_bom); > + out_encoding = "UTF-16LE"; > + } else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) { > + bom_str = utf16_be_bom; > + bom_len = sizeof(utf16_be_bom); > + out_encoding = "UTF-16BE"; OK, you do allow BE-BOM and the code does not rely on the fact that iconv happens to produce it with "UTF-16", because the library is free to switch between the three possible output (a)-(c) and we do not want to get affected by such a switch. Makes sense.