git@vger.kernel.org mailing list mirror (one of many)
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Junio C Hamano <junkio@cox.net>
Cc: "Nicolas Pitre" <nico@cam.org>,
	"Uwe Kleine-König" <zeisberg@informatik.uni-freiburg.de>,
	git@vger.kernel.org
Subject: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
Date: Fri, 22 Dec 2006 22:03:53 +0100 (CET)
Message-ID: <Pine.LNX.4.63.0612222201200.19693@wbgn013.biozentrum.uni-wuerzburg.de>
In-Reply-To: <7vslf7zrdp.fsf@assigned-by-dhcp.cox.net>


This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().

The most important is probably utf8_strlen(), which returns the length
of the text if it is valid UTF-8, and -1 otherwise.
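
As a rough illustration of that check, here is a standalone sketch. It
is not the code from this patch: sketch_byte_count() and
sketch_strlen() are hypothetical names, and the sketch assumes the
length is meant in characters rather than bytes.

```c
/* Hypothetical helper: bytes occupied by the character at s, or -1. */
static int sketch_byte_count(const char *s)
{
	unsigned char c = (unsigned char)s[0];
	int i, count;

	if (!(c & 0x80))
		count = 1;		/* 0xxxxxxx */
	else if ((c & 0xe0) == 0xc0)
		count = 2;		/* 110xxxxx */
	else if ((c & 0xf0) == 0xe0)
		count = 3;		/* 1110xxxx */
	else if ((c & 0xf8) == 0xf0)
		count = 4;		/* 11110xxx */
	else
		return -1;

	/* Every following byte must look like 10xxxxxx. */
	for (i = 1; i < count; i++)
		if (((unsigned char)s[i] & 0xc0) != 0x80)
			return -1;
	return count;
}

/* Hypothetical helper: number of characters, or -1 if not UTF-8. */
static int sketch_strlen(const char *text)
{
	int len = 0;

	while (*text) {
		int count = sketch_byte_count(text);
		if (count < 0)
			return -1;
		len++;			/* one character, count bytes */
		text += count;
	}
	return len;
}
```

With this sketch, the UTF-8 string "K\xc3\xb6nig" yields 5, while the
ISO-8859-1 spelling "K\xf6nig" yields -1: the byte 0xf6 looks like the
start of a four-byte sequence, but the following 'n' is not a
continuation byte.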

Note that we do not go the full nine yards: we could also check that
each character is encoded with the minimum number of bytes, as pointed
out by Uwe Kleine-Koenig.
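
For reference, such a minimum-length check could look roughly like the
sketch below. It is not part of this patch; the helper name
sketch_is_shortest_form() is hypothetical, and it assumes the sequence
has already passed the lead-byte and continuation-byte checks.

```c
/*
 * Hypothetical helper: return 1 if the count-byte sequence at s uses
 * the minimum number of bytes for its code point, 0 if it is overlong
 * (or if count is out of range).
 */
static int sketch_is_shortest_form(const char *s, int count)
{
	/* Smallest code point that needs 1, 2, 3 or 4 bytes. */
	static const unsigned long min_value[] = { 0, 0x80, 0x800, 0x10000 };
	/* Payload bits carried by the lead byte for each length. */
	static const unsigned char lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07 };
	unsigned long cp;
	int i;

	if (count < 1 || count > 4)
		return 0;

	/* Reassemble the code point from the payload bits. */
	cp = (unsigned char)s[0] & lead_mask[count - 1];
	for (i = 1; i < count; i++)
		cp = (cp << 6) | ((unsigned char)s[i] & 0x3f);

	return cp >= min_value[count - 1];
}
```

For example, the two bytes 0xc0 0xaf decode to 0x2f ('/'), which fits
in a single byte, so the sketch reports them as overlong; 0xc3 0xb6
decodes to 0xf6 and is accepted.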

The function print_wrapped_text() can be used to wrap text to a certain
line length.
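
The general idea of wrapping with separate indents for the first and
the following lines can be sketched as below. This is a simplified
greedy variant, not the implementation in this patch: the name
sketch_wrap() and its FILE * parameter are hypothetical, it assumes
one column per byte (plain ASCII), non-negative indents, and it
collapses runs of whitespace.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/*
 * Hypothetical greedy wrapper. Writes text to out, breaking at
 * whitespace so that lines are at most width columns where possible
 * (a single word longer than width is not split). The first line is
 * indented by indent spaces, later lines by indent2. Returns the
 * number of lines written.
 */
static int sketch_wrap(FILE *out, const char *text,
		       int indent, int indent2, int width)
{
	int col = 0, lines = 0;

	for (;;) {
		size_t wordlen;

		while (*text && isspace((unsigned char)*text))
			text++;			/* skip separators */
		if (!*text)
			break;
		wordlen = strcspn(text, " \t\n\r\v\f");

		if (col > 0 && col + 1 + (int)wordlen > width) {
			fputc('\n', out);	/* word does not fit */
			lines++;
			col = 0;
			indent = indent2;
		}
		if (col == 0) {
			fprintf(out, "%*s", indent, "");
			col = indent;		/* indent counts as width */
		} else {
			fputc(' ', out);
			col++;
		}
		fwrite(text, 1, wordlen, out);
		col += (int)wordlen;
		text += wordlen;
	}
	if (col > 0) {
		fputc('\n', out);
		lines++;
	}
	return lines;
}
```

Calling sketch_wrap(stdout, "aaa bbb ccc", 0, 2, 7) writes "aaa bbb" on
the first line and "  ccc" on the second, and returns 2.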

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---

	On Fri, 22 Dec 2006, Junio C Hamano wrote:

	> Nicolas Pitre <nico@cam.org> writes:
	> 
	> > On Fri, 22 Dec 2006, Johannes Schindelin wrote:
	> >> 
	> >> On Thu, 21 Dec 2006, Junio C Hamano wrote:
	> >> 
	> >> >  (2) update commit-tree to reject non utf-8 log messages and
	> >> >      author/committer names when i18n.commitEncoding is _NOT_
	> >> >      set, or set to utf-8.
	> >> 
	> >> The problem is: you cannot easily recognize if it is UTF8 or 
	> >> not, programmatically. There is a good indicator _against_ 
	> >> UTF8, namely the first byte can _only_ be 0xxxxxxx, 110xxxxx, 
	> >> 1110xxxx, 11110xxx. But there is no _positive_ sign that it 
	> >> is UTF8. For example, many umlauts and other special 
	> >> modifications to letters, stay in the range 0x7f-0xff.
	> >
	> > Still... that would be a good enough thing to have in the 
	> > majority of cases, wouldn't it?
	> 
	> I think that would be a very sane thing to do.

	Well, this patch together with the next one implements that.

 Makefile |    6 ++-
 utf8.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h   |    8 +++++
 3 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 29c4662..b4ca48b 100644
--- a/Makefile
+++ b/Makefile
@@ -237,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -256,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..06a66c7
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,93 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/*
+ * This function returns the number of bytes occupied by the character
+ * pointed to by the variable start. If it is not valid UTF-8, it
+ * returns -1.
+ */
+int utf8_byte_count(const char *start)
+{
+	unsigned char c = *(unsigned char *)start;
+	int i, count = 0;
+
+	if (!(c & 0x80))
+		count = 1;
+	else if ((c & 0xe0) == 0xc0)
+		count = 2;
+	else if ((c & 0xf0) == 0xe0)
+		count = 3;
+	else if ((c & 0xf8) == 0xf0)
+		count = 4;
+	else
+		return -1;
+
+	for (i = 1; i < count; i++)
+		if ((start[i] & 0xc0) != 0x80)
+			return -1;
+	return count;
+}
+
+int utf8_strlen(const char *text)
+{
+	int len = 0;
+	while (*text) {
+		int count = utf8_byte_count(text);
+		if (count < 0)
+			return -1;
+		len++;
+		text += count;
+	}
+	return len;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = "                    ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int len)
+{
+	int count = 0, space = -1;
+	int l = utf8_strlen(text), assume_utf8 = (l >= 0);
+
+	l = indent;
+
+	for (;;) {
+		char c = text[count];
+		if (!c || isspace(c)) {
+			if (l < len || space < 0) {
+				const char *start = text;
+				if (space >= 0)
+					start += space;
+				else
+					print_spaces(indent);
+				fwrite(start, text + count - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					l |= 0x07;
+				space = count;
+			} else {
+				putchar('\n');
+				text += space + 1;
+				l = indent = indent2;
+				space = -1;
+				count = 0;
+				continue;
+			}
+		}
+		count += assume_utf8 ? utf8_byte_count(text + count) : 1;
+		l++;
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..96dded9
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_byte_count(const char *start);
+int utf8_strlen(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif
-- 
1.4.4.3.ge5f98-dirty
