From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Junio C Hamano <junkio@cox.net>
Cc: "Nicolas Pitre" <nico@cam.org>,
"Uwe Kleine-König" <zeisberg@informatik.uni-freiburg.de>,
git@vger.kernel.org
Subject: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
Date: Fri, 22 Dec 2006 22:03:53 +0100 (CET)
Message-ID: <Pine.LNX.4.63.0612222201200.19693@wbgn013.biozentrum.uni-wuerzburg.de>
In-Reply-To: <7vslf7zrdp.fsf@assigned-by-dhcp.cox.net>

This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().

The most important is probably utf8_strlen(), which returns the length
of the text if it is valid UTF-8, and -1 otherwise.

Note that we do not go the full nine yards: we could also check that
each character is encoded with the minimum number of bytes, as pointed
out by Uwe Kleine-Koenig.

The function print_wrapped_text() can be used to wrap text to a certain
line length.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
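
The minimum-length check that the commit message mentions but leaves out
could look roughly like the sketch below. utf8_is_overlong() is a
hypothetical helper, not part of this patch; it assumes the sequence has
already passed the lead-byte and continuation-byte tests and that count
is the length reported for it (1 to 4).

```c
/*
 * Hypothetical helper (not part of this patch): return 1 if the
 * count-byte sequence at s encodes a code point that would have fit
 * into fewer bytes, 0 otherwise.  Assumes the sequence is otherwise
 * well-formed and 1 <= count <= 4.
 */
int utf8_is_overlong(const unsigned char *s, int count)
{
	/* smallest code point that needs 2, 3 and 4 bytes */
	static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
	unsigned long cp;
	int i;

	if (count < 2 || count > 4)
		return 0;	/* single bytes cannot be overlong */

	/* payload bits of the lead byte: 5, 4 or 3 bits */
	cp = s[0] & (0xff >> (count + 1));
	for (i = 1; i < count; i++)
		cp = (cp << 6) | (s[i] & 0x3f);

	return cp < min_value[count];
}
```

For example, the two bytes 0xc0 0x80 decode to U+0000, which fits into
one byte, so the sequence is overlong; 0xc3 0xa4 decodes to U+00E4 and
is fine.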
On Fri, 22 Dec 2006, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > On Fri, 22 Dec 2006, Johannes Schindelin wrote:
> >>
> >> On Thu, 21 Dec 2006, Junio C Hamano wrote:
> >>
> >> > (2) update commit-tree to reject non utf-8 log messages and
> >> > author/committer names when i18n.commitEncoding is _NOT_
> >> > set, or set to utf-8.
> >>
> >> The problem is: you cannot easily recognize if it is UTF8 or
> >> not, programmatically. There is a good indicator _against_
> >> UTF8, namely the first byte can _only_ be 0xxxxxxx, 110xxxxx,
> >> 1110xxxx, 11110xxx. But there is no _positive_ sign that it
> >> is UTF8. For example, many umlauts and other special
> >> modifications to letters, stay in the range 0x7f-0xff.
> >
> > Still... that would be a good enough thing to have in the
> > majority of cases, wouldn't it?
>
> I think that would be very sane thing to do.
Well, this patch together with the next one implements that.
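
To illustrate the point quoted above (a reliable indicator against
UTF-8, but no positive one), here is a minimal sketch.
looks_like_utf8() is a hypothetical stand-alone helper, not a function
from this patch; it only applies the same lead-byte and
continuation-byte tests to a NUL-terminated string.

```c
/*
 * Hypothetical stand-alone helper (not part of this patch): return 1
 * if every byte sequence in text has a valid UTF-8 lead byte followed
 * by the right number of 10xxxxxx continuation bytes, 0 otherwise.
 */
int looks_like_utf8(const char *text)
{
	const unsigned char *p = (const unsigned char *)text;

	while (*p) {
		int i, count;

		if (!(*p & 0x80))
			count = 1;		/* 0xxxxxxx */
		else if ((*p & 0xe0) == 0xc0)
			count = 2;		/* 110xxxxx */
		else if ((*p & 0xf0) == 0xe0)
			count = 3;		/* 1110xxxx */
		else if ((*p & 0xf8) == 0xf0)
			count = 4;		/* 11110xxx */
		else
			return 0;

		/* a NUL terminator fails this test, so we never overrun */
		for (i = 1; i < count; i++)
			if ((p[i] & 0xc0) != 0x80)
				return 0;
		p += count;
	}
	return 1;
}
```

With this, the Latin-1 bytes "Gr\xfc\xdf" "e" are rejected because 0xfc
is not a valid lead byte. The Latin-1 byte pair "\xc3\xa4" is accepted,
however, because it happens to be the UTF-8 encoding of U+00E4; that is
the missing positive sign.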

 Makefile |    6 ++-
 utf8.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h   |    8 +++++
 3 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 29c4662..b4ca48b 100644
--- a/Makefile
+++ b/Makefile
@@ -237,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -256,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..06a66c7
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,93 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/*
+ * This function returns the number of bytes occupied by the character
+ * pointed to by the variable start. If it is not valid UTF-8, it
+ * returns -1.
+ */
+int utf8_byte_count(const char *start)
+{
+	unsigned char c = *(unsigned char *)start;
+	int i, count = 0;
+
+	if (!(c & 0x80))
+		count = 1;
+	else if ((c & 0xe0) == 0xc0)
+		count = 2;
+	else if ((c & 0xf0) == 0xe0)
+		count = 3;
+	else if ((c & 0xf8) == 0xf0)
+		count = 4;
+	else
+		return -1;
+
+	for (i = 1; i < count; i++)
+		if ((start[i] & 0xc0) != 0x80)
+			return -1;
+	return count;
+}
+
+int utf8_strlen(const char *text)
+{
+	int len = 0;
+	while (*text) {
+		int count = utf8_byte_count(text);
+		if (count < 0)
+			return -1;
+		len += count;
+		text += count;
+	}
+	return len;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = " ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int len)
+{
+	int count = 0, space = -1;
+	int l = utf8_strlen(text), assume_utf8 = (l >= 0);
+
+	l = indent;
+
+	for (;;) {
+		char c = text[count];
+		if (!c || isspace(c)) {
+			if (l < len || space < 0) {
+				const char *start = text;
+				if (space >= 0)
+					start += space;
+				else
+					print_spaces(indent);
+				fwrite(start, text + count - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					l |= 0x07;
+				space = count;
+			} else {
+				putchar('\n');
+				text += space + 1;
+				indent = indent2;
+				space = -1;
+				count = l = 0;
+				continue;
+			}
+		}
+		count += assume_utf8 ? utf8_byte_count(text + count) : 1;
+		l++;
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..96dded9
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_byte_count(const char *start);
+int utf8_strlen(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif
--
1.4.4.3.ge5f98-dirty
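
As posted, utf8_strlen() adds up the byte counts, so for valid input it
returns the same value as strlen(). If a count of characters is wanted
instead, for example for column accounting when wrapping, one possible
variant is sketched below. count_code_points() is a hypothetical name,
not part of this patch, and the sketch assumes the text has already been
checked to be valid UTF-8.

```c
/*
 * Hypothetical variant (not part of this patch): count code points in
 * text that is already known to be valid UTF-8.  Every code point has
 * exactly one byte that is not a 10xxxxxx continuation byte, so it is
 * enough to count those.
 */
int count_code_points(const char *text)
{
	int n = 0;

	for (; *text; text++)
		if (((unsigned char)*text & 0xc0) != 0x80)
			n++;
	return n;
}
```

This counts code points only; combining characters and double-width
characters would still need separate handling for exact column widths.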