DragonFly On-Line Manual Pages

UTF(3)                DragonFly Library Functions Manual                UTF(3)

NAME

runetochar, chartorune, runelen, fullrune, utflen, utfrune, utfrrune, utfutf - Unicode Text Format functionality

SYNOPSIS

#include <utf.h>

int runetochar(char *cp, Rune *rp);
int chartorune(Rune *rp, char *cp);
int runelen(long r);
int fullrune(char *cp, int n);
int utflen(char *s);
int utfbytes(char *s);
char *utfrune(char *cp, long r);
char *utfrrune(char *cp, long r);
char *utfutf(char *big, char *little);
int utf_snprintf(char *buf, size_t size, char *format, ...);
int utfcmp(char *s1, char *s2);
int utfncmp(char *s1, char *s2, int rc);
char *utfcpy(char *dst, char *src);
char *utfncpy(char *dst, char *src, int nbytes);
char *utfcat(char *src, char *append);
char *utfncat(char *src, char *append, int nbytes);

DESCRIPTION

The UTF routines pack Unicode text into a standard character stream. To do that effectively, the 128 ASCII characters form the lowest 128 characters of UTF-8 and are encoded as themselves, so these characters are interchangeable between the two character sets. A Rune is a Unicode character, defined in the header file utf.h.

runetochar translates a single Rune to a UTF sequence and returns the number of bytes produced. chartorune is the inverse of this function, returning the number of bytes consumed. runelen returns the number of bytes in the encoding of a Rune. fullrune checks that the first n bytes of the UTF string cp contain a complete UTF encoding.

utflen returns the number of runes in a UTF string. utfbytes returns the number of bytes in a UTF string.

utfrune returns a pointer to the first occurrence of a rune in a UTF string. utfrrune returns a pointer to the last occurrence. utfutf searches for the first occurrence of a UTF string in another UTF string.

utf_snprintf is a particularly dumb implementation of snprintf for UTF strings: it only interprets %%, %s and %d sequences in the format string, and does no field width calculation on those.

utfcmp compares two strings lexicographically, Rune by Rune, and returns a value greater than zero, equal to zero, or less than zero depending on whether the first UTF string is greater than, the same as, or less than the second string. utfncmp does the same comparison as utfcmp, but compares at most rc Runes.

utfcpy copies from source to destination, Rune by Rune. No bounds checking is done on the number of Runes copied, or their individual sizes. The dst argument is returned.

utfncpy copies at most nbytes bytes from source to destination, terminating when a null Rune is found in the source. If the number of bytes copied is less than nbytes, the destination string is padded with null (0) bytes. If it is equal to or greater than nbytes, no zero byte is added. The dst argument is returned.
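
The padding rule described for utfncpy can be illustrated with a minimal sketch. The name sketch_ncpy is hypothetical and is not part of utf.h. The sketch assumes plain byte-wise copying in the manner of strncpy; whether the real utfncpy avoids splitting a multi-byte sequence at the nbytes limit is not stated here.

```c
#include <stddef.h>

/* Hypothetical illustration of the utfncpy rule; not the library's code. */
char *sketch_ncpy(char *dst, const char *src, size_t nbytes)
{
    size_t i = 0;

    /* Copy bytes until the source terminator or the limit is reached. */
    while (i < nbytes && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }

    /* Pad what is left with zero bytes. If the limit was reached,
     * this loop does not run and dst is left unterminated. */
    while (i < nbytes) {
        dst[i++] = '\0';
    }

    return dst;
}
```

As with strncpy, a caller that needs a terminated string should reserve one extra byte and set it to zero after the call.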

utfcat appends the UTF string append to the end of the UTF string src. utfncat does the same, but takes into account that the buffer src is only nbytes long.
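
The encoding that runetochar, chartorune, runelen and utflen operate on can be sketched as follows. All names prefixed with sketch_ are hypothetical stand-ins, not the functions declared in utf.h, and sketch_rune is an assumed substitute for the Rune type. The sketch assumes the modern four-byte limit of UTF-8 (code points up to 0x10FFFF) and does not reject overlong forms or surrogates; the library described here may behave differently on all of these points.

```c
#include <stddef.h>

/* Assumed stand-in for Rune; the definition in utf.h may differ. */
typedef unsigned long sketch_rune;

/* Bytes needed to encode r (1 to 4), or 0 if r is out of range. */
int sketch_runelen(sketch_rune r)
{
    if (r < 0x80) return 1;
    if (r < 0x800) return 2;
    if (r < 0x10000) return 3;
    if (r < 0x110000) return 4;
    return 0;
}

/* Encode r into cp (room for 4 bytes); returns bytes written, 0 on error. */
int sketch_encode(char *cp, sketch_rune r)
{
    unsigned char *p = (unsigned char *)cp;
    int n = sketch_runelen(r);

    switch (n) {
    case 1:
        p[0] = (unsigned char)r;
        break;
    case 2:
        p[0] = (unsigned char)(0xC0 | (r >> 6));
        p[1] = (unsigned char)(0x80 | (r & 0x3F));
        break;
    case 3:
        p[0] = (unsigned char)(0xE0 | (r >> 12));
        p[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (r & 0x3F));
        break;
    case 4:
        p[0] = (unsigned char)(0xF0 | (r >> 18));
        p[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
        p[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        p[3] = (unsigned char)(0x80 | (r & 0x3F));
        break;
    }
    return n;
}

/* Decode one rune from cp; returns bytes consumed, 0 if malformed. */
int sketch_decode(sketch_rune *rp, const char *cp)
{
    const unsigned char *p = (const unsigned char *)cp;
    sketch_rune r;
    int n, i;

    if (p[0] < 0x80) { *rp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; r = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; r = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; r = p[0] & 0x07; }
    else return 0;

    /* Each continuation byte must look like 10xxxxxx. A string
     * terminator fails this check, so the loop never reads past it. */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        r = (r << 6) | (sketch_rune)(p[i] & 0x3F);
    }
    *rp = r;
    return n;
}

/* Count runes in a null-terminated string; a malformed byte counts as one. */
int sketch_utflen(const char *s)
{
    sketch_rune r;
    int count = 0;

    while (*s != '\0') {
        int n = sketch_decode(&r, s);
        s += (n > 0) ? n : 1;
        count++;
    }
    return count;
}
```

For example, the euro sign U+20AC encodes to the three bytes 0xE2 0x82 0xAC, so a string holding only that character has a byte count of 3 but a rune count of 1.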

IMPLEMENTATION

This implementation of UTF, nominally UTF-8, can encode a null Unicode character using either a one-byte or a two-byte encoding. Typically, Plan 9 uses the one-byte encoding, whilst Java uses the two-byte encoding. The Plan 9 style of encoding makes backwards compatibility much easier, and loses nothing: all the Java functionality is there, there are no embedded null bytes in a UTF string because the second and third bytes of a multi-byte encoding are never zero, and ordinary C strings are recognised as well, which is not the case in Java. By default, the one-byte null encoding is used.

UTF-8 is defined in X/Open Company Ltd., "File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary Specification, Document Number: P316, which also appears in ISO/IEC 10646, Annex P.
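
The practical difference between the one-byte and two-byte null encodings can be sketched as below. The function names are hypothetical and not part of utf.h. The two-byte form shown, 0xC0 0x80, is the pair used by Java's modified UTF-8; that this library emits exactly the same pair is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the rune U+0000 into cp.
 * two_byte == 0: Plan 9 style, a single 0x00 byte.
 * two_byte != 0: Java style, the pair 0xC0 0x80 (assumed here).
 * Returns the number of bytes written. */
int sketch_encode_null(char *cp, int two_byte)
{
    unsigned char *p = (unsigned char *)cp;

    if (two_byte) {
        p[0] = 0xC0;
        p[1] = 0x80;
        return 2;
    }
    p[0] = 0x00;
    return 1;
}

/* Build 'a', U+0000, 'b' in buf (room for 5 bytes) and report
 * how long an ordinary C string function thinks the result is. */
size_t sketch_visible_length(char *buf, int two_byte)
{
    size_t n = 0;

    buf[n++] = 'a';
    n += (size_t)sketch_encode_null(buf + n, two_byte);
    buf[n++] = 'b';
    buf[n] = '\0';

    return strlen(buf);
}
```

With the one-byte form strlen stops at the encoded null and reports 1; with the two-byte form it reports 4, because the null rune no longer looks like a string terminator. The one-byte form is what keeps ordinary C strings and UTF strings interchangeable.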

BUGS

Undoubtedly, these are many, and legion.

AUTHOR

Written by Alistair Crooks (agc@amdahl.com, or agc@westley.demon.co.uk), from a draft document written by Rob Pike and Ken Thompson, detailing the implementation of UTF in the Plan 9 operating system.
