
5. Characters

Characters are objects that represent printed characters, such as letters and digits.

5.1 External Representation of Characters  
5.2 Comparison of Characters  
5.3 Miscellaneous Character Operations  
5.4 Internal Representation of Characters  
5.5 ISO-8859-1 Characters  
5.6 Character Sets  
5.7 Unicode  



5.1 External Representation of Characters

Characters are written using the notation #\character or #\character-name. For example:

 
#\a                     ; lowercase letter
#\A                     ; uppercase letter
#\(                     ; left parenthesis
#\space                 ; the space character
#\newline               ; the newline character

Case is significant in #\character, but not in #\character-name. If character in #\character is a letter, character must be followed by a delimiter character such as a space or parenthesis. Characters written in the #\ notation are self-evaluating; you don't need to quote them.

In addition to the standard character syntax, MIT Scheme also supports a general syntax that denotes any Unicode character by its code point. This notation is #\U+code-point, where code-point is a sequence of hexadecimal digits for a valid code point. So the above examples could also be written like this:

 
#\U+61                  ; lowercase letter
#\U+41                  ; uppercase letter
#\U+28                  ; left parenthesis
#\U+20                  ; the space character
#\U+0A                  ; the newline character

A character name may include one or more bucky bit prefixes to indicate that the character includes one or more of the keyboard shift keys Control, Meta, Super, or Hyper (note that the Control bucky bit prefix is not the same as the ASCII control key). The bucky bit prefixes and their meanings are as follows (case is not significant):

 
Key             Bucky bit prefix        Bucky bit
---             ----------------        ---------

Meta            M- or Meta-                 1
Control         C- or Control-              2
Super           S- or Super-                4
Hyper           H- or Hyper-                8

For example,

 
#\c-a                   ; Control-a
#\meta-b                ; Meta-b
#\c-s-m-h-a             ; Control-Super-Meta-Hyper-a

The following character-names are supported, shown here with their ASCII equivalents:

 
Character Name          ASCII Name
--------------          ----------

altmode                 ESC
backnext                US
backspace               BS
call                    SUB
linefeed                LF
page                    FF
return                  CR
rubout                  DEL
space
tab                     HT

In addition, #\newline is the same as #\linefeed (but this may change in the future, so you should not depend on it). All of the standard ASCII names for non-printing characters are supported:

 
NUL     SOH     STX     ETX     EOT     ENQ     ACK     BEL
BS      HT      LF      VT      FF      CR      SO      SI
DLE     DC1     DC2     DC3     DC4     NAK     SYN     ETB
CAN     EM      SUB     ESC     FS      GS      RS      US
DEL

procedure: char->name char [slashify?]
Returns a string corresponding to the printed representation of char. This is the character or character-name component of the external representation, combined with the appropriate bucky bit prefixes.

 
(char->name #\a)                        =>  "a"
(char->name #\space)                    =>  "Space"
(char->name #\c-a)                      =>  "C-a"
(char->name #\control-a)                =>  "C-a"

Slashify?, if specified and true, says to insert the necessary backslash characters in the result so that read will parse it correctly. In other words, the following generates the external representation of char:

 
(string-append "#\\" (char->name char #t))

If slashify? is not specified, it defaults to #f.

procedure: name->char string
Converts a string that names a character into the character specified. If string does not name any character, name->char signals an error.

 
(name->char "a")                        =>  #\a
(name->char "space")                    =>  #\Space
(name->char "c-a")                      =>  #\C-a
(name->char "control-a")                =>  #\C-a



5.2 Comparison of Characters

procedure: char=? char1 char2
procedure: char<? char1 char2
procedure: char>? char1 char2
procedure: char<=? char1 char2
procedure: char>=? char1 char2
procedure: char-ci=? char1 char2
procedure: char-ci<? char1 char2
procedure: char-ci>? char1 char2
procedure: char-ci<=? char1 char2
procedure: char-ci>=? char1 char2
Returns #t if the specified characters have the appropriate order relationship to one another; otherwise returns #f. The -ci procedures don't distinguish uppercase and lowercase letters.

The Scheme standard leaves some details of character ordering to the implementation. MIT/GNU Scheme uses a specific character ordering, in which characters have the same order as their corresponding integers. See the documentation for char->integer for further details.
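
As a small illustration of this ordering, the following sketch compares two characters through their integer codes. The helper char-before? is hypothetical and not part of MIT/GNU Scheme; it is only meant to show that the integer comparison and char<? should agree.

```scheme
;; Hypothetical helper, not part of MIT/GNU Scheme: orders two characters
;; by comparing the integers that char->integer returns for them.
(define (char-before? a b)
  (< (char->integer a) (char->integer b)))

(char-before? #\A #\a)                  ; 65 < 97, so #t
(char-before? #\b #\a)                  ; 98 < 97 is false, so #f
(char<? #\A #\a)                        ; same answer as the first call
```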

Note: Although character objects can represent all of Unicode, the model of alphabetic case used covers only ASCII letters, which means that case-insensitive comparisons and case conversions are incorrect for non-ASCII letters. This will eventually be fixed.



5.3 Miscellaneous Character Operations

procedure: char? object
Returns #t if object is a character; otherwise returns #f.

procedure: char-upcase char
procedure: char-downcase char
Returns the uppercase or lowercase equivalent of char if char is a letter; otherwise returns char. These procedures return a character char2 such that (char-ci=? char char2).

Note: Although character objects can represent all of Unicode, the model of alphabetic case used covers only ASCII letters, which means that case-insensitive comparisons and case conversions are incorrect for non-ASCII letters. This will eventually be fixed.

procedure: char->digit char [radix]
If char is a character representing a digit in the given radix, returns the corresponding integer value. If you specify radix (which must be an exact integer between 2 and 36 inclusive), the conversion is done in that base, otherwise it is done in base 10. If char doesn't represent a digit in base radix, char->digit returns #f.

Note that this procedure is insensitive to the alphabetic case of char.

 
(char->digit #\8)                       =>  8
(char->digit #\e 16)                    =>  14
(char->digit #\e)                       =>  #f
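
Because char->digit returns #f for a character that is not a digit, it can drive a simple numeral parser. The following is a minimal sketch; the name digits->integer is hypothetical and not part of MIT/GNU Scheme.

```scheme
;; Hypothetical helper: converts a string of digits in the given radix to
;; an exact integer, or returns #f if any character is not a valid digit.
(define (digits->integer string radix)
  (let loop ((chars (string->list string)) (value 0))
    (if (null? chars)
        value
        (let ((digit (char->digit (car chars) radix)))
          ;; char->digit yields #f for a non-digit, which ends the loop.
          (and digit
               (loop (cdr chars) (+ (* value radix) digit)))))))

(digits->integer "ff" 16)               ; 255
(digits->integer "FF" 16)               ; 255, since case is ignored
(digits->integer "1g" 16)               ; #f, since g is not a hex digit
```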

procedure: digit->char digit [radix]
Returns a character that represents digit in the radix given by radix. Radix must be an exact integer between 2 and 36 (inclusive), and defaults to 10. Digit, which must be an exact non-negative integer, should be less than radix; if digit is greater than or equal to radix, digit->char returns #f.

 
(digit->char 8)                         =>  #\8
(digit->char 14 16)                     =>  #\E
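
In the other direction, digit->char can be used to build the numeral for an integer one digit at a time. This is only a sketch under the assumption that n is an exact non-negative integer; the name integer->digits is hypothetical.

```scheme
;; Hypothetical helper: returns the numeral for the exact non-negative
;; integer n in the given radix, most significant digit first.
(define (integer->digits n radix)
  (let loop ((n n) (chars '()))
    ;; The remainder is always less than radix, so digit->char
    ;; never returns #f here.
    (let ((chars (cons (digit->char (remainder n radix) radix) chars)))
      (if (< n radix)
          (list->string chars)
          (loop (quotient n radix) chars)))))

(integer->digits 255 10)                ; "255"
(integer->digits 5 2)                   ; "101"
```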



5.4 Internal Representation of Characters

An MIT/GNU Scheme character consists of a code part and a bucky bits part. The MIT/GNU Scheme set of characters can represent more characters than ASCII can; it includes characters with Super and Hyper bucky bits, as well as Control and Meta. Every ASCII character corresponds to some MIT/GNU Scheme character, but not vice versa.

MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits. The character code contains the Unicode code point for the character. This is a change from earlier versions of the system, which used the ISO-8859-1 code point, but it is upwards compatible with previous usage, since ISO-8859-1 is a proper subset of Unicode.

procedure: make-char code bucky-bits
Builds a character from code and bucky-bits. Both code and bucky-bits must be exact non-negative integers in the appropriate range. Use char-code and char-bits to extract the code and bucky bits from the character. If 0 is specified for bucky-bits, make-char produces an ordinary character; otherwise, the appropriate bits are turned on as follows:

 
1               Meta
2               Control
4               Super
8               Hyper

For example,

 
(make-char 97 0)                        =>  #\a
(make-char 97 1)                        =>  #\M-a
(make-char 97 2)                        =>  #\C-a
(make-char 97 3)                        =>  #\C-M-a

procedure: char-bits char
Returns the exact integer representation of char's bucky bits. For example,

 
(char-bits #\a)                         =>  0
(char-bits #\m-a)                       =>  1
(char-bits #\c-a)                       =>  2
(char-bits #\c-m-a)                     =>  3

procedure: char-code char
Returns the character code of char, an exact integer. For example,

 
(char-code #\a)                         =>  97
(char-code #\c-a)                       =>  97

Note that in MIT/GNU Scheme, the value of char-code is the Unicode code point for char.
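
Taken together, make-char, char-code, and char-bits allow a character to be taken apart and reassembled. The helpers below are hypothetical names used only for illustration; the bit test assumes that mask is one of the bucky bit values 1, 2, 4, or 8 listed above.

```scheme
;; Hypothetical helper: the same character code with all bucky bits off.
(define (strip-bucky-bits char)
  (make-char (char-code char) 0))

;; Hypothetical helper: tests one bucky bit.  Dividing by the mask shifts
;; the bit of interest into the lowest position, where odd? can test it.
(define (char-has-bucky-bit? char mask)
  (odd? (quotient (char-bits char) mask)))

(strip-bucky-bits #\c-a)                ; #\a
(char-has-bucky-bit? #\c-m-a 2)         ; #t, Control is set
(char-has-bucky-bit? #\m-a 2)           ; #f, only Meta is set
```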

variable: char-code-limit
variable: char-bits-limit
These variables define the (exclusive) upper limits for the character code and bucky bits (respectively). The character code and bucky bits are always exact non-negative integers, and are strictly less than the value of their respective limit variable.

procedure: char->integer char
procedure: integer->char k
char->integer returns the character code representation for char. integer->char returns the character whose character code representation is k.

In MIT/GNU Scheme, if (char-ascii? char) is true, then

 
(eqv? (char->ascii char) (char->integer char))  =>  #t

However, this behavior is not required by the Scheme standard, and code that depends on it is not portable to other implementations.

These procedures implement order isomorphisms between the set of characters under the char<=? ordering and some subset of the integers under the <= ordering. That is, if

 
(char<=? a b)  =>  #t    and    (<= x y)  =>  #t

and x and y are in the range of char->integer, then

 
(<= (char->integer a)
    (char->integer b))                  =>  #t
(char<=? (integer->char x)
         (integer->char y))             =>  #t

In MIT/GNU Scheme, the specific relationship implemented by these procedures is as follows:

 
(define (char->integer c)
  (+ (* (char-bits c) #x200000)
     (char-code c)))

(define (integer->char n)
  (make-char (remainder n #x200000)
             (quotient n #x200000)))

This implies that char->integer and char-code produce identical results for characters that have no bucky bits set, and that characters are ordered according to their Unicode code points.
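
As a worked example of this arithmetic, consider #\c-a, which has character code 97 and bucky bits 2. The expressions below only restate the definitions above with those numbers filled in.

```scheme
;; Combining the parts, as in the char->integer definition above:
(+ (* 2 #x200000) 97)                   ; 4194401

;; Splitting the integer again, as in the integer->char definition above:
(quotient 4194401 #x200000)             ; 2, the bucky bits (Control)
(remainder 4194401 #x200000)            ; 97, the character code of a
```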

Note: If the argument to char->integer or integer->char is a constant, the compiler will constant-fold the call, replacing it with the corresponding result. This is a very useful way to denote unusual character constants or ASCII codes.

variable: char-integer-limit
The range of char->integer is defined to be the exact non-negative integers that are less than the value of this variable (exclusive). Note, however, that there are some holes in this range, because the character code must be a valid Unicode code point.



5.5 ISO-8859-1 Characters

MIT/GNU Scheme internally uses ISO-8859-1 codes for I/O, and stores character objects in a fashion that makes it convenient to convert between ISO-8859-1 codes and characters. Also, character strings are implemented as byte vectors whose elements are ISO-8859-1 codes; these codes are converted to character objects when accessed. For these reasons it is sometimes desirable to be able to convert between ISO-8859-1 codes and characters.

Not all characters can be represented as ISO-8859-1 codes. A character that has an equivalent ISO-8859-1 representation is called an ISO-8859-1 character.

For historical reasons, the procedures that manipulate ISO-8859-1 characters use the word "ASCII" rather than "ISO-8859-1".

procedure: char-ascii? char
Returns the ISO-8859-1 code for char if char has an ISO-8859-1 representation; otherwise returns #f.

In the current implementation, the characters that satisfy this predicate are those in which the bucky bits are turned off, and for which the character code is less than 256.

procedure: char->ascii char
Returns the ISO-8859-1 code for char. An error of type condition-type:bad-range-argument is signalled if char doesn't have an ISO-8859-1 representation.

procedure: ascii->char code
Code must be the exact integer representation of an ISO-8859-1 code. This procedure returns the character corresponding to code.
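
Since char->ascii signals an error for characters without an ISO-8859-1 representation, a caller may prefer to test with char-ascii? first. The following sketch shows one way to do that; the name char->ascii-or-false is hypothetical.

```scheme
;; Hypothetical helper: returns the ISO-8859-1 code of char, or #f when
;; char has no such representation, instead of signalling an error.
(define (char->ascii-or-false char)
  (and (char-ascii? char)
       (char->ascii char)))

(char->ascii-or-false #\a)              ; 97
(char->ascii-or-false #\m-a)            ; #f, because a bucky bit is set
(ascii->char 97)                        ; #\a
```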



5.6 Character Sets

MIT/GNU Scheme's character-set abstraction is used to represent groups of characters, such as the letters or digits. Character sets may contain only ISO-8859-1 characters; use the alphabet abstraction (see section 5.7 Unicode) if you need to cover the entire Unicode range.

procedure: char-set? object
Returns #t if object is a character set; otherwise returns #f.

variable: char-set:upper-case
variable: char-set:lower-case
variable: char-set:alphabetic
variable: char-set:numeric
variable: char-set:alphanumeric
variable: char-set:whitespace
variable: char-set:not-whitespace
variable: char-set:graphic
variable: char-set:not-graphic
variable: char-set:standard
These variables contain predefined character sets. To see the contents of one of these sets, use char-set-members.

Alphabetic characters are the 52 upper and lower case letters. Numeric characters are the 10 decimal digits. Alphanumeric characters are those in the union of these two sets. Whitespace characters are #\space, #\tab, #\page, #\linefeed, and #\return. Graphic characters are the printing characters and #\space. Standard characters are the printing characters, #\space, and #\newline. These are the printing characters:

 
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9
: ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
[ \ ] ^ _ `
a b c d e f g h i j k l m n o p q r s t u v w x y z
{ | } ~

procedure: char-upper-case? char
procedure: char-lower-case? char
procedure: char-alphabetic? char
procedure: char-numeric? char
procedure: char-alphanumeric? char
procedure: char-whitespace? char
procedure: char-graphic? char
procedure: char-standard? object
These predicates are defined in terms of the respective character sets defined above.

procedure: char-set-members char-set
Returns a newly allocated list of the characters in char-set.

procedure: char-set-member? char-set char
Returns #t if char is in char-set; otherwise returns #f.

procedure: char-set=? char-set-1 char-set-2
Returns #t if char-set-1 and char-set-2 contain exactly the same characters; otherwise returns #f.

procedure: char-set char ...
Returns a character set consisting of the specified ISO-8859-1 characters. With no arguments, char-set returns an empty character set.

procedure: chars->char-set chars
Returns a character set consisting of chars, which must be a list of ISO-8859-1 characters. This is equivalent to (apply char-set chars).

procedure: string->char-set string
Returns a character set consisting of all the characters that occur in string.

procedure: ascii-range->char-set lower upper
Lower and upper must be exact non-negative integers representing ISO-8859-1 character codes, and lower must be less than or equal to upper. This procedure creates and returns a new character set consisting of the characters whose ISO-8859-1 codes are between lower (inclusive) and upper (exclusive).

For historical reasons, the name of this procedure refers to "ASCII" rather than "ISO-8859-1".

procedure: predicate->char-set predicate
Predicate must be a procedure of one argument. predicate->char-set creates and returns a character set consisting of the ISO-8859-1 characters for which predicate is true.

procedure: char-set-difference char-set1 char-set2
Returns a character set consisting of the characters that are in char-set1 but aren't in char-set2.

procedure: char-set-intersection char-set ...
Returns a character set consisting of the characters that are in all of the char-sets.

procedure: char-set-union char-set ...
Returns a character set consisting of the characters that are in at least one of the char-sets.

procedure: char-set-invert char-set
Returns a character set consisting of the ISO-8859-1 characters that are not in char-set.
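
The constructors and set operations above can be combined to build application-specific sets. The following sketch uses only procedures described in this section; the variable names vowels, consonants, and hex-digits are chosen here for illustration.

```scheme
(define vowels (string->char-set "aeiouAEIOU"))

;; Every alphabetic character that is not a vowel.
(define consonants
  (char-set-difference char-set:alphabetic vowels))

;; The upper limit of ascii-range->char-set is exclusive, so 97..103
;; covers a through f and 65..71 covers A through F.
(define hex-digits
  (char-set-union char-set:numeric
                  (ascii-range->char-set 97 103)
                  (ascii-range->char-set 65 71)))

(char-set-member? vowels #\e)           ; #t
(char-set-member? consonants #\e)       ; #f
(char-set-member? consonants #\z)       ; #t
(char-set-member? hex-digits #\f)       ; #t
(char-set-member? hex-digits #\g)       ; #f
```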



5.7 Unicode

MIT/GNU Scheme provides rudimentary support for Unicode characters. In an ideal world, Unicode would be the base character set for MIT/GNU Scheme. But MIT/GNU Scheme predates the invention of Unicode, and converting an application of this size is a considerable undertaking. So for the time being, the base character set for I/O and strings is ISO-8859-1, and Unicode support is grafted on.

This Unicode support was implemented as a part of the XML parser (see section 14.12 XML Parser) implementation. XML uses Unicode as its base character set, and any XML implementation must support Unicode.

The basic unit in a Unicode implementation is the code point. The character equivalent of a code point is a wide character.

procedure: unicode-code-point? object
Returns #t if object is a Unicode code point; otherwise returns #f. Code points are implemented as exact non-negative integers. They are further limited, by the Unicode standard, to be strictly less than #x110000, with the values #xD800 through #xDFFF, #xFFFE, and #xFFFF excluded.
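
The limits just described can be written out as a predicate. This is only a sketch of the stated rule, not the system's actual definition, and the name code-point-sketch? is hypothetical.

```scheme
;; Hypothetical restatement of the rule described for unicode-code-point?.
(define (code-point-sketch? object)
  (and (integer? object)
       (exact? object)
       (<= 0 object)
       (< object #x110000)              ; strictly below the Unicode limit
       (not (<= #xD800 object #xDFFF))  ; excluded range
       (not (= object #xFFFE))
       (not (= object #xFFFF))))

(code-point-sketch? #x41)               ; #t
(code-point-sketch? #xD800)             ; #f, inside the excluded range
(code-point-sketch? #x110000)           ; #f, not below the limit
(code-point-sketch? 'a)                 ; #f, not an integer
```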

procedure: wide-char? object
Returns #t if object is a wide character, specifically if object is a character with no bucky bits and whose code satisfies unicode-code-point?.

The Unicode implementation consists of three parts:

5.7.1 Wide Strings  
5.7.2 Unicode Representations  
5.7.3 Alphabets  



5.7.1 Wide Strings

Wide characters can be combined into wide strings, which are similar to strings but can contain any Unicode character sequence. The implementation used for wide strings is guaranteed to provide constant-time access to each character in the string.

procedure: wide-string? object
Returns #t if object is a wide string.

procedure: make-wide-string k [wide-char]
Returns a newly allocated wide string of length k. If wide-char is specified, all elements of the returned string are initialized to wide-char; otherwise the contents of the string are unspecified.

procedure: wide-string wide-char ...
Returns a newly allocated wide string consisting of the specified characters.

procedure: wide-string-length wide-string
Returns the length of wide-string as an exact non-negative integer.

procedure: wide-string-ref wide-string k
Returns character k of wide-string. K must be a valid index of wide-string.

procedure: wide-string-set! wide-string k wide-char
Stores wide-char in element k of wide-string and returns an unspecified value. K must be a valid index of wide-string.

procedure: string->wide-string string [start [end]]
Returns a newly allocated wide string with the same contents as string. If start and end are supplied, they specify a substring of string that is to be converted. Start defaults to `0', and end defaults to `(string-length string)'.

procedure: wide-string->string wide-string [start [end]]
Returns a newly allocated string with the same contents as wide-string. The argument wide-string must satisfy wide-string?. If start and end are supplied, they specify a substring of wide-string that is to be converted. Start defaults to `0', and end defaults to `(wide-string-length wide-string)'.

It is an error if any character in wide-string fails to satisfy char-ascii?.
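
The following sketch shows a round trip between an ordinary string and a wide string using the procedures described so far. The variable name ws is chosen here for illustration, and the results in the comments follow from the descriptions above.

```scheme
;; string->wide-string returns a newly allocated wide string, so it is
;; safe to modify the result.
(define ws (string->wide-string "hello"))

(wide-string? ws)                       ; #t
(wide-string-length ws)                 ; 5
(wide-string-ref ws 1)                  ; #\e

(wide-string-set! ws 0 #\j)
;; Every character still satisfies char-ascii?, so converting back works.
(wide-string->string ws)                ; "jello"
```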

procedure: open-wide-input-string wide-string [start [end]]
Returns a new input port that sources the characters of wide-string. The optional arguments start and end may be used to specify that the port delivers characters from a substring of wide-string; if not given, start defaults to `0' and end defaults to `(wide-string-length wide-string)'.

procedure: open-wide-output-string
Returns an output port that accepts wide characters and strings and accumulates them in a buffer. Call get-output-string on the returned port to get a wide string containing the accumulated characters.

procedure: call-with-wide-output-string procedure
Creates a wide-string output port and calls procedure on that port. The value returned by procedure is ignored, and the accumulated output is returned as a wide string. This is equivalent to:

 
(define (call-with-wide-output-string procedure)
  (let ((port (open-wide-output-string)))
    (procedure port)
    (get-output-string port)))



5.7.2 Unicode Representations

The procedures in this section implement transformations that convert between the internal representation of Unicode characters and several standard external representations. These external representations are all implemented as sequences of bytes, but they differ in their intended usage.

UTF-8
Each character is written as a sequence of one to four bytes.

UTF-16
Each character is written as a sequence of one or two 16-bit integers.

UTF-32
Each character is written as a single 32-bit integer.

The UTF-16 and UTF-32 representations may be serialized to and from a byte stream in either big-endian or little-endian order. In big-endian order, the most significant byte is first, the next most significant byte is second, etc. In little-endian order, the least significant byte is first, etc. All of the UTF-16 and UTF-32 representation procedures are available in both orders, which are indicated by names containing `utfNN-be' and `utfNN-le', respectively. There are also procedures that implement host-endian order, which is either big-endian or little-endian depending on the underlying computer architecture.
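
To make the byte-level picture concrete, the sketch below computes the UTF-8 bytes for a single code point using plain integer arithmetic. It is not how MIT/GNU Scheme implements the conversion; the name code-point->utf8-bytes is hypothetical, and the argument is assumed to be a valid code point.

```scheme
;; Hypothetical helper: returns the UTF-8 encoding of code point cp as a
;; list of byte values.  #x40 is 2^6, the payload of a continuation byte.
(define (code-point->utf8-bytes cp)
  (cond ((< cp #x80)
         (list cp))
        ((< cp #x800)
         (list (+ #xC0 (quotient cp #x40))
               (+ #x80 (remainder cp #x40))))
        ((< cp #x10000)
         (list (+ #xE0 (quotient cp #x1000))
               (+ #x80 (remainder (quotient cp #x40) #x40))
               (+ #x80 (remainder cp #x40))))
        (else
         (list (+ #xF0 (quotient cp #x40000))
               (+ #x80 (remainder (quotient cp #x1000) #x40))
               (+ #x80 (remainder (quotient cp #x40) #x40))
               (+ #x80 (remainder cp #x40))))))

(code-point->utf8-bytes #x41)           ; (65)
(code-point->utf8-bytes #xE9)           ; (195 169), that is C3 A9
(code-point->utf8-bytes #x20AC)         ; (226 130 172), that is E2 82 AC
```

For comparison, the same code point U+20AC occupies the two bytes 20 AC in big-endian UTF-16 and AC 20 in little-endian UTF-16.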

procedure: read-utf8-char port
procedure: read-utf16-be-char port
procedure: read-utf16-le-char port
procedure: read-utf16-char port
procedure: read-utf32-be-char port
procedure: read-utf32-le-char port
procedure: read-utf32-char port
Each of these procedures reads a single wide character from the given port. Port is treated as a stream of bytes encoded in the corresponding `utfNN' representation.

procedure: write-utf8-char wide-char port
procedure: write-utf16-be-char wide-char port
procedure: write-utf16-le-char wide-char port
procedure: write-utf32-be-char wide-char port
procedure: write-utf32-le-char wide-char port
procedure: write-utf16-char wide-char port
procedure: write-utf32-char wide-char port
Each of these procedures writes wide-char to the given port. Wide-char is encoded in the corresponding `utfNN' representation and written to port as a stream of bytes.

procedure: utf8-string->wide-string string [start [end]]
procedure: utf16-be-string->wide-string string [start [end]]
procedure: utf16-le-string->wide-string string [start [end]]
procedure: utf16-string->wide-string string [start [end]]
procedure: utf32-be-string->wide-string string [start [end]]
procedure: utf32-le-string->wide-string string [start [end]]
procedure: utf32-string->wide-string string [start [end]]
Each of these procedures converts a byte vector to a wide string, treating string as a stream of bytes encoded in the corresponding `utfNN' representation. The arguments start and end allow specification of a substring; they default to zero and string's length, respectively.

procedure: utf8-string-length string [start [end]]
procedure: utf16-be-string-length string [start [end]]
procedure: utf16-le-string-length string [start [end]]
procedure: utf16-string-length string [start [end]]
procedure: utf32-be-string-length string [start [end]]
procedure: utf32-le-string-length string [start [end]]
procedure: utf32-string-length string [start [end]]
Each of these procedures counts the number of Unicode characters in a byte vector, treating string as a stream of bytes encoded in the corresponding `utfNN' representation. The arguments start and end allow specification of a substring; they default to zero and string's length, respectively.

procedure: wide-string->utf8-string string [start [end]]
procedure: wide-string->utf16-be-string string [start [end]]
procedure: wide-string->utf16-le-string string [start [end]]
procedure: wide-string->utf16-string string [start [end]]
procedure: wide-string->utf32-be-string string [start [end]]
procedure: wide-string->utf32-le-string string [start [end]]
procedure: wide-string->utf32-string string [start [end]]
Each of these procedures converts a wide string to a stream of bytes encoded in the corresponding `utfNN' representation, and returns that stream as a byte vector. The arguments start and end allow specification of a substring; they default to zero and string's length, respectively.



5.7.3 Alphabets

Applications often need to manipulate sets of characters, such as the set of alphabetic characters or the set of whitespace characters. The alphabet abstraction provides an efficient implementation of sets of Unicode code points.

procedure: alphabet? object
Returns #t if object is a Unicode alphabet, otherwise returns #f.

procedure: alphabet wide-char ...
Returns a Unicode alphabet containing the wide characters passed as arguments.

procedure: code-points->alphabet items
Returns a Unicode alphabet containing the code points described by items. Items must satisfy well-formed-code-points-list?.

procedure: alphabet->code-points alphabet
Returns a well-formed code-points list that describes the code points represented by alphabet.

procedure: well-formed-code-points-list? object
Returns #t if object is a well-formed code-points list, otherwise returns #f. A well-formed code-points list is a proper list, each element of which is either a code point or a pair of code points. A pair of code points represents a contiguous range of code points. The CAR of the pair is the lower limit, and the CDR is the upper limit. Both limits are inclusive, and the lower limit must be strictly less than the upper limit.

procedure: char-in-alphabet? char alphabet
Returns #t if char is a member of alphabet, otherwise returns #f.
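
For instance, a code-points list can mix single code points with ranges. The sketch below builds an alphabet of characters that might be allowed in an identifier; the variable name identifier-chars is chosen here for illustration, and the items are listed in ascending order as a precaution, since the text above does not say whether order matters.

```scheme
(define identifier-chars
  (code-points->alphabet
   (list (cons #x30 #x39)               ; 0 through 9, both limits inclusive
         (cons #x41 #x5A)               ; A through Z
         #x5F                           ; underscore, a single code point
         (cons #x61 #x7A))))            ; a through z

(char-in-alphabet? #\q identifier-chars)   ; #t
(char-in-alphabet? #\- identifier-chars)   ; #f

;; A range whose lower limit exceeds its upper limit is not well formed.
(well-formed-code-points-list? (list (cons #x39 #x30)))   ; #f
```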

Character sets and alphabets can be converted to one another, provided that the alphabet contains only 8-bit code points. This is true because 8-bit code points in Unicode map directly to ISO-8859-1 characters, which is what character sets contain.

procedure: char-set->alphabet char-set
Returns a Unicode alphabet containing the code points that correspond to characters that are members of char-set.

procedure: alphabet->char-set alphabet
Returns a character set containing the characters that correspond to 8-bit code points that are members of alphabet. (Code points outside the 8-bit range are ignored.)

procedure: string->alphabet string
Returns a Unicode alphabet containing the code points corresponding to the characters in string. Equivalent to

 
(char-set->alphabet (string->char-set string))

procedure: alphabet->string alphabet
Returns a newly allocated string containing the characters corresponding to the 8-bit code points in alphabet. (Code points outside the 8-bit range are ignored.)

procedure: 8-bit-alphabet? alphabet
Returns #t if alphabet contains only 8-bit code points, otherwise returns #f.

procedure: alphabet+ alphabet ...
Returns a Unicode alphabet that contains each code point that is a member of any of the alphabet arguments.

procedure: alphabet- alphabet1 alphabet2
Returns a Unicode alphabet that contains each code point that is a member of alphabet1 and is not a member of alphabet2.
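
As an illustration, alphabet+ and alphabet- can be combined with the constructors above. The variable names in this sketch are chosen for illustration, and the results in the comments follow from the descriptions in this section.

```scheme
(define letters
  (code-points->alphabet (list (cons #x41 #x5A) (cons #x61 #x7A))))
(define vowels (string->alphabet "AEIOUaeiou"))

;; Letters that are not vowels.
(define consonants (alphabet- letters vowels))

(char-in-alphabet? #\b consonants)      ; #t
(char-in-alphabet? #\e consonants)      ; #f
(8-bit-alphabet? consonants)            ; #t

;; Adding the Greek lowercase range U+03B1 through U+03C9 takes the
;; alphabet beyond 8-bit code points.
(define latin-and-greek
  (alphabet+ letters
             (code-points->alphabet (list (cons #x3B1 #x3C9)))))

(8-bit-alphabet? latin-and-greek)       ; #f
```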



This document was generated by Chris Hanson on September, 19 2003 using texi2html