Null-terminated string

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character ('\0', called NUL in ASCII). Alternative names are C string, which refers to the C programming language, and ASCIIZ (although C can use encodings other than ASCII).

The length of a C string is found by searching for the first NUL byte. This takes O(n) (linear) time with respect to the string length, which can be slow. It also means that a string cannot contain a NUL, because the first NUL encountered always marks the end.
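
The linear scan can be illustrated with a minimal sketch. The name cstr_length is hypothetical, and the function only mirrors what the standard strlen is specified to compute; real library implementations are usually more heavily optimized.

```c
#include <stddef.h>

/* Hypothetical helper: counts the bytes before the first NUL, as strlen does. */
size_t cstr_length(const char *s)
{
    const char *p = s;
    while (*p != '\0') {
        ++p;                    /* one step per character, so O(n) overall */
    }
    return (size_t)(p - s);
}
```

Because the scan stops at the first NUL, any bytes after an embedded NUL are invisible to it.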

History

Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "length-prefixed"), used a leading byte to store the length of the string. This allowed the string to contain NUL and reduced finding the length to a single memory access (O(1), constant time), but limited string length to 255 characters (on a machine using 8-bit bytes). C designer Dennis Ritchie chose to follow the convention of NUL-termination, already established in BCPL, to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.^[1]
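
For comparison, a length-prefixed layout might look like the following sketch. The names pstr_set and pstr_length and the exact layout (length in the first byte, data after it) are assumptions for illustration, not taken from any particular Pascal implementation, and the code assumes 8-bit bytes.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: buf[0] holds the length, buf[1..] holds the bytes. */
enum { PSTR_MAX = 255 };

/* Stores n bytes from src as a length-prefixed string in dst.
   dst must have room for n + 1 bytes.
   Returns 0 on success, -1 if n does not fit in the one-byte prefix. */
int pstr_set(unsigned char *dst, const void *src, size_t n)
{
    if (n > PSTR_MAX) return -1;
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);    /* embedded NUL bytes are copied like any other */
    return 0;
}

/* Reading the length is a single memory access: O(1). */
size_t pstr_length(const unsigned char *p)
{
    return p[0];
}
```

The 255-character ceiling falls directly out of the one-byte prefix, which is the limitation the paragraph above describes.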

This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, later described the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.^[2]

Implementations

The C programming language supports null-terminated strings as its primary string type.^[3] The C standard library provides many functions for string handling. Operations supported include:

Determining the length of a string
Copying one string to another
Appending (concatenating) one string to another
Finding the first (or last) occurrence of a character within a string
Finding within a string the first occurrence of a character in (or not in) a given set
Finding the first occurrence of a substring within a string
Comparing two strings lexicographically
Splitting a string into multiple substrings
Formatting numeric or string values into a printable output string
Parsing a printable string into numeric values
Converting between single-byte and wide character string encodings
Converting single-byte or wide character strings to and from multi-byte character strings
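
A few of these operations can be combined in a short sketch. The functions parse_key_value and format_key_value are hypothetical and exist only for illustration; the library calls they use (strchr, memcpy, strtol, snprintf) are standard C.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical demo: parses "key=value" where value is a decimal integer.
   Copies the key into key_out (capacity key_cap) and stores the number in
   *value_out. Returns 0 on success, -1 on malformed input or if the key
   does not fit. */
int parse_key_value(const char *line, char *key_out, size_t key_cap, long *value_out)
{
    const char *eq = strchr(line, '=');     /* first occurrence of a character */
    if (eq == NULL) return -1;

    size_t key_len = (size_t)(eq - line);
    if (key_len >= key_cap) return -1;      /* need room for the terminating NUL */
    memcpy(key_out, line, key_len);
    key_out[key_len] = '\0';

    char *end;
    long value = strtol(eq + 1, &end, 10);  /* printable text to a number */
    if (end == eq + 1 || *end != '\0') return -1;
    *value_out = value;
    return 0;
}

/* Formats "key=value" into out; returns whatever snprintf returns. */
int format_key_value(char *out, size_t cap, const char *key, long value)
{
    return snprintf(out, cap, "%s=%ld", key, value);
}
```

Note that every buffer-writing step carries an explicit capacity, since the strings themselves do not record how much space they occupy.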

Limitations

While simple to implement, this representation has been prone to errors and performance problems.

The NUL termination has historically created security problems.^[4] A NUL byte inserted into the middle of a string will truncate it unexpectedly. A common bug was to not allocate the additional space for the NUL, so it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because a NUL was already there by chance from previous use of the same block of memory. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long.
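
One way to avoid the overflow described above is to measure the source before copying and to count the terminator explicitly. The following is a sketch only; copy_checked is a hypothetical helper, not a standard function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, whose capacity cap includes
   the byte for the NUL. Returns 0 on success, -1 if src plus its
   terminator would not fit (dst is then left untouched). */
int copy_checked(char *dst, size_t cap, const char *src)
{
    size_t len = strlen(src);       /* O(n) scan, but required for the check */
    if (len >= cap) return -1;      /* len + 1 bytes are needed, not len */
    memcpy(dst, src, len + 1);      /* the + 1 copies the NUL as well */
    return 0;
}
```

A five-character string therefore needs a six-byte buffer; sizing the buffer to the visible characters alone is the off-by-one error mentioned above.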

The inability to store a NUL requires that string data and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.
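
The difference between the two families of functions shows up as soon as data contains a zero byte. In this sketch, bytes_equal is a hypothetical helper built on the standard memcmp, which takes an explicit length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compares two byte buffers of explicit length.
   Returns nonzero when they have the same length and contents. */
int bytes_equal(const void *a, size_t a_len, const void *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

For the buffers {'i','d',0,'A'} and {'i','d',0,'B'}, strcmp reports equality because it stops at the zero byte, while bytes_equal with the full length of 4 reports a difference. Using the string function on binary data is the kind of error the paragraph above refers to.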

The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(n) anyway, such as in strlcpy. However, this does not always result in an intuitive API.
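
A sketch of the combined approach follows. The function copy_and_measure is hypothetical; it is modeled on the commonly documented contract of the BSD strlcpy (bounded copy, always terminated when the capacity is nonzero, returns the full source length) but is not that function's actual source.

```c
#include <stddef.h>

/* Hypothetical function modeled on the strlcpy contract: copies at most
   cap - 1 bytes, terminates dst whenever cap > 0, and returns the full
   length of src so that the caller can detect truncation. */
size_t copy_and_measure(char *dst, size_t cap, const char *src)
{
    size_t i = 0;
    for (; src[i] != '\0'; ++i) {
        if (i + 1 < cap) {
            dst[i] = src[i];        /* copy only while room remains for the NUL */
        }
    }
    if (cap > 0) {
        dst[i < cap ? i : cap - 1] = '\0';
    }
    return i;                       /* length of src, measured during the copy */
}
```

The caller learns of truncation by checking whether the return value is greater than or equal to the capacity, which is an example of the less intuitive API the paragraph mentions.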

Character encodings

Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible ASCII or UTF-8 string.^[5]^[6]^[7] However, it is common to store the subset of ASCII or UTF-8 – every character except the NUL character – in null-terminated strings. Some systems use "modified UTF-8", which encodes the NUL character as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, as it is a security risk: a 0xC0, 0x80 NUL might be seen as a string terminator in security validation and as a character when used. Some other byte that is not used in UTF-8, such as 0xFE or 0xFF, may be used as the end-of-string marker instead.
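
The substitution used by modified UTF-8 can be sketched as a byte-level pass, since in UTF-8 a zero byte only ever appears as the encoding of the NUL character itself. The function name encode_modified_utf8 and its error convention are assumptions for illustration, and the sketch covers only the NUL substitution described above; it does not validate its input.

```c
#include <stddef.h>

/* Hypothetical encoder: copies n bytes of UTF-8 from src to dst, replacing
   each 0x00 byte with the two-byte sequence 0xC0 0x80, then appends a
   terminating NUL. Returns the number of bytes written before the
   terminator, or (size_t)-1 if dst (capacity cap) is too small. */
size_t encode_modified_utf8(unsigned char *dst, size_t cap,
                            const unsigned char *src, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t need = (src[i] == 0x00) ? 2 : 1;
        if (out + need >= cap) return (size_t)-1;   /* keep room for the final NUL */
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    if (out >= cap) return (size_t)-1;
    dst[out] = 0x00;                                /* the only zero byte in dst */
    return out;
}
```

The three input bytes 'a', 0x00, 'b' become the four bytes 'a', 0xC0, 0x80, 'b' followed by the terminator, so the result can be handled as an ordinary null-terminated string.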

UTF-16 uses 2-byte integers, and because either byte may be zero (in fact every other byte is, when representing ASCII text), it cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 code units terminated by a 16-bit NUL character. (Again the NUL character, which encodes as a single zero code unit, is the only character that cannot be stored; UTF-16 does not have any alternative encoding of zero.)
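
Finding the length of such a string is the same scan as for a byte string, carried out over 16-bit units. The helper name utf16_length is hypothetical, and uint16_t is used here as an assumed representation of a UTF-16 code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts 16-bit code units before the first zero unit. */
size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

The result is a count of code units rather than characters, so a character encoded as a surrogate pair contributes 2. A byte-wise scan over the same memory would stop at the zero byte inside the first code unit of any ASCII text.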

Improvements

Many attempts have been made to make C string handling less error-prone. One strategy is to add safer functions such as strdup and strlcpy, while deprecating the use of unsafe functions such as gets. Another is to add an object-oriented wrapper around C strings so that only safe calls can be made. Neither has been hugely successful, as it is always possible and tempting to call the unsafe functions anyway.

Most modern libraries replace C strings with a structure containing a 32-bit or larger length value (far larger than was ever considered for length-prefixed strings), and often add another pointer, a reference count, and even a NUL to speed up conversion back to a C string. Memory is far larger now: if adding 3 (or 16, or more) bytes to each string is a real problem, the software must be handling so many small strings that some other storage method will save even more memory (for instance, there may be so many duplicates that a hash table will use less memory). Examples include the C++ Standard Library std::string, the Qt QString, the MFC CString, and the C-based implementation CFString from Core Foundation as well as its Objective-C sibling NSString from Foundation, both by Apple. More complex structures, such as the rope, may also be used to store strings.
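
The general shape of such a structure can be sketched in C. The type counted_str and its functions are hypothetical and do not reproduce the layout of std::string, QString, or any other library named above; the sketch keeps only an explicit length and a trailing NUL.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: an explicit length plus a trailing NUL so the
   buffer can be handed to C APIs without copying. */
typedef struct {
    size_t len;     /* bytes in use, not counting the trailing NUL */
    char  *data;    /* heap buffer of len + 1 bytes */
} counted_str;

/* Builds a counted string from n bytes (embedded NULs are allowed).
   Returns 0 on success, -1 on allocation failure or size overflow. */
int counted_str_init(counted_str *s, const void *bytes, size_t n)
{
    if (n == SIZE_MAX) return -1;       /* n + 1 would overflow */
    s->data = malloc(n + 1);
    if (s->data == NULL) return -1;
    memcpy(s->data, bytes, n);
    s->data[n] = '\0';
    s->len = n;
    return 0;
}

/* O(1): the length is stored, not searched for. */
size_t counted_str_length(const counted_str *s) { return s->len; }

/* No copy needed: the buffer is already NUL-terminated. */
const char *counted_str_cstr(const counted_str *s) { return s->data; }

void counted_str_free(counted_str *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

If the stored bytes contain an embedded NUL, the stored length still covers them all, while C functions given the pointer from counted_str_cstr see only the part before it.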

References

[1] Ritchie, Dennis M. (1993). "The Development of the C Language". Proc. 2nd History of Programming Languages Conf.
[2] Kamp, Poul-Henning (25 July 2011). "The Most Expensive One-byte Mistake". ACM Queue. 9 (7). ISSN 1542-7730. Retrieved 2 August 2011.
[3] Ritchie, Dennis (2003). "The Development of the C Language". Retrieved 9 November 2011.
[4] Rain Forest Puppy (9 September 1999). "Perl CGI problems". Phrack Magazine. artofhacking.com. 9 (55): 7. Retrieved 3 January 2016.
[5] "UTF-8, a transformation format of ISO 10646". Retrieved 19 September 2013.
[6] "Unicode/UTF-8-character table". Retrieved 13 September 2013.
[7] Kuhn, Markus. "UTF-8 and Unicode FAQ". Retrieved 13 September 2013.