NSStringEncoding Considered Harmful

September 1, 2017

By Jeff Johnson

It appears that [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding] never returns nil, even when the data contains bytes outside the ASCII range. I learned of this from Cédric Luthi. For instance, the following code unexpectedly returns an NSString containing the copyright symbol ©, which is non-ASCII:

    const unsigned char bytes[] = { 0xA9 }; // unsigned: 0xA9 does not fit in a signed char
    NSData *inData = [NSData dataWithBytes:bytes length:sizeof(bytes)];
    NSString *string = [[NSString alloc] initWithData:inData encoding:NSASCIIStringEncoding];
    NSLog( @"%@ %@", inData, string );

Ironically, the reverse transformation doesn't have the same result. This returns nil:

    NSData *outData = [string dataUsingEncoding:NSASCIIStringEncoding];
    NSLog( @"%@ %@", string, outData );

The documentation for NSASCIIStringEncoding is clearly false: "Strict 7-bit ASCII encoding within 8-bit chars; ASCII values 0…127 only." The NSString.h header file contains the same falsehood:

    NSASCIIStringEncoding = 1,		/* 0..127 only */

Curiously, though, the CFString.h header file has a more useful comment:

    kCFStringEncodingASCII = 0x0600, /* 0..127 (in creating CFString, values greater than 0x7F are treated as corresponding Unicode value) */
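To make that header comment concrete, here is a rough model in plain C of what "treated as corresponding Unicode value" would mean. This is only a sketch of the behavior the comment describes, not the actual CoreFoundation implementation, and the function name widen_bytes_to_utf16 is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, NOT CoreFoundation code: each input byte is widened
   to a 16-bit code unit with the same numeric value, so nothing is ever
   rejected. The caller must supply room for `length` code units in `out`. */
void widen_bytes_to_utf16(const unsigned char *bytes, size_t length, uint16_t *out)
{
    assert((bytes != NULL && out != NULL) || length == 0);
    for (size_t i = 0; i < length; i++) {
        /* A byte such as 0xA9 simply becomes the code unit 0x00A9. */
        out[i] = bytes[i];
    }
}
```

Under this model the single byte 0xA9 comes out as U+00A9, the copyright symbol, which matches what the first example logs. It would also explain why the conversion has no failure case to report.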

So, it turns out that our API for ASCII string encoding is … not ASCII. Now isn't that special?

What about NSNonLossyASCIIStringEncoding? You might think that's what you'd want to use instead, but some information from Peter Hosey made clear to me that it's almost assuredly not what you want. Peter noted that NSNonLossyASCIIStringEncoding will parse escape sequences. Consider the following code:

    char bytes[] = { 0x5C, 'u','0','0','A','9' };
    NSData *inData = [NSData dataWithBytes:bytes length:sizeof(bytes)];
    NSString *string = [[NSString alloc] initWithData:inData encoding:NSASCIIStringEncoding];
    NSString *stringNonLossy = [[NSString alloc] initWithData:inData encoding:NSNonLossyASCIIStringEncoding];
    NSLog( @"%@ %@ %@", inData, string, stringNonLossy );

And here's the output:

    <5c753030 4139> \u00A9 ©

Wow, no, that's probably not what you want.

Peter says that there isn't currently a strict-ASCII encoding. Thus, if you want to ensure that your input data is ASCII, you'll have to check the data yourself. On the other hand, it should still be safe to use NSASCIIStringEncoding on data that you know is ASCII, for example, the output of the function inet_ntop.
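Checking the data yourself can be as simple as scanning for any byte above 0x7F. The following is a minimal sketch in plain C, which compiles as-is inside an Objective-C file. The helper name is_strict_ascii is my own invention and not part of Foundation or CoreFoundation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true only if every byte lies in the
   7-bit ASCII range 0..127. An empty buffer counts as ASCII. */
bool is_strict_ascii(const void *bytes, size_t length)
{
    assert(bytes != NULL || length == 0);
    const unsigned char *p = bytes;
    for (size_t i = 0; i < length; i++) {
        if (p[i] > 0x7F) {
            return false; /* high bit set: not ASCII */
        }
    }
    return true;
}
```

From Objective-C you could call is_strict_ascii(inData.bytes, inData.length) before initWithData:encoding: and treat a false result the way you would have treated nil. Note that this check only looks at byte values, so it accepts the \u00A9 escape sequence from the earlier example, because those six bytes are all genuine ASCII.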

I think, or at least I hope, that -[NSString dataUsingEncoding:] works as expected with NSASCIIStringEncoding. As far as I can tell, it does return nil for non-ASCII strings. If anyone finds differently, please let me know. Otherwise, you'd have to double-check the output data yourself too.