Bring back STX and ETX

Some people think that XML is the greatest invention since sliced bread. (I won’t name names, to protect the non-existent.) In contrast, I think that it’s just a symptom of a disease, a terminal illness infecting the entire computing world. Programmers are supposed to be smart, but we’re actually the ones who are responsible for the spread of the disease. We started and promoted the practice of using printable text to delimit printable text. In my haughty opinion, this was one of the worst ideas in the history of computers (surpassed only by SMTP and MacAppADay). It was doomed to fail from the beginning, kind of like the paradox of the liar. The use of printable text to delimit printable text has been the cause of countless bugs — let’s say 500 billion — and indeed, countless security vulnerabilities. It continues to plague us today.

Having worked on a feed reader, I do know a thing or two about this issue. I can tell you that it’s a major pain to parse XML feeds. Parsing HTML is even worse, but thankfully we can leave the majority of that to WebKit. It’s hard enough when everything is perfect, but we inevitably run into issues where the text is improperly escaped or not escaped at all. This is no fun for anyone.

Since the beginning, ASCII contained a number of non-printing control characters, but for some reason they have fallen out of favor. Among the control characters are STX (0x2) and ETX (0x3). Their position in the list of character codes indicates their importance: they were used to delimit text. With character codes such as these, parsing data into strings becomes trivial:

  1. Start parsing a string when you see an STX code.
  2. Continue until you see an ETX code, you see a non-character code, you reach a preset maximum length, or you run out of data.
  3. If the last code was ETX, you’ve got a good string. Otherwise, you’ve encountered an error, and you can do whatever error handling you like.
  4. There are no more steps. The characters in the string are all literal, no unescaping necessary.
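
The steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions of my own: the names `parse_strings`, `is_text`, and `DelimiterError` are made up, "non-character code" is taken to mean any C0 control code other than tab, line feed, and carriage return (plus DEL), and the default maximum length of 1024 is arbitrary.

```python
STX = "\x02"  # start of text
ETX = "\x03"  # end of text


class DelimiterError(ValueError):
    """Raised when a string is not cleanly terminated by ETX."""


def is_text(ch):
    # Assumption: tab, LF, and CR count as text; other C0 controls and DEL do not.
    return ch in "\t\n\r" or (ch >= " " and ch != "\x7f")


def parse_strings(data, max_length=1024):
    """Return every STX...ETX delimited string found in data."""
    strings = []
    i, n = 0, len(data)
    while i < n:
        if data[i] != STX:
            i += 1  # ignore anything outside a delimited string
            continue
        i += 1  # step 1: STX seen, the string starts here
        start = i
        # Step 2: continue until ETX, a non-character code,
        # the maximum length, or the end of the data.
        while i < n and data[i] != ETX:
            if not is_text(data[i]):
                raise DelimiterError(
                    f"non-character code {ord(data[i]):#04x} at offset {i}"
                )
            if i - start >= max_length:
                raise DelimiterError(f"string longer than {max_length} characters")
            i += 1
        # Step 3: only an ETX makes it a good string.
        if i == n:
            raise DelimiterError("ran out of data before ETX")
        # Step 4: the characters are literal, so there is nothing to unescape.
        strings.append(data[start:i])
        i += 1  # move past the ETX
    return strings
```

Calling `parse_strings("\x02<b>&amp;</b>\x03")` hands back the markup characters exactly as they were written, because angle brackets and ampersands are just text here. What to do with a `DelimiterError` is left to the caller.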

Unicode has added some similar codes such as SOS and ST. I’d like to see even more control codes, to allow for fine-grained specification of the structure of the text. For example, we could have control codes to delimit words, sentences, paragraphs, etc. This would be similar to tags in HTML but without the use of printable characters to represent the tags.

Why don’t we do this now? One objection is that files containing control characters are not human readable. I think that this is a lame excuse, because no computer file is human readable. Although my hard drive is enclosed, preventing me from examining the files on there, I have burned text files to DVD, and no matter how long I stare and squint at the shiny bottom, all I can see is my own reflection. Anyway, a lot of markup is human readable only in the sense that Derrida is human readable: there is a series of legible text characters, but do you really want to wade through all the crap to make sense of it?

Perhaps the real point underlying this objection is that control-character delimited text would not be readable by simple (i.e., dumb) text editors. This is true, but why should we be ruled by the lowest common denominator? Many modern text editors are quite intelligent and could handle the new format easily. They can already parse various forms of syntax and highlight them for the user. Let’s not let backward compatibility hold us back. That’s certainly not the Apple Way. It’s not entirely the Microsoft Way either; after all, the Word file format makes no concession to simple text editors. Neither does the cross-platform Adobe PDF.

The most powerful objection to using control characters as text delimiters is that we shouldn’t force users to learn how to input control characters along with text. I agree, which is why I think the burden should be placed on computer programs — the text editors and command line interpreters — rather than on users. When taking text input from users, an app should do the following:

  1. Use the context to guess the user’s intention.
  2. Give a visual indication of the guess to the user. Syntax coloring is one example, but the possibilities are endless. Be creative.
  3. Make it easy for the user to correct bad guesses.

In command line interpreters, by the way, there’s no good reason why the space key needs to separate arguments, as opposed to a key for a non-printable character such as escape. It’s the 21st century, by Jove, and we should finally be able to use any printable character in a file name, including colons, quotes, slashes, and spaces, without having to do voodoo on the command line just to refer to it! (I won’t even mention hierarchical file systems, which are themselves a bad idea. Oops, I just did. Since I mentioned it, the ideal behavior when a user enters a file name is to quickly find the named file or files, which any decent file system should be able to do, and show a visual preview so that the user can verify or choose the correct file, if necessary.)

This rant has been brought to you by BBEdit. The makers of BBEdit, I assume, take no responsibility or credit for the content here, nor do they endorse the opinions I’ve expressed. (Or do they?)

(No, not as far as I know, which is nothing. In any case, I do endorse BBEdit.)

5 Responses to “Bring back STX and ETX”

  1. Speaking of parsing, Vienna 2.1.0.2108.218.21.1.21.8.21.821 seems to get confused on these:

    feed://gigliwood.com/weblog/?flav=rss
    feed://mikezornek.com/feed/
    feed://boredzo.org/blog/feed/atom/

    In all three cases I can see article titles but not the actual article contents. Safari displays all three correctly.

    Because I know you love getting bug reports on your blog.

  2. Jeff says:

    Scott, I can’t reproduce the problem on my machine. Perhaps we could continue the conversation here. :-)

  3. Peter Hosey says:

    And for a lightweight database format, I recommend using the FS and RS characters (field and record separator).

  4. [...] I use the WP-Cache plug-in here on the blog in case that anything I post here should get dugg/linked-listed/reddited/etc. Over on Jeff Johnson’s blog, Scott Stevenson says: … Vienna … seems to get confused on these: [...]

  5. jpc says:

    If you want to use smart software why bother with codings that are not 8-bit clean? Just use netstrings[1] everywhere?

    [1] http://cr.yp.to/proto/netstrings.txt