(archive 'newLISPer)

July 25, 2008

Character reference

Filed under: newLISP — newlisper @ 22:43
I was looking through an old (1990!) book on Unicode the other day. I’ve always been intrigued by the amazing diversity of letter forms that we’ve created over the last few thousand years. Here are just some of the many wonderful and peculiar characters you’ll find tucked away in the Unicode glyph banks:

۞ ⱁ ᚅ Ꮈ Ϣ ܍ ⫷ ⨸ ℻

You’ll also find the I Ching, Braille, alchemy, an alphabet funded by George Bernard Shaw, neo-pagan tree language, astrology, dentists, talking leaves, and much more besides.

Most of the technical aspects of Unicode escape me (supplementary planes, normalization, high surrogates, collation?) but it’s useful to know the basics of using Unicode in newLISP, particularly now that Unicode, in the form of UTF-8, is the most popular encoding used on the internet.

newLISP is UTF-8 friendly by default on Mac OS X, and UTF-8 versions are available for other platforms too (although I’m not sure whether the default versions are UTF-8). UTF-8 is a variable-length character encoding, which allows characters to use 1, 2, 3 or 4 bytes depending on their Unicode value.
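You can see those variable lengths for yourself. This is only a sketch, and it assumes a UTF-8 enabled build of newLISP reading UTF-8 encoded source. length counts bytes, while utf8len (which exists only in the UTF-8 versions) counts characters:

```newlisp
; length reports bytes, not characters
(map length '("a" "é" "€" "𝄞"))
;-> (1 2 3 4)

; utf8len reports characters, so each of these one-character
; strings should count as 1
(map utf8len '("a" "é" "€" "𝄞"))
```

The four sample strings are my own choice: one character each from the 1, 2, 3 and 4 byte ranges.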

One essential newLISP function for exploring the Unicode character set is char. This takes either a number or a character, and returns the matching character or number:

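(char 63498)
""

(char "")
63498

(63498 is U+F80A, which sits in Unicode’s Private Use Area, so whether you see anything between those quotes depends on your fonts.)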
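The same round trip is easier to follow with a character that most fonts can display. A small sketch, again assuming a UTF-8 enabled newLISP (9824 is decimal for U+2660, the black spade):

```newlisp
; number to character
(char 9824)
;-> "♠"

; character to number
(char "♠")
;-> 9824

; given a longer string, char looks only at the first character
(char "♠♡")
;-> 9824
```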

Unicode characters are usually described using hexadecimal, so it’s useful to know how to translate between hex and decimal. To convert a decimal integer to a hex string, use format:

(format "%llx" 63498)
"f80a"

To convert a hex string to a decimal integer, pass a hexadecimal string starting with “0x” to int:

(int (string "0x" "f80a"))
63498
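If you do these conversions a lot, they can be wrapped up as two small helpers. This is just a sketch: u-plus and from-hex are names I’ve made up for the example, not newLISP built-ins, and the plain %X format assumes the value fits in 32 bits, which every Unicode code point does:

```newlisp
; u-plus and from-hex are hypothetical helper names, not built-ins

; format an integer in the conventional U+XXXX notation
(define (u-plus n)
  (format "U+%04X" n))

; parse a bare hex string by prefixing "0x", as in the example above
(define (from-hex hex-str)
  (int (string "0x" hex-str)))

(u-plus 9824)
;-> "U+2660"

(from-hex "2660")
;-> 9824

(u-plus (from-hex "f80a"))
;-> "U+F80A"
```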

When you’re writing text, it would be good if you could easily insert these characters as you type. There are useful system tools for doing this (on Mac OS X, there’s the Character Palette), but for fun I’ve added the following two functions to the Markdown converter that I use to process my writing:

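; turn a string such as "U2660" into the character with that hex code:
; drop the leading "U", prefix "0x", and read the rest as a base-16 integer
(define (hex-str-to-unicode-char strng)
   (char (int (string "0x" (1 strng)) 0 16)))

; replace every "U" followed by four or more hex digits with its character
; (the final 1 is the regex option for case-insensitive matching)
(define (ustring s)
  (replace "U[0-9a-f]{4,}" s (hex-str-to-unicode-char $0) 1))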

So now I can type “U” followed by four or more hexadecimal digits, and the appropriate Unicode character is inserted automatically: “U f80a” is converted to “”. (I had to insert a space after the U to prevent translation.)
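Here is a quick way to try ustring at the console. It’s a sketch that repeats the two definitions so that it runs on its own, it assumes a UTF-8 enabled newLISP, and the codes are written without the protective space:

```newlisp
; the two definitions from above, repeated so this snippet is self-contained
(define (hex-str-to-unicode-char strng)
   (char (int (string "0x" (1 strng)) 0 16)))

(define (ustring s)
  (replace "U[0-9a-f]{4,}" s (hex-str-to-unicode-char $0) 1))

; each U-code is replaced by its character; other text is left alone
(ustring "U2660 U2661 U2662 U2663")
;-> "♠ ♡ ♢ ♣"
```

One thing to watch: the pattern takes as many hex digits as it can find, so a code followed directly by any of the letters a to f would be read as one longer number. Leaving a space or punctuation after each code avoids that.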

You can happily use Unicode characters anywhere in newLISP code, if your text editor or console is up to the job. And if ustring is available, you can generate them easily too:

(constant (sym (ustring "U2660")) 4   ; spades
          (sym (ustring "U2661")) 3   ; hearts
          (sym (ustring "U2662")) 2   ; diamonds
          (sym (ustring "U2663")) 1)  ; clubs

(symbols)

(! != $ $0 $1 $10 $11 $12 $13 $14 $15 $2 $3 $4 $5 $6 $7 $8 $9 $HOME $args $idx $main-args ...  zero? | ~ ♠ ♡ ♢ ♣)

(println "(> ♢ ♣)? " (> ♢ ♣))
(> ♢ ♣)? true

(println "(> ♡ ♠)? " (> ♡ ♠))
(> ♡ ♠)? nil

Using descriptive Unicode characters for your symbol names could introduce a whole new level of readability to your code!

(constant (global '☼)  MAIN)
(context '☺)

(define (☻ ✄ ☁ ⍾)
   (print ✄ ☁ ⍾))

(define (‽)
   (println {‽}))

(context ☼)
(set '℥ "what "  'ᴥ "the " 'ᴒ "dickens")
(☺:☻ ℥ ᴥ ᴒ)
(☺:‽)

Appropriately enough, that last function call returns “‽”, which is the much-needed interrobang character.

The problem now is to remember all those four-digit hexadecimal numbers that identify the Unicode characters. I whipped up a quick Unicode browser in newLISP.

It just shows a page of Unicode characters at a time, and lets you move up and down through the ‘pages’. It has some problems when the character code exceeds FFFF – I don’t know why‽
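In the same spirit, here is a minimal console sketch of paging through code points. It is not the browser described above: show-page and its parameters are names invented for this example, and it assumes a UTF-8 enabled newLISP printing to a UTF-8 terminal:

```newlisp
; show-page is a hypothetical helper, not the browser described above.
; It prints one 'page' of characters, each preceded by its hex code.
(define (show-page page (per-page 64) (per-row 8))
  (let (start (* page per-page))
    (for (i 0 (- per-page 1))
      (print (format "%04X " (+ start i)) (char (+ start i)) "  ")
      ; start a new line after every per-row characters
      (if (= (% (+ i 1) per-row) 0) (println)))))

; page 152 starts at 152 * 64 = 9728, which is U+2600
(show-page 152)
```

Moving up and down is then just a matter of calling show-page with the next or previous page number. The first few pages are control characters, so they are best skipped.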

This post should display correctly on most modern browsers. If you see lots of boxes rather than characters, then you are using a browser or system that doesn’t handle Unicode well. This applies to the iPhone and iPod Touch as well: it appears that Mobile Safari doesn’t like Unicode as much as its desktop version. Apple – improve Unicode support please!

1 Comment »

  1. to type these characters on Mac OS X, select the Unicode Hex Input mode from the Keyboard menu icon, then hold down the option key and type the four digit hex code…

    Comment by newlisper — February 10, 2011 @ 09:24 | Reply

