(archive 'newLISPer)

December 24, 2008

Spellbound

Filed under: newLISP — newlisper @ 22:57
Tags:

Update I think he’s updated his table, and unsorted the results. No problem. :)

I’ve been enjoying Peter Norvig’s article in which he explores ways of writing software to correct typos and spelling mistakes in English text. You can find it here.

His general idea is to deliberately misspell a wrongly-spelled word in all kinds of simple ways, such as transposing two letters, omitting a letter, including an extra letter, and so on. One of the ‘misspellings’, if found in a suitable dictionary, is quite likely to be the correct spelling of the original word.

Although it’s not the primary point of his article, I can’t help noticing that he’s secretly quite proud of the brevity of his Python script (“So here, in 21 lines of Python 2.5 code, is the complete spelling corrector”), and he included at the end the current score table of the number of lines of code required by different languages

I don’t really get this “lines of code” thing. That’s partly because I’m more used to talking about brevity in terms of words or characters, rather than lines. A journalist, asked to produce 150 lines of copy, would immediately ask “How many words in a line?’. And in newLISP (and probably many other languages), there’s no real restriction on formatting, because white space is provided for the benefit of the human reader or writer rather than for the machine interpreter or compiler. In one sense, all newLISP programs occupy a single line, since a linefeed is just another way for the human writer to format their code and generally has no significance for the interpreter.

Brevity may well be the soul of wit, but it can also be the first step to unreadable code. A better way to compare the brevity of scripts would be to count how many significant keywords were required, or even how many characters were used, excluding identifiers, which you would obviously want to be expressive and informative rather than cryptic, just for the sake of saving a few letters.

Just for fun, though, I couldn’t help but take up the implicit challenge of writing a newLISP version of Peter Norvig’s Python script, to see if I could get it as concise without being unintelligible or unidiomatic. I don’t know any Python at all, but I felt that his script was as compressed as he could make it without cheating (nested loops on the same line, for example), so I felt I could adopt a similar approach for newLISP.

Here’s the Python:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])

As far as I can understand the original, here’s a version in newLISP:

(bayes-train (map lower-case (parse (read-file {big.dms}) {\W} 0)) 'Lexicon)
(define (edits1 word)
 (let ((l (length word)) (res '()))
  (for (i 0 (- l 1)) (push (replace (nth i (string word)) (string word) "") res -1))
  (for (i 0 (- l 2)) (push (swap i (+ i 1) (string word)) res -1))
  (for (i 0 (- l 1)) (for (c (char "a") (char "z"))
    (push (replace (nth i (string word)) (string word) (char c)) res -1)
    (push (replace (nth i (string word)) (string word) (string (char c) (nth i (string word)) )) res -1)))
  res))
(define (edits2 word) (unique (flat (map edits1 (edits1 word)))))
(define (known-edits2 word) (filter (fn (w) (Lexicon w)) (edits2 word)))
(define (known words) (filter (fn (w) (Lexicon w)) words))
(define (correct word)
  (let ((candidates (or (known (list word)) (known (edits1 word)) (known (edits2 word)))))
    (first (sort candidates (fn (w1 w2) (> (Lexicon w1) (Lexicon w2)))))))

And to test it:

(join (map correct (parse "tihs sentnce is ful of inkorrect spelingz")) " ")

;-> this sentence is full of incorrect spelling

(By the way, this newLISP code runs on versions 9.4 and higher, but doesn’t run on 9.3.)

If you allow me the single-line function definitions (like his words function), then I reckon newLISP is about 15 non-blank lines of code.

In fact, character and word counts are pretty similar. And I think the Python version of edits1 is neater than mine – I’m struggling with my string indexing at the moment. But newLISP has bayes-train batteries built in, and it doesn’t have to import regular expressions or collections. I have no idea whether it runs fast or slow – I don’t have the time to optimize size and performance, or to check whether the code is doing what I thought it was.

As I said, it’s just for fun, so don’t take it seriously, and I won’t try to knock Peter’s score off the top on his own web page. But, secretly, just between you and me:

Language Lines of Code Author (and link to implementation)
newLISP 15 cormullion
Awk 28 Gregory Grefenstette
C 184 Marcelo Toledo
C++ 98 Felipe Farinon
C# 22 Lorenzo Stoakes
Clojure 18 Rich Hickey
D 23 Leonardo M
Erlang 87 Federico Feroldi
F# 16 Dejan Jelovic
F# 34 Sebastian G
Groovy 23 Rael Cunha
Haskell 24 Grzegorz
Java 36 Rael Cunha
Java 372 Dominik Schulz
Perl 63 riffraff
PHP 68 Felipe Ribeiro
PHP 103 Joe Sanders
Python 21 Peter Norvig
Rebol 133 Cyphre
Ruby 34 Brian Adkins
Scheme 45 Shiro
Scheme 89 Jens Axel

:-)

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.