(archive 'newLISPer)

March 24, 2007

and counting

Filed under: newLISP — newlisper @ 19:18

Here’s an unusual sight on this blog: a Perl script. This one counts the number of times the word “and” appears in Sir Arthur Conan Doyle’s great Sherlock Holmes novel, “The Hound of the Baskervilles”, which I’ve downloaded in plain text form from (Project Gutenberg).

#!/usr/bin/perl -w
@ARGV = ("/Users/me/hound-of-baskervilles.txt");
$count = 0;
while (<>) {
    foreach my $wd (m/\band/ig) {
print $count,"\n";

This works OK, and it’s quick too, taking about 75 milliseconds on my machine to produce the answer 1361.

(Perl’s pretty handy stuff, really, and I’ve written some in my time, although I confess I’ve never really got on well with it. It looks like there’s been an explosion in a punctuation mark factory. But Larry Wall, Perl’s creator, is a cool guy, and funny, too:

Lisp has all the visual appeal of oatmeal with fingernail clippings mixed in.

Python’s syntax succeeds in combining the mistakes of Lisp and Fortran. I do not construe that as progress.

If I don’t document something, it’s usually either for a good reason, or a bad reason. In this case it’s a good reason.

No, I’m not going to explain it. If you can’t figure it out, you didn’t want to know anyway…

and there’s plenty more!)

Anyway, what about a “fingernail clipping in oatmeal” version?

  (set 'path "/Users/me/hound-of-baskervilles.txt")
  (set 'file (open path "read"))
  (set 'count_ 0)
  (while (read-line file)
    (set 'line (parse (current-line) "\\W" 1))
    (dolist (i line)
      (if (= i "and")
          (inc 'count_))))
  (println count_)

Hmm, it works, but it doesn’t look good. And it’s slow, too – about 1.2 seconds. Each line of the file is parsed, and split at non-word characters, and then the resulting list of words is tested, one by one. Plainly it’s not a good approach.

Perhaps it would be better to make use of newLISP’s count function.

; file and path are the same as before...
(set 'count_ 0)
(while (read-line file)
   (set 'line (parse (current-line) {[[:^alpha:]]} 1))
   (inc 'count_ (first (count '("and") line))))
(println count_)

Looks a bit better, and a bit more Lisp-y, but still about 1.3 seconds. I’ve used a POSIX-style character class as well, just for a change.

Is it the parse that’s taking some time? Let’s try not parsing strings into lists:

; file and path are the same as before...
(set 'count_ 0)
(while (read-line file)
  (replace "\\Wand\\W" (current-line) (inc 'count_) 1))
(println count_)

This uses replace with an active replacement expression to increase the counter each time the word is found. It looks more compact, but it’s still too slow, at just over a second.

Or perhaps it’s not parsing that’s the problem, but the way we read the file line by line? Perhaps Perl is faster at that than we are. So let’s not do that; we’ll parse the whole file in one go:

; file and path are the same as before...
  (count '("and") (parse (read-file path) {\s+} 1)))

Ah. This is 0.6 seconds now, and it looks quite good too. For a change, I used curly brackets to tell parse to split the text up at white space. Doing files in one go like this might give us problems with extremely large documents. Having said that, I tried it without problems on a 4 million word (20 MBytes) document, so it’s OK for smaller stuff such as “War and Peace” or the Bible, which both weigh in at less than 1 million words each.

I still think that parse is doing more work than necessary – after all, we don’t need the resulting list for further processing; we just want the number of occurrences of the string. So what about a parse-less whole-file version:

; path is the same as before...
(set 'count_ 0)
(replace "\\Wand\\W" (read-file path) (inc 'count_) 1)
(println count_)

Now we’re down to 100 milliseconds, within sight of our Perl target. Let’s combine the successful elements from our previous attempts to get an idiomatic and fast solution:

; path is the same as before...
(println (length (find-all "(?i)\\band" (read-file path))))

Now it’s less than 50 milliseconds, and an easy to read one liner, too. The (?i) is another way to specify case-insensitivity; with find-all the pattern of arguments is different, so it’s useful to put the option there rather than at the end. And there are fewer fingernail clippings than in the Perl version (if you include curly braces)! ;-)

I’m sure the Perl script can be made faster (and punctuation-free) too – but that’s a job for a Perl-lover… Besides, this isn’t really a post about performance, but about my ongoing search for idiomatic, Lisp-y solutions, which are likely to be shorter, faster, more reliable – and perhaps easier to write too, eventually.

By the way, to time newLISP expressions, use time – see Time Flies for another post about timing expressions.

And I ought to say that the best newLISP solution was provided by Mr newLISP himself. Thanks, Lutz!


1 Comment »

  1. >while i am a great fan of newlisp a perlhacker would probably do something like:print scalar (grep {/\band/g} ())which is equally fast on my machine

    Comment by rschmutz — May 30, 2007 @ 00:54 | Reply

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: