(archive 'newLISPer)

November 17, 2009

newLISP Bayesian Comment Spam Killer

Filed under: newLISP — newlisper @ 19:35

This post describes the newLISP Bayesian Comment Spam Killer. It won’t kill Bayesian comments – although it might – but it tries to kill spam comments on blogs, using Bayesian analysis.

The story starts after the aspiring commenter clicks the Submit button on the comment form, and after the CGI script or web framework has extracted the information from the commenter’s posted submission. To make things easy, here are some declarations that get me quickly to the same position:

(set 'comment-date "20091114T163223Z")
(set 'storyid "projectnestorpart1")
(set 'comment "Very nice site!")
(set 'commentator "svQrVW a href=\"http://asdfhh.com/")
(set 'commentator-uri "svQrVW a href=\"http://asdfhh.com/")
(set 'ip-address '("94.102.60.174"))

The first thing to do is to save this information in a file. There are many ways to do this, but I like to save data in newLISP format wherever possible, because it saves time and effort when reading it back in:

; make a suitable path name
(set 'path (string {/Users/me/blog/comments/}
    storyid "-" comment-date ".txt"))
; save as association list
(set 'comment-list
     (list
        (list 'comment-date comment-date)
        (list 'storyid storyid)
        (list 'commentator commentator)
        (list 'comment comment)
        (list 'ip-address ip-address)
        (list 'status "spam")
        (list 'commentator-uri commentator-uri)))
(save path 'comment-list)
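
Reading a comment back in later is the pay-off for using newLISP format: load evaluates the file and re-creates the list, ready for lookup. A minimal sketch, assuming it runs in the same context that saved the file:

; load re-creates comment-list from the saved file
(load path)
(lookup 'comment comment-list)       ;-> "Very nice site!"
(lookup 'ip-address comment-list)    ;-> ("94.102.60.174")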

A few weeks after opening a comments form to the intelligent citizens of cyberspace, there will be hundreds of little newLISP files in the directory, containing all kinds of comment. Each file looks something like this:

(set 'Comments:comment-list '(
  (Comments:comment-date "20091114T163223Z")
  (Comments:storyid "projectnestorpart1")
  (Comments:comment "svQrVW a href=\"http://asdfhh.com/ etc etc ")
  (Comments:commentator "svQrVW a href=\"http://asdfhh.com/")
  (Comments:commentator-uri "svQrVW a href=\"http://asdfhh.com/")
  (Comments:ip-address ("94.102.60.174"))
  (Comments:status "spam")
  ))

I’ve added a status tag to each one, with the default value of “spam”. That means that every comment so far is considered spam. That’s not good (although very close to the actual truth), so I must also manually alter any genuine comments and tag them as “approved”. That’s a vital task, and for a while I did it by hand, until the collection of comments was large enough for me to trust the Bayesian analysis to do it automatically.
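
Approving a genuine comment by hand is just a matter of loading its file, replacing the status association, and saving it again. A minimal sketch, assuming path points at the file in question and that the saved symbols live in the Comments context as shown above:

; flip a saved comment's status from "spam" to "approved"
(load path)
(setf (assoc 'Comments:status Comments:comment-list)
      '(Comments:status "approved"))
(save path 'Comments:comment-list)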

Once I’ve got a reasonable collection of comments, I’m ready to start building the Comment Spam Killer.

(context 'Comments)

A little macro I’ve been using recently provides a modified append:

(define-macro (extend)
  (setf (eval (args 0)) (append (eval (args 0)) (eval (args 1)))))

It takes a symbol holding a list, plus a second list, and appends the second list’s elements to the end of the symbol’s current contents.
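
For example, a hypothetical word list grows like this:

(set 'word-list '("very" "nice"))
(extend word-list '("site"))
word-list
;-> ("very" "nice" "site")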

I want somewhere to store the analysis:

(define MAIN:spam-corpus)

This function extracts a list of the words used in all the comments, sorting them into two lists, spam-comments and genuine-comments, which need to exist before the first call:

(set 'spam-comments '())
(set 'genuine-comments '())

(define (build-word-lists dir)
    (dolist (nde (directory dir {^[^.].*txt}))
       (if (directory? (append dir nde))
         ; directory, recurse
         (build-word-lists (append dir nde "/"))
         ; file: read info and make a list of its contents
         (letn  ((file            (string dir nde))
                 (comment-list    (load file))
                 (commentator     (lookup 'commentator comment-list))
                 (comment         (lookup 'comment comment-list))
                 (comment-status  (lookup 'status comment-list))
                 (commentator-ip  (lookup 'ip-address comment-list))
                 (commentator-uri (lookup 'commentator-uri comment-list))
                 (word-list '()))
              (extend word-list (parse commentator       "[^A-Za-z]" 0))
              (extend word-list (parse comment           "[^A-Za-z]" 0))
              (extend word-list (parse commentator-uri   "[^A-Za-z]" 0))
              ; sometimes ip addresses are stored in a list...
              (if (list? commentator-ip)
                  (dolist (i commentator-ip) (extend word-list (list i))))
              (cond
                  ((= comment-status "approved")
                      (extend genuine-comments (clean empty? word-list)))
                  ((= comment-status "spam")
                      (extend spam-comments (clean empty? word-list))))))))
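
Pointing it at the comments directory from earlier fills the two lists:

; walk the saved comment files and gather their words
(build-word-lists "/Users/me/blog/comments/")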

And the two lists can be turned into a Bayesian-ready dictionary with:

(bayes-train spam-comments genuine-comments 'MAIN:spam-corpus)

The resulting spam-corpus is a context that provides two numbers for each word in the comments. Here’s an informative extract:

 ;...
 ("prepended" (0 2))
 ("prescription" (36 0))
 ("present" (0 1))
 ("presepe" (3 0))
 ("pretty" (0 1))
 ("price" (2 0))
 ("primari" (3 0))
 ("primaria" (6 0))
 ("primitive" (0 1))
 ("princessdc" (2 0))
 ("print" (0 2))
 ("printing" (4 0))
 ("println" (0 5))
 ("prior" (0 1))
 ("priors" (0 3))
 ;...

The spam-corpus context stores, for each word, the number of times it occurs in the first category (the spam comments) and in the second category (the genuine comments). The apparent discrepancy between print and printing is easily resolved by looking at the original comments: printing turned up in spam about custom T-shirt printing, whereas print was twice mentioned in a piece of newLISP code in a genuine comment.

A similar function rebuilds the word list for a single comment file and passes it to bayes-query,

(define (analyse-comment file)
    (letn ((comment-list    (load file))
           (commentator     (lookup 'commentator comment-list))
           (comment         (lookup 'comment comment-list))
           (comment-status  (lookup 'status comment-list))
           (commentator-ip  (lookup 'ip-address comment-list))
           (commentator-uri (lookup 'commentator-uri comment-list))
           (word-list '()))
        (extend word-list (parse commentator       "[^A-Za-z]" 0))
        (extend word-list (parse comment           "[^A-Za-z]" 0))
        (extend word-list (parse commentator-uri   "[^A-Za-z]" 0))
        (if (list? commentator-ip)
            (dolist (i commentator-ip) (extend word-list (list i))))
        (set 'word-list (clean empty? word-list))
        (set 'spam-score (bayes-query word-list 'MAIN:spam-corpus))))

which returns a two-valued spam score for the comment. The two numbers are the probabilities that it belongs in the first (spam) or second (genuine) category.

It’s now easy to decide whether to reject a comment based on the two numbers returned by this function. The example I started with scores (1 0), a clear indication that this apparently harmless phrase, when considered as part of the submission as a whole, usually comes from a spammer.
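
In practice the decision can live in a small predicate. A minimal sketch, still inside the Comments context, with 0.9 as a purely illustrative threshold:

; hypothetical helper: the first number from bayes-query is the
; probability that the comment belongs in the spam category
(define (spam-comment? file)
    (> (first (analyse-comment file)) 0.9))

(spam-comment? "/Users/me/blog/comments/projectnestorpart1-20091114T163223Z.txt")
;-> true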

If you’re wondering where the comments form is on this site – well, there isn’t one; I decided against using up disk space storing hundreds of unwanted comments!
