Tag Archives: AI


I put this on my website a long time ago, maybe around 1996, as an HTML page. This is it moved to my blog.



LISP is short for LISt Processing. It was created by John McCarthy circa 1960, and is used today mainly in AI. There are many dialects of LISP, though Common LISP is now the agreed standard.

LISP is an interpreted, functional language with no built-in Object-Oriented mechanisms (Note from 2015: ANSI Common Lisp does now). Type-checking is loose, and because the program is capable of creating and executing new code during run-time, type-checking occurs at run-time.

LISP is characterised by its unusual, but simple, notation. Everything in LISP is a list. Lists are space-separated sequences of elements in parentheses. For instance, (a b c).

There is no distinction between the program and the data which the program uses. This allows programs to manipulate their own commands, which in turn allows programmers to act as meta-programmers.

A LISP program may create lists which it may then process. A List may contain other lists or atoms. All lists resolve to atoms, so all Lists which contain other lists are actually processed as lists of simple atoms. This method is intended to allow symbolic programming through use of recursive functions.

This neat concept enables straightforward coding of some classes of problems,  but
makes it harder to code most other programs. Data structure and process are emergent
rather than clearly defined, so the program architecture is not easily apparent.


Using functions

The first element in the list may be a function name or simple operator. The other
elements in the list are then regarded as the parameters to the function.

For instance, (* 2 3) or (foo 2 5).

Almost all expressions in LISP are of this form. In fact, many of the language’s keywords are simply pre-defined LISP functions, which must be used in the same manner as other functions. There are many of these built-in functions, which I will not describe here.

Preventing evaluation

If LISP attempts to evaluate a list which does not begin with a function name (for instance, (1 2 3)) then execution will halt with an error. When such lists are passed as parameters, the quote keyword (or its abbreviation: ' ) is used. This prevents the interpreter from first attempting to evaluate the list into an atom. For instance, ( foo 2 '(a b c) ).

Building new lists

To return a list from a function, the list function may be used. Without using this, LISP would attempt to evaluate the list before it is returned.

For instance, (list 'a 'b 'c) evaluates to (a b c).

Note: Programs can force the evaluation of a list using the eval keyword.

Defining functions

The defun keyword can be used to define new functions. For instance,

( defun timesthree (x)
  (* x 3) )

defines a function which multiplies a value by 3. The new function may be used as so:
(timesthree 2) evaluates to 6.



By using the convention that nil (the empty list) is false and any other value is true, LISP allows conditional branching and boolean logic.

The cond function takes a series of condition-result pairs. This construct is similar to a Switch-Case block in C. For instance,

( defun abs-val (x)
  (cond ( (< x 0) (- x) )
        ( (>= x 0) x ) ) )

The if function takes an expression to be examined as true or false, and returns one of its two other parameters, depending upon the result. For instance,

(if (< x 0) (- x) x)

Boolean operators

The and and or functions act as boolean operators. Their left to right checking gives rise to a side-effect which is often used as a conditional branching technique. and stops checking when it encounters one item which is false, while or stops checking when it encounters one true item.


Pure LISP has no loop constructs, so recursion is the primary way to process data. A recursive function is a function which calls itself. This technique recognises that an operation on some data may be best expressed as the aggregate of the same operation performed on each item of data of which it is comprised. Obviously this technique is best used with data structures which have the same form at both higher and lower levels, differing only in scale.

This focus on recursion is the reason for LISP’s popularity with AI researchers, who often attempt to model large-scale behaviour in terms of smaller-scale decisions. For instance, recursion is often used in LISP to search state spaces.

Many of the lists used in LISP programs would be better referred to as trees. Lists are simply the mechanism used to represent those trees.

The car and cdr functions are generally used to recurse (or ‘walk’) through the elements in a tree, while cons is often used to gradually build tree structures to form the result of a recursive operation. By also using the null function to test for an empty list, we can walk through the tree structure, dealing with successively smaller pieces of the tree.

car returns the first element of a list. For instance, ( car '(a b c) ) evaluates to a.

cdr returns the list with the first element removed. For instance, ( cdr '(a b c) ) evaluates to (b c).

cons is an associated function which is used to build tree structures, often to form the result of a recursive operation. Note that it does not simply concatenate lists, but prepends its first argument as a single element, undoing the effects of a hypothetical use of car and cdr. For instance, ( cons '(a b) '(c d e) ) evaluates to ( (a b) c d e ) rather than (a b c d e).

Note that the use of these functions can lead to a great deal of inefficient copying.


Global variables

The setf macro assigns to the place referred to.  Note that the place does not need to be an explicitly named variable. For instance,

(setf x '(a b c))
(setf (car x) 1)   ; (car x) refers to the 1st element of x

x now evaluates to (1 b c).

Local variables

The let function declares a local scope.

The parameters to the let function are a list of local variable bindings and a sequence of expressions which may use these local variables. The let function effectively brackets the expressions, providing them with their own local variables. For instance,

(let ((a 1) (b 2))
  (+ a b))

binds a and b as local variables, then evaluates the expressions contained in its body, here giving 3.

Bayesian Belief Networks

I put this on my website a long time ago, maybe around 1998, as an HTML page. This is it moved to my blog.



Expert systems often calculate the probabilities of inter-dependent events by giving each parent event a weighting. Bayesian Belief Networks provide a mathematically correct and therefore more accurate method of measuring the effects of events on each other. The mathematics involved also allow us to calculate in both directions. So we can, for instance, find out which event was the most likely cause of another.

Bayesian Probability

Bayes’ Theorem

You are probably familiar with the following Product Rule of probability for independent events:

p(AB) = p(A) * p(B), where p(AB) means the probability of A and B happening.

This is actually a special case of the following Product Rule for dependent events, where p(A | B) means the probability of A given that B has already occurred:

p(AB) = p(A) * p(B | A)
p(AB) = p(B) * p(A | B)

So because: p(A) * p(B | A) = p(B) * p(A | B)
We have: p(A | B) = ( p(A) * p(B | A) ) / p(B), which is the simpler version of Bayes’ Theorem.

This equation gives us the probability of A happening given that B has happened, calculated in terms of other probabilities which we may know.

Note that: p(B) = p(B | A) * p(A) + p(B | ~A) * p(~A)
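The whole calculation is small enough to sketch in Python. The probabilities below are hypothetical, chosen only to exercise the formula:

```python
# A minimal sketch of Bayes' Theorem, p(A | B) = p(A) * p(B | A) / p(B),
# expanding p(B) as p(B | A) * p(A) + p(B | ~A) * p(~A).
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_a * p_b_given_a / p_b

# Hypothetical numbers: p(A) = 0.01, p(B | A) = 0.9, p(B | ~A) = 0.1
posterior = bayes(0.01, 0.9, 0.1)
```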

Chaining Bayes’ Theorem

We may wish to calculate p(AB) given that a third event, I, has happened. This is written p(AB | I). We can use the Product Rule, conditioning every term on I:

p(AB | I) = p(A | I) * p(B | AI)
p(AB | I) = p(B | I) * p(A | BI)

so we have: p(A | BI) = ( p(A | I) * p(B | AI) ) / p(B | I) which is another version of Bayes’ Theorem.

This gives us the probability of A happening given that B and I have happened.

This is often quoted as p(H | EI) = ( p(H | I) * p(E | HI) ) / p(E | I), where p(H | EI) is the probability of Hypothesis H given Evidence E in Context I.

By using the product rule we can chain several probabilities together. For instance, to find the probability of H given that E1, E2 and I have happened:

p(H | E1E2I) = ( p(H | I) * p(E1E2 | HI) ) / p(E1E2| I)

and to find the probability of H given that E1, E2, E3 and I have happened:

p(H | E1E2E3I) = ( p(H | I) * p(E1E2E3 | HI) ) / p(E1E2E3 | I)

Note that p(E1E2E3 | I) = p(E1 | E2E3I) * p(E2E3 | I) = p(E1 | E2E3I) * p(E2 | E3I) * p(E3 | I), which can be used to calculate two of the values in the above equation.

An example of Bayes’ Theorem

p(H | EI) = ( p(H | I) * p(E | HI) ) / p(E | I)
p(H | EI) = ( p(H | I) * p(E | HI) ) / ( p(E | HI) * p(H | I) + p(E | ~HI) * p(~H | I) )
H is the Hypothesis ‘Guilty’,
E is an item of evidence,
I is the context.

p(H | EI) is the probability of the Hypothesis ‘Guilty’ being true, given the evidence in this context.
p(H | I) is the Prior Probability – the subjective probability of the Hypothesis regardless of the evidence.
p(E | HI) is the probability of the evidence being true given that the Hypothesis is true.
p(~H | I) = 1 – p(H | I).
p(E | ~HI) is the probability of the evidence given that the hypothesis is not true – this measures the chances of the evidence being caused by something other than the defendant’s guilt. If this is high then naturally the hypothesis will be unlikely.

Assuming Conditional Independence

If, given that I is true, E1 being true will not affect the probability of E2 being true, then a simpler version of the chained Bayes’ Theorem is possible:

p(H | E1E2I) = ( p(H | I) * p(E1 | HI) * p(E2 | HI) ) / ( p(E1 | I) * p(E2 | I) )

This version makes it very easy to introduce new evidence into the situation. However, Conditional Independence is only true in some special situations.
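A sketch of this chained update in Python, under the conditional-independence assumption. Here the result is normalised by summing the H and ~H branches rather than dividing by each p(Ei | I) directly (the two forms agree when H and ~H are the only possibilities), and all numbers are hypothetical:

```python
# Sketch: updating p(H) with several items of evidence, assuming the
# E_i are conditionally independent given H.
def chained_posterior(p_h, likelihoods_h, likelihoods_not_h):
    num = p_h              # builds p(H | I) * p(E1 | HI) * p(E2 | HI) * ...
    alt = 1 - p_h          # builds p(~H | I) * p(E1 | ~HI) * ...
    for lh, lnh in zip(likelihoods_h, likelihoods_not_h):
        num *= lh
        alt *= lnh
    return num / (num + alt)

# Two items of evidence, each more likely if H is true:
p = chained_posterior(0.5, [0.9, 0.8], [0.1, 0.2])
```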

Prior Probabilities

One component of Bayes’ Theorem is p(H | I), the probability of the hypothesis in context I regardless of the evidence. This is referred to as the Prior Probability. It is generally very subjective and is therefore frowned upon. This is not a problem as long as the prior probability plays a small role in the result. When the result is overly dependent on the prior probability, more evidence should be considered.

Bayesian Belief Networks

A Bayesian Belief Network (BBN) defines various events, the dependencies between them, and the conditional probabilities involved in those dependencies. A BBN can use this information to calculate the probabilities of various possible causes being the actual cause of an event.

Setting up a BBN

For instance, if event C can be affected by events A and B:


We may know the following probabilities:


A:  p(A) = 0.1 (True), p(~A) = 0.9 (False)

B:  p(B) = 0.4 (True), p(~B) = 0.6 (False)

C: Note that when dependencies converge, there may be several conditional probabilities to fill in, though some can be calculated from others because the probabilities for each state should sum to 1.
p(C | AB) = 0.8

p(C | A~B) = 0.6

p(C | ~AB) = 0.5

p(C | ~A~B) = 0.5


p(~C | AB) = 0.2

p(~C | A~B) = 0.4

p(~C | ~AB) = 0.5

p(~C | ~A~B) = 0.5

Calculating Initialised probabilities

Using the known probabilities we may calculate the ‘initialised’ probability of C, by summing the various combinations in which C is true, and breaking those probabilities down into known probabilities:

p(C) = p(CAB) + p(C~AB) + p(CA~B) + p(C~A~B)
= p(C | AB) * p(AB) +
p(C | ~AB) * p(~AB) +
p(C | A~B) * p(A~B) +
p(C | ~A~B) * p(~A~B)
= p(C | AB) * p(A) * p(B) +
p(C | ~AB) * p(~A) * p(B) +
p(C | A~B) * p(A) * p(~B) +
p(C | ~A~B) * p(~A) * p(~B)
= 0.518

So as a result of the conditional probabilities, C has a 0.518 chance of being true in the absence of any other evidence.

Calculating Revised probabilities

If we know that C is true, we can calculate the ‘revised’ probabilities of A or B being true (and therefore the chances that they caused C to be true), by using Bayes’ Theorem with the initialised probability:

p(B | C) = ( p( C | B) * p(B) ) / p(C)
= ( ( p(C | AB) * p(A) + p(C | ~AB) * p(~A) ) * p(B) ) / p(C)
= ( (0.8 * 0.1 + 0.5 * 0.9) * 0.4 ) / 0.518
= 0.409
p(A | C) = ( p( C | A) * p(A) ) / p(C)
= ( ( p(C | AB) * p(B) + p(C | A~B) * p(~B) ) * p(A) ) / p(C)
= ( (0.8 * 0.4 + 0.6 * 0.6) * 0.1 ) / 0.518
= 0.131

So we could say that given C is true, B is more likely to be the cause than A.
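The numbers above can be checked with a short Python sketch of the same calculation:

```python
# Reproduces the worked BBN example: priors for A and B, and the
# conditional probability table p(C | A, B).
p_a, p_b = 0.1, 0.4
p_c = {(True, True): 0.8, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.5}

def p_of(event, prob):
    # p(X) if the event is true, p(~X) otherwise
    return prob if event else 1 - prob

# Initialised probability of C: sum over every A, B combination.
p_c_init = sum(p_c[(a, b)] * p_of(a, p_a) * p_of(b, p_b)
               for a in (True, False) for b in (True, False))

# Revised probabilities given that C is true, via Bayes' Theorem.
p_b_given_c = sum(p_c[(a, True)] * p_of(a, p_a)
                  for a in (True, False)) * p_b / p_c_init
p_a_given_c = sum(p_c[(True, b)] * p_of(b, p_b)
                  for b in (True, False)) * p_a / p_c_init
```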

Information Theory

I put this on my website a long time ago, maybe around 1997, as an HTML page. This is it moved to my blog.



Information is a property of data. A piece of data holds more information if its content is less expected. ‘Man bites dog’ contains more information than ‘Dog bites man’.

The arrival of each new piece of data is an event. Intuitively, if the event is certain then it provides no information. If it is impossible then it provides infinite information. We may represent the Information numerically by using the equation I = log(1/p), sometimes written as I = -log(p), where p is the probability of an event occurring and I is the information provided by that event. This equation satisfies our intuitive ideas about information by providing a value of zero for a certain event and infinity for an impossible event. The value I will never be negative. The base of the logarithm is chosen arbitrarily.

Bits as units of Information

When using logarithms of base 2 to calculate Information, e.g. I = log2(1/p), a value of 1 for I indicates that the event provides enough information to answer a simple yes/no question. There are obvious similarities with the binary system of 1s and 0s. Therefore telecommunications and computer scientists often use base 2 logarithms and refer to each unit of Information as a bit.

In theory any item of information could be conveyed by answering the correct series of yes/no questions. An efficient use of binary storage therefore asks the smallest necessary number of such yes/no questions.
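A small Python sketch of the base-2 measure:

```python
import math

def information_bits(p):
    """Information in bits of an event with probability p: I = log2(1/p)."""
    return math.log2(1 / p)

# A certain event carries no information; a fair coin flip carries one bit;
# one of 256 equally likely byte values carries eight bits.
```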

Information in a system (Entropy)

The amount of information in a system is a measure of the number of possible states which it may have. A more disorganised system with more possible states has greater information and is said to have greater Entropy. Systems tend towards greater entropy, thus becoming more disorganised. The classic example is that of a volume of gas which tends to maximise its entropy.

Note that the amount of entropy in the universe can only increase. A system can only become more organised at the expense of increased disorder elsewhere, generally as a dissipation of heat due to work.

Information capacity

The information capacity of a data store is a measure of how many different states it can be in. For instance, an 8-bit byte can store 8 1s or 0s, in 256 possible combinations. 8 = log2(1/(1/256)) = log2(256). Note that some amount of power is always required to maintain the integrity of any data store because, like any system, it will tend towards disorder.

Similarly, the information capacity of a communications channel is a measure of how many states it can be in during a given time period, stated in bits per second. This is a theoretical maximum capacity which depends on the physical properties of the channel rather than the particular method of coding the data. In theory the channel would actually convey information at the maximum capacity if the data was coded in the most compacted form possible.

Signal to noise ratio

Information theory matured in the field of telecommunications, where all communications channels contain some amount of useless noise.

The term is now often used slightly differently to refer to how compactly a message expresses its information. The English language has a low signal-to-noise ratio in this sense, because in theory many letters and words could be omitted without the reader understanding less.

State Space Search

I put this on my website a long time ago, maybe around 1997 as an HTML page. This is it moved to my blog.

The concept of State Space Search is widely used in Artificial Intelligence. The idea is that a problem can be solved by examining the steps which might be taken towards its solution. Each action takes the solver to a new state.

The classic example is of the Farmer who needs to transport a Chicken, a Fox and some Grain across a river one at a time. The Fox will eat the Chicken if left unsupervised. Likewise the Chicken will eat the Grain.

In this case, the State is described by the positions of the Farmer, Chicken, Fox and Grain. The solver can move between States by making a legal move (which does not result in something being eaten). Non-legal moves are not worth examining.

The solution to such a problem is a list of linked States leading from the Initial State to the Goal State. This may be found either by starting at the Initial State and working towards the Goal State or vice versa.

The required State can be worked towards by either:

  • Depth-First Search: Exploring each strand of a State Space in turn.
  • Breadth-First Search: Exploring every link encountered, examining the state space a level at a time.

These techniques generally use lists of:

  • Closed States: States whose links have all been explored.
  • Open States: States which have been encountered, but have not been fully explored.

Ideally, these lists will also be used to prevent endless loops.
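As a sketch, here is a Breadth-First Search of the Farmer puzzle's state space in Python. The helper names (`safe`, `moves`, `solve`) are my own, and the "seen" set plays the loop-prevention role described above:

```python
from collections import deque

ITEMS = ("chicken", "fox", "grain")

def safe(state):
    """A state is legal if nothing gets eaten while the farmer is away."""
    def alone_together(a, b):
        return state[a] == state[b] != state["farmer"]
    return not (alone_together("fox", "chicken")
                or alone_together("chicken", "grain"))

def moves(state):
    """Yield every legal state reachable in one river crossing."""
    for cargo in (None,) + ITEMS:
        if cargo is not None and state[cargo] != state["farmer"]:
            continue  # the farmer can only take something from his own bank
        new = dict(state)
        new["farmer"] = 1 - state["farmer"]
        if cargo is not None:
            new[cargo] = new["farmer"]
        if safe(new):
            yield new

def solve():
    start = {"farmer": 0, "chicken": 0, "fox": 0, "grain": 0}
    goal = {"farmer": 1, "chicken": 1, "fox": 1, "grain": 1}
    open_states = deque([(start, [start])])    # encountered, not yet explored
    seen = {tuple(sorted(start.items()))}      # prevents endless loops
    while open_states:
        state, path = open_states.popleft()    # popleft => breadth-first
        if state == goal:
            return path
        for nxt in moves(state):
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                open_states.append((nxt, path + [nxt]))

solution = solve()
```

Because the search is breadth-first, the first solution found is a shortest one: seven crossings, i.e. eight linked States.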

Symbolic Logic

I put this on my website a long time ago, maybe around 1996, as an HTML page. This is it moved to my blog.

The structure of this explanation is lifted from ‘An Introduction to Symbolic Logic’ by Susanne K. Langer, without her permission.



Content:

The things or material in a system.


Form:

The way in which the contents are related in a system.


Abstraction:

Separating Form from Content, sometimes by discovering analogies.


Interpretation:

Finding possible Content for Forms.



Degree:

Number of Elements used by a Relation. e.g. Dyadic: ‘is north of’, Triadic: ‘is


Proposition:

Asserts that the Elements are related by the Relation. e.g. ‘Edinburgh’ nt ‘Swindon’.


=int means ‘equals by interpretation’. e.g. ‘nt2’ =int ‘is north of’.

~ means ‘is not true’. e.g. ~’Swindon nt Edinburgh’.



Context:

Consists of elements and relations.

e.g. K(‘Brighton’, ‘Swindon’, ‘Edinburgh’), K =int ‘cities’, nt2 =int ‘is north of’

Universe Of Discourse:

All Elements in the context. e.g. K(a,b,c,…), K =int ‘cities’

Constituent Relations:

The relations used in the context. e.g. nt2 =int ‘is north of’

Elementary Propositions:

The statements that may be made by relating Elements.

e.g. ‘Edinburgh’ nt ‘Swindon’.

Truth Value:

Whether the Elementary Proposition is true or false. e.g. ‘Swindon’ nt ‘Edinburgh’ is false.


⋅ means Conjunction (‘and’)
∨ means Disjunction (‘or’)
⊃ means Implication (‘implies that’)

Logical Relations:

When the truth of one Elementary Proposition is dependent upon the truth of others they
are Logically Related.

e.g. (‘Edinburgh’ nt ‘Swindon’) ⋅ (‘Swindon’ nt ‘Brighton’) ⊃ (‘Edinburgh’ nt ‘Brighton’)

System of Elements:

Context with Elementary Propositions connected by Logical Relations.



Variables:

As in algebra. e.g. x means ‘Edinburgh’ or ‘Swindon’ or ‘Brighton’ etc.

Allows us to summarise Logical Relations of the same form.

e.g. (a nt b) ⋅ (b nt c) ⊃ (a nt c)


Quantifiers:

The Logical Relation may hold for All or Some of the variables.

Universal Quantifier:

(a) means ‘for all a’. e.g. (a) : (a nt b) ⊃ ~(b nt a)

Particular Quantifier:

(∃a) means ‘for at least one a’. e.g. (∃a): (a nt ‘Swindon’)

Propositional Form:

Elementary Proposition or Logical Relation using Variables, whose Truth Value would depend upon the actual Elements substituted. e.g. ‘a nt b’.

General Proposition:

Propositional Form with Quantifiers.



∈ means ‘is a member of’. e.g. ‘Murray’ ∈ B, where B =int ‘Class of Humans’.

General Propositions (see above) concern Classes of Elements.

Defining Form:

Defines the Class in terms of Propositional Forms. The Class contains all elements for which the Propositional Form is True.


Intension:

Meaning of a concept. e.g. the class ‘town’.


Extension:

The elements to which the concept applies. e.g. the class of ‘towns’.

N.B. Classes with unrelated Intensions may share some Elements in Extension.

e.g. ‘Towns north of Swindon’ and ‘Towns with Universities’.


Class Inclusion:

(x): (x ∈ A) ⊃ (x ∈ B) means Class A is included in Class B, by stating that any Element in A is therefore in B.

Unit Class (I):

Has one member, meaning that if two Elements are both in A then they must both be the
same Element.

(∃x) (y): (x ∈ A) ⋅ [ (y ∈ A) ⊃ (x=y) ]

The Null Class (o):

Has no Elements. There is a single Null Class, because two null classes could not be distinguished.

The Universe Class:

Contains all Elements. There is a single Universe Class.

Mutual Inclusion:

Classes have same Elements so each Class is included in the other.

(x): [ (x ∈ A) ⊃ (x ∈ B) ] ⋅ [ (x ∈ B) ⊃ (x ∈ A) ]

Class symbols:

< means Inclusion. e.g. A < B means (x): (x ∈ A) ⊃ (x ∈ B)
X means Conjunction (and). e.g. A X B means the Elements which are in A and in B. X is often omitted e.g. AB
+ means Disjunction (or). e.g. A + B means the Elements which are in A or in B.
– means Complement. e.g. -A means the Elements not in A.
= means Mutual Inclusion e.g. A = B means (A < B) . (B < A)

N.B. A<A, A<I, 0<A

Mutual Exclusion:

Classes have no Elements in common.

A X B = o


Dichotomy:

The fact that I = A + -A.


Exclusion:

A < -B means A is excluded from B.



Predicates:

e.g. A, -A, A X B, A + B

Predicative Propositions:

Propositions about Predicates

System of Classes:

K(a,b,c…) <, similar to System of Elements but with Classes instead of Elements and < as the constituent relation.

Dots instead of brackets:

e.g. :. a : b . cd : e instead of ( a . ( b . ( c . d) ) . e)

You’ll get the hang of it.


Calculus of Classes:

Describes the System of Classes for all Classes, just as the Calculus of Numbers
describes the System of Numbers for all Numbers.

Shows how to deduce some Propositions from others.


Postulates:

Basic Propositions of the system. e.g. (a, b) . a + b = b + a.

There are ten Postulates of the Calculus of Classes, analogous to the uses of Venn Diagrams.


Assumptions:

Self-evident Postulates that are assumed because they cannot be deduced.


Boolean Algebra of Classes

Generalised Calculus of Classes, just as the Algebra of Numbers is the generalised Calculus of Numbers.

Laws of Duality

Conjunction can be defined in terms of Disjunction and vice versa.

Primitive Propositions

The propositions used to prove theorems in Boolean Algebra. They may use either
Conjunction or Disjunction. They show the following:

Operational Assumptions:

Existence of complements, sums, products.

Existential Assumptions:

Existence of Universe Class, Null Class, more than one Class.

Laws of Combination:

Tautology e.g. ‘a + a = a’
Commutation e.g. ‘a + b = b + a’
Association e.g. ‘(a + b) + c = a + (b + c)’
Distribution e.g. ‘a + (b X c) = (a + b) X (a + c)’
Absorption e.g. ‘a + ab = a’

Laws of the Unique Elements:

Universe Class e.g. ‘a + 1 = 1’
Null Class e.g. ‘a + 0 = a’

Laws of Negation:

Complementation e.g. ‘a + -a = 1’
Contraposition ‘a = -b . ⊃ . b = -a’
Double Negation e.g. ‘a = -(-a)’
Expansion e.g. ‘ab + a-b = a’
Duality e.g. ‘-(a + b) = -a X -b’
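These laws can be spot-checked with Python sets standing in for Classes: union for +, intersection for X, and the complement taken relative to a small finite Universe. This is my own illustration, not from Langer:

```python
# Python sets standing in for Classes: | is +, & is X, and the
# complement is taken relative to a small Universe Class I.
I = frozenset(range(4))

def comp(a):
    return I - a

a, b, c = frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})

assert a | a == a                        # Tautology
assert a | b == b | a                    # Commutation
assert (a | b) | c == a | (b | c)        # Association
assert a | (b & c) == (a | b) & (a | c)  # Distribution
assert a | (a & b) == a                  # Absorption
assert a | comp(a) == I                  # Complementation
assert comp(comp(a)) == a                # Double Negation
assert (a & b) | (a & comp(b)) == a      # Expansion
assert comp(a | b) == comp(a) & comp(b)  # Duality
```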


System in Abstracto.

K R, where K is the universe of something and R is some way of relating these things.

Properties of Relations:

Reflexiveness e.g. ‘(a). a R a’
Symmetry e.g. ‘(a, b). a R b . ⊃ . b R a’
Transitivity e.g. ‘(a, b, c): a R b . b R c . ⊃ . a R c’


Propositional Calculus.

Uses a Universe of Propositions which are either True (1) or False (0).

p means ‘p is true’ or ‘p=1’, leading to ‘p=(p=1)’.


Calculus of Elementary Propositions.

Used in Principia Mathematica by Russell & Whitehead. Improves on the flawed notation above.

⊢ means ‘it is asserted that’. e.g. ‘⊢: p ∨ q . ⊃ . q ∨ p’


Function and Argument:

A Proposition consists of a Function and Arguments.

e.g. ϕx instead of p, where ϕ is the function and x is the argument.

We may quantify the argument instead of the whole proposition to show that functions which are not identical are formally equivalent, allowing us to express ‘(x): mortal(x) =


Seeks to create a logical foundation for mathematics.

Recommended Reading:

An Introduction to Symbolic Logic: Susanne K. Langer

Gödel, Escher, Bach: An Eternal Golden Braid, Douglas R. Hofstadter