Probability Notes
From FraudWiki
The following is a quick gallop through some important topics on Probability.
Contents
|
Probability
There are two main definitions of probability:
- Objective (aka Frequentist): - The probability of an event is a real property that can be measured by the relative frequency of its occurrence:
where
is the frequency of x in n independent trials.
- Subjective (aka Bayesian): - The probability of an event is a measure of an observer's degree of belief or uncertainty that the event will occur rather than having any external significance.
There are other definitions such as that due to Bruno de Finetti but it should be noted that all are governed by the same rules of probability and that in the following discussion no assumptions are made about which definition is most appropriate.
Conditional Probability
The conditional probability of an event A given that B has occurred is defined as:
and where
then A is independent B.
and given the independence of x
In the context of credit-card fraud we have a binary random variable Z = {legal,fraud}
and in much that follows it is important to note that although P(legal) + P(fraud) = 1.0
in general
for some variable x and where the events legal and fraud are independent of x then the sum is 1.0
This is a necessary but not a sufficient condition for independence (see next section)
Total Probability Theorem
If we assume that the event A is drawn from a binary set of mutually-exclusive events like {legal, fraud} then we know thatP(legal) + P(fraud) = P(A) + P(!A) = 1. However, as we also know that either legal or fraud must occur we can also say that P(A | B) + (!A | B) = 1
So, as
and equally
we can write
and then using P(A | B)P(B) = P(B | A)P(A) we have P(B | A)P(A) + P(B | !A)P(!A) = P(B)
This result is true because we imposed the two conditions on A:
1 It is a set of mutually exclusive events
2 One event must occur
We can then generalise from the two-valued event to a multi-valued event A so that
Chain rule of conditional probabilities
From the definition of conditional probability
and hence more generally we can write:
Conditional Independence
An event A is defined to be conditionally independent of event C given B if
from this definition we can easily see that
Marginalisation
Marginalisation is a very useful technique for getting rid of irrelevant random variables from a computation. By using the total probability theorem derived in A.1.4
Bayes Rule
From the definition of conditional probability we have:
and so can write:
which is the basic form of Bayes rule.
Bayes Rule is often interpreted in terms of updating our belief about a hypothesis A in the light of new evidence B. So our posterior belief P(A|B) is calculated by multiplying our prior belief P(A) by the likelihood P(B|A) that B will occur if A is true.
The basic form of Bayes Rule can be caste into a more generally useful form by using the total probability theorem discussed above.
which gives
Naïve Bayesian Classification
The Naïve Bayesian Classifier seeks to assign a vector X to a class C
If we assume that each component of the vector X is conditionally independent (see A.1.6) then we can write:
and hence we have
The P(Xi | C) are usually very easy to estimate from the training data. The Naïve Bayesian Classifier has been the subject of a large number of empirical studies in various domains and has been found to be surprisingly effective despite the over simplification inherent in the assumption of conditional independence.
Bayesian Networks
A Bayesian Network (BN) models the causal relationships of a system or dataset and provides a graphical representation of this causal structure though the use of Directed Acyclic Graphs(DAGs). The DAG representation then provides a framework for inference and prediction.
Consider the BN below:
This represents a causal relationship between 3 random variables A, B and C whereby there is a relationship
and
but there is no direct relationship between B and C.
The existence of the causal relationship
is represented by the arc/edge between the nodes A and B while the strength of the relationship is represented by the conditional probability P(B | A)
Conditional Independence
Conditional independence is fundamental to the Bayesean Network formalism. It is the central concept from which flows both their causal structure and their ability to reduce the complexity of its representation.
As stated above, B and C are conditionally independent. We can therefore write
and equally
A more concrete example of this might be:
Here we have a relationship between fraud and the purchase of fuel and one between fraud and age. This BN also captures that fact that we expect fuel and age to be independent of each other.
Consider the introductory example. Assume we have no prior knowledge about the state of A but we know the state of B. Using Bayes Rule we can infer some knowledge about A and hence increase our knowledge about the state of C. There is then a relationship between the state of B and the state of C. They are not independent. However, if we now assume prior knowledge about the state of A then whatever we know about B our knowledge of the state of A prevents this from being propagated to C. In this case B and C are independent.
In summary. B and C are conditionally independent given A.
Conditional independence also allows us to prune the number of probability distributions needed to model a process. In general, a system characterised by n random variables will require textstyle2n − 1 probability distributions to model it completely. So in the case of the simple 3 variable system above we could need 7. However, by noting the conditional independence of the variables B and C we can reduce this to just 3. This can be seen by considering the joint distribution, which by the chain-rule, can be written:
and which can then be simplified using the conditional independence formulae above to just
From this we can see that the root node’s probability P(A) is the prior probability that then feeds the conditional probabilities of its child node’s B and C.
This leads us to the formal definition of a Bayesean Network
Definition of a Baysean Network
In general we have by rewriting the chain-rule:
as
then by definition we require that:
P(Xi | Parents(Xi) = P(Xi | Xi − 1,...,X1) whereXi is the random variable for whichxi is an instance.
This condition can be satisfied if the following is true:
1. There is an ordering such that
2.
3. Xi
4. is conditionally independent of all but its parents
The representation of a BN as a Directed Acyclic Graph ensures these conditions are met.
Types of connection and d-separation
There are three basic types of connection between nodes each representing forms of conditional independence/dependence which motivates the concept of d-separation. Consider in each of the cases below the consequences of introducing evidence about the state of node A, B or C
Divergence
This is the example considered in the introduction above.
B and C are conditionally independent given A B and C are d-separated given A
Linear
Knowing the state of A blocks any influence of B on C. Conversely, just knowing B allows us to propagate our beleif through A to C.
B and C are d-separated given A
Convergent
Knowing the state of A means we can infer knowledge about both B and C. B and C are conditionally dependent on A. If we know nothing about A then B and C are independent.
Inference To illustrate how inference proceeds in a Bayesean network consider the example below adapted from Heckerman (1999)
If fraud is a random variable wherefraud = (true,false) and we define J=Jewellery, F=Fuel, S=Sex, A=Age
Then using Bayes Rule we can write:
Because the states of fraud are mutually exclusive and exhaustive the denominator can be re-written using the Total Probability Theorem giving:
then applying the Chain Rule we have:
then using the conditional independencies resulting from the causal structure of the BN illustrated above we can simplify this equation to:
we can simplify this equation further as P(S) and P(A) will cancel indicating that the prior probabilities for age and sex make no contribution to the computation of the probability of fraud. So we finally have
All the values in this equation are known from the probability distributions of the nodes in the Bayesean Network and hence the posterior probability for fraud can be calculated.
Network Discovery
The problem of discovering the causal structure of a Baysean Network is NP-hard (ie. it doesn’t get any harder!). Essentially, we need to discover the Directed Ascyclic Graph ( DAG) that best models the data. However, the number of DAGs for N variables is super-exponential in N. Given the many random variables that characterise a credit-card transaction then the search-space is too big for any automated discovery. We can reduce the search-space in may ways by introducing constraints: a. Specifying the ordering of the nodes which dictates the hierarchy of dependency in the BN b. Use a priori knowledge about dependencies to build a prototype BN which is then optimised. Dependencies may be suggested by first exploring the data using Association Rule discovery (see A.4) c. Use Monte Carlo methods to explore a bounded set of possiblities
Association Rules
Association rules have their origins in so called Basket Analysis. By examining the contents of shopping baskets as they pass through a checkout it is possible to determine associations between products that tend to be present in the basket at the same time. An association rule expresses this more formally as:
An association rule is expressed in terms of confidence and support as follows:
and
An association rule is semantically weak. Nothing can be inferred from the existence of an association rule other than the association exists. This is often all that is required and they are useful for exploratory mining of data. This should be contrasted to a causal model of the data like a Bayesian network.
Maximum Entropy Methods
Entropy
Entropy can be interpreted in many different ways. Here we view entropy as a measure of the uncertainty (or information) in a system.
Given a discrete system X which consisting of n components x each able to be in one of s micro-states so that
then the number of macro-states of the system X is simply
. C could be regarded as a measure of the ability of the system to store or represent information, so we would expect that by doubling the number of components x to 2n that the system would be able to store twice the information. This leads us to define the information as the log of C which is then linear in n, so that:
Now if the probability of each micro-state is equally likely then p = 1/s and we can write:
More generally the probability of each micro-state is not equal and each state
will have a probability
For each micro-state we can write
where
is the expected number of components in state
.
Then, as for large n we have
, we have
and can write
This is the well known Shannon entropy measure. There are others measures due to Renyi, Tsallis and others.
Entropy and Random Variables
The entropy for a random variable X can therefore be defined as
where
recalling that the expected value of an arbitrary function of X is
then the entropy can be seen to be:
where
is the Information function.
Simple example of entropy
Consider a random variable X that can take just two values with probabilities p and q so that
then the entropy H can be written
If we low look for the maximum of this equation then we have:
and hence
and
This example illustrates how by maximising the entropy we maximise the uncertainty given no other information. So that here the probability is uniformly distributed among the possible outcomes.
Joint Entropy
Joint entropy is simply a generalisation of the definition of entropy and is written
Conditional Entropy
Is the average over Y of the conditional entropy of X given
and measures the uncertainty that remains about X once Y is known.
For a particular value of
we can write
as we know Y we can write a weighted average over all Y as
Some properties of Entropy
Mutual Information
Entropic distance
Given two probability distributions p(X) and q(X) then the entropic distance between them is defined as the difference between their joint-entropy and their mutual-information
It satisfies the usual requirements of a metric:
Kullback-Leibler Distance
Also know as the relative entropy
Maximum Entropy Methods
In section A.5.2.1 we saw a very simple example of how maximising the entropy yielded a probability distribution with the greatest uncertainty. Given the available information about the distribution this was the best that could be achieved.
When modelling a random process we seek the probability distribution which fits the training data most uniformly subject to known constraints.
Consider the conditional probability distribution P(fraud | MCC)
Initially, given no sample data to constrain the problem, the ME solution would be the uniform distribution so that the probability of fraud would be the same for each Merchant Category Code (MCC) and would simply be 1/N where N is the number of MCCs.
To include the sample data we introduce the idea of a feature function:
Feature functions
Given a set of random variables textstyleX1...Xn which characterise a transaction, such as for instance Merchat Category Code (MCC), amount, time-of-day, … then a feature function is defined as any function that maps a subset of these variables to [0,1]. So we could have for example:
or identity mappings like
In general then, a feature is a binary valued function that abstracts aspects of the underlying data and is generally written
where
Constraints and Features
The Maximum Entropy Method (MEM) seeks a probability distribution which maximises the entropy subject to constraints. The constraints are formulated by requiring the expected value of the feature function to be the same as that derived from the training data.
So for a two variable feature we would have
Maximum Entropy Principal
We should select a probability distribution from the set of all probability distributions P which satisfy the constraints specified.
p * = argmaxH(p) where
This is a constrained optimisation problem which can be addressed using the method of Lagrange multipliers. While not difficult we do not go through this exercise here.
It can be shown that
is always well-defined and must be of the form:
where q is a scaling constant, k is the number of constraints and each a corresponds to one feature and can be regarded as a weight for that feature.
It should be noted that the literature often presents a different but equivalent form by substituting
we can then write
Parameter Estimation
Estimation of the parameters a or l is usually done using the Generalise Iterative Scaling (GIS) algorithm or the Improved Iterative Scaling (IIS) algorithm.
It has been suggested recently by Malouf that variable-metric and conjugate gradient descent methods can be more efficient for estimating parameters.
Kernel Density Estimation
This is a powerful technique when estimating Probability Density Functions (PDF) for sparse data as is the case for fraud sample populations. It is very similar to so called Radial Basis Functions.
It is a non-parametric estimator of a PDF as it makes no assumptions about the form.
A kernel function
is any positive-definite function that satisfies
The estimator for a function f(x) would then be
where n is the number of data points
X is a sample data point
h is the so called bandwidth
There as several common forms of K(x)
| Name | Properties | K(x) |
| Normal | nice and smooth, infinitely differentiable, expensive to compute especially in a real-time or near real-time environment (min 5 arithmetic ops + an exponential) |
|
| Triangular | fast to compute, spikey, implies slow convergence | |
| Epanechnikov | inverted parabola, nice and smooth, except at its boundaries, very cheap to compute (few arithmetic ops) very useful because it is zero outside its width (so it has no tails)good convergence properties | where and
|
Software
- R-Project Was GNU S - A language and environment for statistical computing and graphics
- Goose C++ library for statistical computation and presentation
- WEKA Java Data Mining toolkit
- Picalo Python based data-mining framework for Detection - enter the DetectLet
See Also
References
- Brause, R (1999) Credit Card Fraud Detection by Adaptive Neural Data Mining.
- Brause, R (1999) Neural Data Mining for Credit Card Fraud Detection.
- Charniak, E (1991). Bayesian Networks without Tears. AI Magazine
- Heckerman, D (1999). A tutorial on learning with Bayesean Networks.
- Malouf, R (2002). A comparison of algorithms for maximum entropy parameter estimation
- Pearl, J (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc.
- Siverman, B (1986). Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, London, Chapman Hall
- Stephenson, T (2000). An Introduction to Bayesian Network Theory and Usage, IDAP Research Report 00-03
- Whittaker, J (1990). Graphical Models in Applied Multivariate Statistics. John Wiley & Sons Ltd, Chichester, UK.
- Wolf, D (1994). Mutual Information as a Bayesian Measure of Independence
where
and
