Statistical classification

You don't need to be Editor-In-Chief to add or edit content to WikiDoc. You can begin to add to or edit text on this WikiDoc page by clicking on the edit button at the top of this page. Next enter or edit the information that you would like to appear here. Once you are done editing, scroll down and click the Save page button at the bottom of the page.

Jump to: navigation, search

Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.

Formally, the problem can be stated as follows: given training data \{(\mathbf{x_1},y_1),\dots,(\mathbf{x_n}, y_n)\} produce a classifier h:\mathcal{X}\rightarrow\mathcal{Y} which maps an object \mathbf{x} \in \mathcal{X} to its classification label y \in \mathcal{Y}. For example, if the problem is filtering spam, then \mathbf{x_i} is some representation of an email and y is either "Spam" or "Non-Spam".

Statistical classification algorithms are typically used in pattern recognition systems.

Note: in community ecology, the term "classification" is synonymous with what is commonly known (in machine learning) as clustering. See that article for more information about purely unsupervised techniques.

Contents

Statistical classification techniques

While there are many methods for classification, they are solving one of three related mathematical problems

  • The first is to find a map of a feature space (which is typically a multi-dimensional vector space) to a set of labels. This is equivalent to partitioning the feature space into regions, then assigning a label to each region. Such algorithms (e.g., the nearest neighbour algorithm) typically do not yield confidence or class probabilities, unless post-processing is applied. Another set of algorithms to solve this problem first apply unsupervised clustering to the feature space, then attempt to label each of the clusters or regions.
  • The second problem is to consider classification as an estimation problem, where the goal is to estimate a function of the form
P({\rm class}|{\vec x}) = f\left(\vec x;\vec \theta\right)

where the feature vector input is \vec x, and the function f is typically parameterized by some parameters \vec \theta. In the Bayesian approach to this problem, instead of choosing a single parameter vector \vec \theta, the result is integrated over all possible thetas, with the thetas weighted by how likely they are given the training data D:

P({\rm class}|{\vec x}) = \int f\left(\vec x;\vec \theta\right)P(\vec \theta|D) d\vec \theta


Examples of classification algorithms include:

An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers). Van der Walt and Barnard (see reference section) investigated very specific artificial data sets to determine conditions under which certain classifiers perform better and worse than others.

Classifier performance depend greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the No-free-lunch theorem). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is however still more an art than a science.

The most widely used classifiers are the Neural Network (Multi-layer Perception), Support Vector Machines, k-Nearest Neighbours, Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree and RBF classifiers.

Application domains

References

  • C.M. van der Walt and E. Barnard,“Data characteristics that determine classifier performance”, in Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp.160-165, 2006.

External links

See also

de:Klassifikationsverfahren lt:Klasifikavimo algoritmaith:การแบ่งประเภทข้อมูล


WikiDoc Help Menu

Quick Start..

Editing basics

Advanced editing

Communicating your edits

Help Videos You Can Watch

Acknowledgement and Attribution Regarding Sources of Content

Some of the initial content on this page may be incorporated in part from copyleft sources in the public domain including wikis such as Wikipedia and AskDrWiki. Drug information for patients came from the The National Library of Medicine. Infectious disease information may have come from the Centers for Disease Control (CDC). Differential Diagnoses are drawn from clinicians as well as an amalgamation of 3 sources: 1.The Disease Database; 2. Kahan, Scott, Smith, Ellen G. In A Page: Signs and Symptoms. Malden, Massachusetts: Blackwell Publishing, 2004:3; 3. Sailer, Christian, Wasner, Susanne. Differential Diagnosis Pocket. Hermosa Beach, CA: Borm Bruckmeir Publishing LLC, 2002:7 .