Graduation Thesis (Design)
Foreign Literature Translation
Original Text
Improved Use of Continuous Attributes in C4.5
J. R. Quinlan
Basser Department of Computer Science
University of Sydney, Sydney Australia 2006
Abstract
A reported weakness of C4.5 in domains with continuous attributes is addressed by modifying the formation and evaluation of tests on continuous attributes. An MDL-inspired penalty is applied to such tests, eliminating some of them from consideration and altering the relative desirability of all tests. Empirical trials show that the modifications lead to smaller decision trees with higher predictive accuracies. Results also confirm that a new version of C4.5 incorporating these changes is superior to recent approaches that use global discretization and that construct small trees with multi-interval splits.
1. Introduction
Most empirical learning systems are given a set of pre-classified cases, each described by a vector of attribute values, and construct from them a mapping from attribute values to classes. The attributes used to describe cases can be grouped into continuous attributes, whose values are numeric, and discrete attributes with unordered nominal values. For example, the description of a person might include weight in kilograms, with a value such as 73.5, and color of eyes whose value is one of `brown', `blue', etc.
C4.5 (Quinlan, 1993) is one such system that learns decision-tree classifiers. Several authors have recently noted that C4.5's performance is weaker in domains with a preponderance of continuous attributes than for learning tasks that have mainly discrete attributes. For example, Auer, Holte, and Maass (1995) describe T2, a system that searches for good two-level decision trees, and comment:
"The accuracy of T2's trees rivaled or surpassed C4.5's on 8 of the [15] datasets, including all but one of the datasets having only continuous attributes."
Discussing the effect of replacing continuous attributes by discrete attributes, each of whose values corresponds to an interval of the continuous attribute, Dougherty, Kohavi, and Sahami (1995) write: "C4.5's performance was significantly improved on two datasets ... using the entropy discretization method and did not significantly degrade on any dataset. ... We conjecture that the C4.5 induction algorithm is not taking full advantage of possible local discretization."
This paper explores a new version of C4.5 that changes the relative desirability of using continuous attributes. Section 2 sketches the current system, while the following section describes the modifications. Results from a comprehensive set of trials, reported in Section 4, show that the modifications lead to trees that are both smaller and more accurate. Section 5 compares the performance of the new version to results obtained with the two alternative methods of exploiting continuous attributes quoted above.
2. Constructing Decision Trees
C4.5 uses a divide-and-conquer approach to growing decision trees that was pioneered by Hunt and his co-workers (Hunt, Marin, & Stone, 1966). Only a brief description of the method is given here; see Quinlan (1993) for a more complete treatment.
The following algorithm generates a decision tree from a set D of cases:
If D satisfies a stopping criterion, the tree for D is a leaf associated with the most frequent class in D. One reason for stopping is that D contains only cases of this class, but other criteria can also be formulated (see below).
Some test T with mutually exclusive outcomes T_1, T_2, ..., T_k is used to partition D into subsets D_1, D_2, ..., D_k, where D_i contains those cases that have outcome T_i. The tree for D has test T as its root with one subtree for each outcome T_i that is constructed by applying the same procedure recursively to the cases in D_i.
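The recursion above can be made concrete with a short sketch. The following Python fragment is illustrative only: the names Node, stopping_criterion, choose_test, and test.partition are assumptions introduced here, not part of C4.5 itself; it shows the divide-and-conquer skeleton, not Quinlan's actual implementation.

from collections import Counter

class Node:
    # Leaf nodes carry a predicted class; internal nodes carry a test and subtrees.
    def __init__(self, klass=None, test=None):
        self.klass = klass
        self.test = test
        self.children = {}

def build_tree(cases, stopping_criterion, choose_test):
    # cases: list of (attribute_value_dict, class_label) pairs.
    classes = [label for _, label in cases]
    majority = Counter(classes).most_common(1)[0][0]
    if stopping_criterion(cases):
        # The tree for D is a leaf labelled with the most frequent class in D.
        return Node(klass=majority)
    test = choose_test(cases)            # a test maximizing the splitting criterion
    if test is None:
        # No test yields a useful partition (e.g. all tests have zero gain): stop.
        return Node(klass=majority)
    node = Node(test=test)
    # Partition D by test outcome and build one subtree per outcome recursively.
    for outcome, subset in test.partition(cases).items():
        node.children[outcome] = build_tree(subset, stopping_criterion, choose_test)
    return node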
Provided that there are no cases with identical attribute values that belong to different classes, any test T that produces a non-trivial partition of D will eventually lead to single-class subsets as above. However, in the expectation that smaller trees are preferable (being easier to understand and often more accurate predictors), a family of possible tests is examined and one of them chosen to maximize the value of some splitting criterion. The default tests considered by C4.5 are:
A = ? for a discrete attribute A, with one outcome for each value of A.
A ≤ t for a continuous attribute A, with two outcomes, true and false. To find the threshold t that maximizes the splitting criterion, the cases in D are sorted on their values of attribute A to give ordered distinct values v_1, v_2, ..., v_N. Every pair of adjacent values suggests a potential threshold t = (v_i + v_{i+1})/2 and a corresponding partition of D. The threshold that yields the best value of the splitting criterion is then selected.
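As a concrete illustration of the threshold search just described, the Python sketch below enumerates the candidate midpoints for one continuous attribute. The function name candidate_thresholds is hypothetical and simply mirrors the text above.

def candidate_thresholds(values):
    # Sort the distinct values v_1 < v_2 < ... < v_N observed in D.
    distinct = sorted(set(values))
    # Each pair of adjacent values suggests the midpoint as a potential threshold.
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]

# Example: the values 1, 3, 3, 7 have distinct values 1, 3, 7 and therefore
# candidate thresholds 2.0 and 5.0.
print(candidate_thresholds([1, 3, 3, 7]))   # [2.0, 5.0]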
The default splitting criterion used by C4.5 is gain ratio, an information-based measure that takes into account different numbers (and different probabilities) of test outcomes. Let C denote the number of classes and p(D,j) the proportion of cases in D that belong to the jth class. The residual uncertainty about the class to which a case in D belongs can be expressed as

Info(D) = - Σ_{j=1}^{C} p(D,j) × log2( p(D,j) )

and the corresponding information gained by a test T with k outcomes as

Gain(D,T) = Info(D) - Σ_{i=1}^{k} (|D_i| / |D|) × Info(D_i).
The information gained by a test is strongly affected by the number of outcomes and is maximal when there is one case in each subset D_i. On the other hand, the potential information obtained by partitioning a set of cases is based on knowing the subset D_i into which a case falls; this split information,

Split(D,T) = - Σ_{i=1}^{k} (|D_i| / |D|) × log2( |D_i| / |D| ),

tends to increase with the number of outcomes of a test. The gain ratio criterion assesses the desirability of a test as the ratio of its information gain to its split information. The gain ratio of every possible test is determined and, among those with at least average gain, the split with maximum gain ratio is selected.
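The measures above can be restated as a small Python sketch. It assumes each subset is summarized by its vector of class counts; the helper names (info, gain, split_info, gain_ratio) are illustrative rather than C4.5's own.

from math import log2

def info(class_counts):
    # Residual uncertainty Info(D) for a set whose classes occur with these counts.
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total) for n in class_counts if n)

def gain(parent_counts, subset_counts):
    # Gain(D,T): reduction in uncertainty obtained by partitioning D with test T.
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * info(s) for s in subset_counts)
    return info(parent_counts) - remainder

def split_info(subset_counts):
    # Split(D,T): information from knowing which subset D_i a case falls into.
    total = sum(sum(s) for s in subset_counts)
    return -sum(sum(s) / total * log2(sum(s) / total) for s in subset_counts if sum(s))

def gain_ratio(parent_counts, subset_counts):
    s = split_info(subset_counts)
    return gain(parent_counts, subset_counts) / s if s > 0 else 0.0

# Example: 14 cases (9 of one class, 5 of the other) split three ways into
# subsets with class counts (2,3), (4,0) and (3,2); the gain ratio is roughly 0.16.
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))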
In some situations, every possible test splits D into subsets that have the same class distribution. All tests then have zero gain, and C4.5 uses this as an additional stopping criterion.
The recursive partitioning strategy above results in trees that are consistent with the training data, if this is possible. In practical applications data are often noisy: attribute values are incorrectly recorded and cases are misclassified. Noise leads to overly complex trees that attempt to account for these anomalies. Most systems prune the initial tree, identifying subtrees that contribute little to predictive accuracy and replacing each by a leaf.
3. Modified Assessment of Continuous Attributes
We return now to the selection of a threshold for a continuous attribute A. If there are N distinct values of A in the set of cases D, there are N - 1 thresholds that could be used for a test on A. Each threshold gives unique subsets D_1 and D_2 and so the value of the splitting criterion is a function of the threshold. The ability to choose the threshold t so as to maximize this value gives a continuous attribute A an advantage over a discrete attribute (which has no similar parameter that adjusts the partition of D), and also over other continuous attributes that have fewer distinct values in D. That is, the choice of a test will be biased towards continuous attributes with numerous distinct values.
This paper proposes a correction for this bias that consists of two modifications to C4.5. The first of these, inspired by the Minimum Description Length principle (Rissanen, 1983), adjusts the apparent information gain from a test of a continuous attribute. Discussion of this change is prefaced by a brief introduction to MDL.
Following Quinlan and Rivest (1989), let a sender and a receiver both possess an ordered list of the cases in the training data showing each case's attribute values. The sender also knows the class to which each case belongs and must transmit this information to the receiver. He or she first encodes and sends a theory of how to classify the cases. Since this theory might be imperfect, the sender must also identify the exceptions to the theory that occur in the training cases and state how their classes predicted by the theory should be corrected. The total length of the transmission is thus the number of bits required to encode the theory (the theory cost) plus the bits needed to identify and correct the exceptions (the exceptions cost). The sender may have a choice among several alternative theories, some being simple but leaving many errors to be corrected while others are more elaborate but more accurate. The MDL principle may then be stated as: choose the theory that minimizes the sum of the theory and exceptions costs.
MDL thus provides a framework for trading off the complexity of a theory against its accuracy on the training data D. The exceptions cost associated with a set of cases D is asymptotically equivalent to |D| × Info(D), so that |D| × Gain(D,T) measures the reduction in exceptions cost when D is partitioned by a test T. Partitioning D in this way, however, requires transmission of a more complex theory that includes the definition of T. Whereas a test A = ? on a discrete attribute A can be specified by nominating the attribute involved, a test A ≤ t must also include the threshold t; if there are N - 1 possible thresholds for A, this will take an additional log2(N - 1) bits. The first modification is to "charge" this increased cost associated with a test on a continuous attribute to the apparent gain achieved by the test, so reducing the (per-case) information Gain(D,T) by log2(N - 1)/|D|.
A test on a continuous attribute with numerous distinct values will now be less likely to have the maximum value of the splitting criterion among the family of possible tests, and so is less likely to be selected. Further, if all thresholds t on a continuous attribute A have adjusted gain that is less than zero, attribute A is effectively ruled out. The consequences of this first change are thus a re-ranking of potential tests and the possible exclusion of some of them.
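A minimal sketch of this penalty follows, assuming the unadjusted gain of the best threshold has already been computed; the function name adjusted_gain is hypothetical and the threshold search itself is omitted.

from math import log2

def adjusted_gain(raw_gain, n_cases, n_distinct_values):
    # First modification: charge log2(N - 1) extra bits for transmitting the
    # threshold, i.e. reduce the per-case gain by log2(N - 1)/|D|.
    if n_distinct_values < 2:
        return float("-inf")     # no usable threshold exists for this attribute
    return raw_gain - log2(n_distinct_values - 1) / n_cases

# Example: a best-threshold gain of 0.05 on 40 cases with 17 distinct values is
# penalized by log2(16)/40 = 0.1, so the adjusted gain is negative and the
# attribute is effectively excluded at this node.
print(adjusted_gain(0.05, 40, 17))   # -0.05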
The second change is much more straightforward. Recall that the gain ratio criterion divides the apparent gain by the information available from a split. This latter varies as a function of the threshold t and is maximal when there are as many cases above t as below. If the gain ratio criterion is used to select t, the effect of the penalty described above will also vary with t, having the least impact when t divides the cases equally. This seems to be an unnecessary complication, so the threshold t is chosen instead to maximize gain. Once the threshold is chosen, however, the final selection of the attribute to be used for the test is still made on the basis of the gain ratio criterion using the adjusted gain.
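Putting the two modifications together for one continuous attribute gives roughly the following procedure, shown here as a self-contained sketch for a two-class problem; the helper names and the exact bookkeeping are assumptions introduced for illustration, not C4.5 source code. The threshold is picked by raw gain, and the value finally compared against other attributes is the gain ratio computed from the penalized gain.

from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def evaluate_continuous(values, labels):
    cases = sorted(zip(values, labels))
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None, 0.0          # a single value in D: no test is possible
    base, n = entropy(labels), len(labels)
    best = None
    # Second modification: choose the threshold t that maximizes (unadjusted) gain.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [c for v, c in cases if v <= t]
        right = [c for v, c in cases if v > t]
        g = base - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or g > best[1]:
            best = (t, g, left, right)
    t, g, left, right = best
    # First modification: penalize the gain by log2(N - 1)/|D| before ranking.
    g_adj = g - log2(len(distinct) - 1) / n
    split = -sum((len(s) / n) * log2(len(s) / n) for s in (left, right) if s)
    # The attribute competes with others on the gain ratio of its adjusted gain.
    return t, (g_adj / split if split > 0 else 0.0)

# Example: a clean two-class split of six cases is found at t = 3.75.
print(evaluate_continuous([1.0, 2.0, 3.5, 4.0, 6.0, 7.0], [0, 0, 0, 1, 1, 1]))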
Translation
Improved Use of Continuous Attributes in C4.5
J. R. Quinlan
*************.oz.au
Basser Department of Computer Science