What are macro-average and micro-average?
Fri, 05/14/2010 - 14:53 — Fuller
Macro-average and micro-average are metrics for evaluating text classifiers.
According to Coping with the News: the machine learning way:

When dealing with multiple classes there are two possible ways of averaging these measures (recall, precision, F1-measure), namely, macro-average and micro-average. The macro-average weights all the classes equally, regardless of how many documents belong to each. The micro-average weights all the documents equally, thus favouring the performance on common classes. Different classifiers will perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
Compared with micro-averaged metrics, macro-averaged metrics are more strongly influenced by small (rare) classes.
The paper 《一种快速高效的文本分类方法》 ("A Fast and Efficient Text Classification Method") gives several formulas for evaluating text classification performance. For a given category, let a be the number of instances correctly assigned to the category, b the number of instances incorrectly assigned to it, and c the number of instances that belong to the category but were misassigned to other categories. Precision (p), recall (r) and the F-measure are then defined as:

r = a / (a + c), if a + c > 0; otherwise r = 1
p = a / (a + b), if a + b > 0; otherwise p = 1
F_β = (β² + 1) · p · r / (β² · p + r)

The parameter β assigns different weights to precision (p) and recall (r); when β is 1, precision and recall are weighted equally.
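As a quick check of these formulas (the numbers here are invented purely for illustration): if a = 8, b = 2 and c = 4 for some category, then r = 8/12 ≈ 0.667, p = 8/10 = 0.8, and with β = 1, F1 = 2 × 0.8 × 0.667 / (0.8 + 0.667) ≈ 0.727.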
To evaluate an algorithm's performance over an entire dataset, two averaging methods are available, called the macro-average and the micro-average. The macro-average is the arithmetic mean of the per-class performance metrics, while the micro-average is the arithmetic mean of the per-instance (per-document) metrics. For a single instance, precision and recall are identical (either both 1 or both 0), so the micro-averages of precision and recall are equal, and by the F-measure formula the micro-averaged precision, recall and F1 on a given dataset are all the same.
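To make the difference concrete, here is a minimal sketch (the class names and counts are invented for illustration) that computes both averages from per-class (a, b, c) counts as defined above:

# Macro- vs micro-averaged precision/recall/F1 from per-class counts.
# For each class: a = correctly assigned, b = wrongly assigned to it,
# c = belonging to it but assigned elsewhere (as defined above).

def prf(a, b, c, beta=1.0):
    r = a / (a + c) if a + c > 0 else 1.0
    p = a / (a + b) if a + b > 0 else 1.0
    f = (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f

# Hypothetical counts for one common and two rare classes.
# sum(b) == sum(c) here, as in any single-label task, where every
# false positive for one class is a false negative for another.
counts = {"common": (90, 10, 10), "rare1": (1, 9, 5), "rare2": (2, 3, 7)}

# Macro-average: compute per class, then take the plain mean,
# so each class counts equally and rare classes drag the score down.
per_class = [prf(a, b, c) for a, b, c in counts.values()]
macro_p, macro_r, macro_f = (sum(v) / len(v) for v in zip(*per_class))

# Micro-average: pool the counts first, then compute once,
# so each document counts equally and common classes dominate.
micro_p, micro_r, micro_f = prf(*(sum(v) for v in zip(*counts.values())))

print(f"macro: p={macro_p:.3f} r={macro_r:.3f} F1={macro_f:.3f}")
print(f"micro: p={micro_p:.3f} r={micro_r:.3f} F1={micro_f:.3f}")
# micro p, r and F1 all come out equal (0.809), as the text predicts.

With these made-up counts the macro F1 is about 0.44 while the micro F1 is about 0.81, which is exactly the gap the quoted passage warns about: micro-averaging rewards doing well on the common class.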
Coping with the News: the machine learning way
Dr. Rafael A. Calvo [HREF1], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. rafa@ee.usyd.edu.au
Prof. Jae-Moon Lee [HREF1], Hansung University [HREF3], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. jmlee@ee.usyd.edu.au
Abstract
News articles represent some of the most popular and commonly accessed content on the web. This paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles. In this paper, we work with more than 800,000 Reuters news stories and classify them using Naive Bayes and k-Nearest Neighbours approaches. The articles are stored in NewsML format, commonly used for content syndication. The methodology developed would enable a web-based routing system to automatically filter and deliver news to users based on an interest profile.
Introduction
Information overload (in which individuals are faced with an oversupply of content) could become the road rage of the new millennium. In order for this content to become useful information and empower users, rather than frustrate or confuse them, we need novel ways of delivering only what is needed, at the right time and in the right format. News stories are the most important source of up-to-date information about the world around us, and represent some of the most often updated, highest quality content on the web. It is therefore important to develop ways to process news efficiently. In this paper we study machine learning models and apply them to the automatic classification of large numbers of online news articles.
Language technologies have provided excellent tools for information retrieval by exploiting the content being indexed. In the beginnings of the Internet, search engines and classification systems on the web only used information from unstructured text to index and find content (e.g. Altavista); later they used the topology of the net to determine which of the information being indexed was more important (e.g. Google). Meanwhile, the progress achieved in document classification has not been as noticeable, and most web page classification is done manually by thousands of human indexers, in private catalogues or in large-scale ones such as Yahoo! and the Open Directory Project. Instead of using human labour, automatic document classification tools [3,11,12] use statistical models similar to those in information retrieval. These systems can be used to create hierarchical taxonomies of content such as those commonly found on the web, in e-learning systems and even in traditional library information systems.
News story articles are written by reporters from all over the world. These reporters often work for news agencies such as the Associated Press and Reuters. These agencies collect the news, edit it and sell bundles of articles to the periodicals accessed by web users (e.g. news.yahoo, the Sydney Morning Herald, etc.). It is important for both the agencies and the periodicals to have an organized, well-managed stream of news. News stories are normally classified according to taxonomies that are relevant to readers (e.g. "politics", "Iraq" or "Oil"). This classification can be very difficult because it requires human expertise to spot relationships between the taxonomy and the documents. Even the experts do not agree on what should go where, and inter-indexer consistency in classification shows considerable variation [9].
Automatic classification techniques use algorithms that learn from these human classifications, so they can only do as well as the human training data provided. In addition, different algorithms can learn different types of patterns in the data. In order to compare the classification performance of different algorithms, researchers have a set of standardised benchmarks, with a particular dataset and a well-defined task. The most popular classification benchmark during the late nineties was a Reuters collection called Reuters-21578 (based on Reuters-22173), with 21,578 documents that had to be classified into about 100 different categories [4,13]. This benchmark is still used to compare the performance of different algorithms, but the challenge now lies in moving towards larger-scale document classification [8]. In 2002, Reuters released a new research dataset with over 800,000 documents, which we discuss in this paper.
There are several Machine Learning (ML) algorithms that have been used successfully in the past [4,7,11,13]. They include Neural Networks, Naive Bayes, Support Vector Machines (SVM) and k-Nearest Neighbours (kNN). Each of these methods has its advantages and limitations in classification performance and scalability. The choice of algorithm will depend on the application and the amount of data to be used. In web applications, efficiency is of particular importance, since large numbers of users and large amounts of data can make some algorithms unfeasible.
Section 2 of this paper describes the Reuters RCV1 collection used in this project, focusing on the challenges offered by its size and the richness of its XML structure. Section 3 describes the ML algorithms we have used, namely Naive Bayes and k-Nearest Neighbours. Section 4 describes how we have improved a classification framework [12] in order to make it scalable for web applications. Section 5 describes the performance results for the classifier and the improvements in memory and CPU requirements. Section 6 concludes and summarizes our future work in the area.
The Reuters collection, challenges and solutions
The Reuters RCV1 Corpus [9] consists of all English news stories (806,791) published by Reuters between 20/8/1996 and 19/8/1997. The news is stored as files in XML format, using a NewsML Document Type Definition (DTD). NewsML is an open standard developed by the International Press and Telecommunications Council (IPTC). The news is written by approximately 2000 reporters and then classified by Reuters specialists in a number of ways. The classified news articles are then syndicated by websites such as news.yahoo, or by periodicals like the Sydney Morning Herald that may or may not have a website.
Due to seasonal variations, the number of stories per day is not constant. In addition, on weekdays there is an average of 2,880 stories per day, compared to 480 on weekends. Approximately 3.7 GB is required for the storage of the uncompressed XML files.
The NewsML schema contains metadata produced by human indexers about themes and categories. When two humans index a document differently they create inter-indexer variations. These can be measured using a correction rate C = (NC / NE) × 100, where NE is the number of stories indexed by an editor and NC is the number of times the editor has been corrected by a second editor. Untrained editors will normally have a higher C than more experienced ones, but even among experienced editors correction rates of 10% are common. In the RCV1 collection there are correction rates of up to 77%. Since ML algorithms learn from examples (classifications done by humans), correction rates are an important limiting factor on their performance. Performance measures in classification systems are really a measure of how much the system correlates with the human classifiers: it is not possible for a student to be better than the teacher, or at least it is not possible for the teacher to know so.
RCV1 data is stored in XML documents providing the metadata information normally required by news agencies and periodicals that need to deliver the stories to end users. NewsML defines a rich schema, shown simplified in Figure 1: it has entities for title, headline, text, copyright and several types of classification. For our experiments we have used all three available classifications (topic, country and industry) as a single task.
Figure 1: An example of a NewsML document
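To make the structure concrete, here is a minimal sketch of reading the kinds of fields described above (title, headline, text and the three classification schemes) from a NewsML-style document. The element and attribute names below are simplified assumptions for illustration, not the exact RCV1/NewsML schema:

# Sketch: extracting fields from a simplified NewsML-style document.
# Element/attribute names are illustrative assumptions, not the real DTD.
import xml.etree.ElementTree as ET

SAMPLE = """
<newsitem itemid="2286" date="1996-08-20">
  <title>Example title</title>
  <headline>Example headline</headline>
  <text><p>First paragraph of the story.</p></text>
  <metadata>
    <codes class="topic"><code code="C15"/></codes>
    <codes class="country"><code code="AUST"/></codes>
    <codes class="industry"><code code="I81402"/></codes>
  </metadata>
</newsitem>
"""

root = ET.fromstring(SAMPLE)
title = root.findtext("title")
headline = root.findtext("headline")
body = " ".join(p.text or "" for p in root.find("text").iter("p"))

# Collect all three classification schemes as a single label set,
# mirroring how the paper treats the three classifications as one task.
labels = {f'{codes.get("class")}:{code.get("code")}'
          for codes in root.find("metadata").iter("codes")
          for code in codes.iter("code")}

print(title, headline, sorted(labels))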
Web applications will increasingly exploit this type of metadata. As it has
been discussed by researchers studying the concept of the semantic web [2],
the next revolution in the Internet will come when web applications have
access to structured collections of information, and set of inference rules
that can be used to perform automated reasoning. This project aims at
producing such applications.
Language Technologies and document classification
Language technology is an emerging field with researchers from different backgrounds. Mathematicians, linguists and computer scientists are producing theories and software tools that manage content more efficiently. This research includes speech processing, information retrieval and document classification. The models in document classification are similar to those in the other areas, and an extensive literature describes them in detail [4,7,11,14]. For the sake of brevity, in this section we only summarize the basic concepts.
The essence of these classification techniques is that documents and categories can be represented in a high-dimensional vector space. Since classification consists of assigning documents to predefined categories, machine learning techniques have to learn the mapping between the two vector spaces, from documents to categories.
The simplest model to represent a document as a vector is the binary
representation. In this model, a vector has the dimension of the dictionary, and the occurrence of a word puts a 1 in the corresponding element, all
other elements being 0.
The next level of complexity, and the one we have used, is called Term Frequency (TF), since the value of each element is equal to the number of occurrences of the term in the document. If a term appears more often, it is often because it is more important. This is not always the case, since words like articles and prepositions often do not add much to the information value. These terms are called stop-words and are normally eliminated from the document. Another technique to reduce the number of terms is stemming, where words are reduced to their common stem (e.g. "runner" and "runners" are reduced to "run").
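A minimal sketch of these two representations follows; the tiny stop-word list and the deliberately crude suffix-stripping stemmer are simplified stand-ins for illustration, not the preprocessing the paper actually used:

# Sketch: binary and term-frequency (TF) document vectors with
# stop-word removal and a deliberately crude stemmer.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}  # tiny illustrative list

def stem(word: str) -> str:
    # Crude suffix stripping, a stand-in for a real stemmer (e.g. Porter):
    # strip a plural/agent suffix, then undouble a final consonant
    # ("runners" -> "runn" -> "run").
    for suffix in ("ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word

def terms(doc: str) -> list:
    tokens = (w.strip(".,!?\"'").lower() for w in doc.split())
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

doc = "The runner and the runners run on the road."
tf = Counter(terms(doc))          # TF vector (sparse, as a dict)
binary = {t: 1 for t in tf}       # binary vector: presence only

print(tf)      # Counter({'run': 3, 'road': 1})
print(binary)  # {'run': 1, 'road': 1}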
Finally, more elaborate models take into account the idea that terms appearing in many documents do not tell us much about an individual one, so an Inverse Document Frequency (IDF) weight is used. These models require computing statistics across the whole corpus, making them much more computationally expensive. Since our goal in this project was to study scalable techniques that could be used in web applications, we have used the simpler TF vector representation.
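For comparison, one standard variant of this weighting (among several in the literature) is w(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below uses plain whitespace tokenization for brevity; note the extra pass over the whole corpus, which is the cost the paper avoids:

# Sketch: TF-IDF weighting (one common variant). The pass over the
# whole corpus to count document frequencies is the extra cost
# that the plain TF representation avoids.
import math
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]
docs = [Counter(d.split()) for d in corpus]   # plain whitespace TF here

N = len(docs)
df = Counter(t for d in docs for t in d)      # document frequency per term

def tfidf(doc: Counter) -> dict:
    return {t: tf * math.log(N / df[t]) for t, tf in doc.items()}

print(tfidf(docs[0]))  # 'the' gets weight 0: it appears in every document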
Once the documents have been represented as vectors, they can be further reduced using statistical reduction techniques such as χ² or term frequency. These techniques select those terms that have a higher impact on, or correlation with, the classification. We do not discuss them here as several other sources describe them in detail [4,7,11,13].
Machine learning algorithms are trained on a subset of the corpus, and later tested on a different subset to measure their performance. Several ML algorithms have been used successfully in a number of applications [3,11], including Naive Bayes and kNN, both used here.
The success of a classification is often stated with two standard efficacy measures: precision and recall. Precision is the probability that if a random document is classified as part of C, this classification is correct. Recall is the probability that if a random document ought to be classified under C, this classification is actually made. These two quantities are highly and inversely correlated. We can have absolute recall by assigning all documents to all categories, but at the cost of having the worst precision. We can also have high precision by never assigning a document to a category, but this would make the classifier useless.
The trade-off between recall and precision is controlled by setting the classifiers' parameters. Both values should be provided to describe the performance. Another common performance measure is the F1-measure, the harmonic mean of precision and recall: F1 = 2 · p · r / (p + r). When dealing with multiple classes there are two possible ways of averaging these measures, namely, macro-average and micro-average. The macro-average weights all the classes equally, regardless of how many documents belong to each. The micro-average weights all the documents equally, thus favouring the performance on common classes. Different classifiers will perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
Object Buffering and performance
Object-Oriented Application Frameworks (OOAF) are software engineering artefacts that improve reusability of design and implementation [5,6,10].
The classification framework used in this project [HREF4] [12] was designed to provide a consistent way to build document categorization systems. It allows the developer to focus on the quality of the more critical and complex modules by allowing reuse of common, simple base modules. In this project we study how the framework can be used in a news story classification system, and we extend the framework to be more scalable, as required in web applications, by adding an object buffering module.
One of the most serious limitations when training ML algorithms is the amount of data required to achieve satisfactory performance, and the scalability issues that arise when training on such data.
Memory requirements are often the first obstacle when managing large corpora, especially for a text categorization system, where the ML algorithms are trained on very large vector spaces and large amounts of data. Most operating systems deal with this problem by providing virtual memory, but these general-purpose techniques are not always sufficient or appropriate for special applications like ours. In those cases, applications often have an adaptive buffer management tool that allows the application to allocate specific amounts of memory.
Figure 2: Run time object diagram
We have extended our text categorization framework [12] so it can handle much larger corpora. Figure 2 shows the object diagram of the important classes in the run time of the framework. The circles in Figure 2 represent objects and the arrows represent a relationship, meaning a category/document 'HAS-A' feature vector (FV). A document object corresponds exactly to a document in the training or testing sets. A category object corresponds exactly to a category. For RCV1 there are over 800K documents, so the framework needs to manage that number of document objects and the same number of feature vectors. The object buffer can be configured to store as many objects as possible in RAM, requesting the others from the hard disk as they are required. Thus, there is only a constant number of objects in memory, and the system can be optimised for the available resources, independently of the amount of data.
The buffered object module we implemented is described in Figure 3. The module supports a unique persistent Object Identifier (OID) for every object stored in it. The framework then uses this OID instead of the object or its reference. If the framework needs an object stored in the buffered object module, it makes a request to the buffered object manager using the OID. The buffered object manager first searches for the requested object in the object pool. If the object manager finds it, it returns the object's reference; otherwise it requests the object from the buffered file manager, which tries to minimize I/O to the disk. The buffered file manager reads the requested object from the disk and returns it to the buffered object manager. When the buffered object manager receives the object from the buffered file manager, it first stores the received object in the buffer pool and then returns it to the requester in the framework.
Figure 3: Architecture of the buffered object module
In order to optimise memory usage, the buffered object manager uses a constant amount of memory. Our current implementation follows the Least Recently Used (LRU) strategy [1]. In this strategy, common for paging in operating systems, the least recently used (read or written) object (or page) is selected to be taken out of the buffer (paged out). The same rule is also used in other cache systems when they select which cache entry to flush. In future work we plan to add new buffering strategies such as Most Recently Used (MRU) and others. We expect that the best choice of strategy will depend on the Machine Learning algorithm. For example, Naive Bayes might work better with an LRU strategy and kNN with an MRU strategy.
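As a rough illustration of the design described above (the class name, the load_from_disk callback and all other identifiers below are invented for this sketch, not the paper's actual framework), a fixed-capacity, OID-keyed LRU buffer might look like this:

# Sketch: an OID-keyed object buffer with LRU eviction, mirroring the
# buffered object manager / buffered file manager split described above.
# All class and function names here are invented for illustration.
from collections import OrderedDict

class BufferedObjectManager:
    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity              # constant number of objects kept in RAM
        self.load_from_disk = load_from_disk  # stands in for the buffered file manager
        self.pool = OrderedDict()             # OID -> object, ordered by recency

    def get(self, oid):
        if oid in self.pool:
            # Hit: mark as most recently used and return the reference.
            self.pool.move_to_end(oid)
            return self.pool[oid]
        # Miss: ask the file manager for the object, then cache it.
        obj = self.load_from_disk(oid)
        self.pool[oid] = obj
        if len(self.pool) > self.capacity:
            # Evict the least recently used object (the first item).
            self.pool.popitem(last=False)
        return obj

# Usage: feature vectors keyed by document OID, loaded lazily.
manager = BufferedObjectManager(capacity=2, load_from_disk=lambda oid: f"FV({oid})")
manager.get(1); manager.get(2); manager.get(1)
manager.get(3)                 # evicts OID 2, the least recently used
print(list(manager.pool))      # [1, 3]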
With a scalable framework we are able to show the feasibility of automatically classifying large amounts of news stories. Since the news stories in the RCV1 dataset are in NewsML format, the extensions to the framework needed to include more efficient XML parsing. We first tried the Perl and C implementations of the SAX package. Adapting one of the libraries implemented in C and integrating it into the framework produced the best results.
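The appeal of SAX over DOM-style parsing for a corpus this size is that it streams: the document is never fully materialized in memory. Here is a minimal event-driven sketch in Python (using the standard xml.sax module rather than the Perl/C bindings the paper tried, and the same simplified element names as the earlier example):

# Sketch: streaming (SAX-style) extraction of titles from NewsML-like
# input, so a whole document never needs to be held in memory at once.
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False

handler = TitleHandler()
xml.sax.parseString(b"<newsitem><title>Example title</title></newsitem>", handler)
print(handler.titles)  # ['Example title']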
Results
We have tested the extended framework on the Reuters RCV1 data in order
to:
·  Assess the feasibility of automatic document classification for the large-scale management of news stories.
·  Measure the classification performance of the Naive Bayes and kNN classifiers on this new corpus, and find ways to improve it.
·  Measure the improvements over the non-buffered framework in computing time and memory management.

We performed three series of experiments: one to determine how the Naive Bayes and kNN algorithms performed for different amounts of training data, a second to see if using the NewsML structure could improve performance by giving different weights to the title of the story, and finally one to see how they scaled (in computation time) for different amounts of data.
Although the precision of the classification itself should not be modified by