What are macro-average and micro-average?
Fri, 05/14/2010 - 14:53 — Fuller
Macro-average and micro-average are metrics for evaluating text classifiers.
According to Coping with the News: the machine learning way:

When dealing with multiple classes there are two possible ways of averaging these measures (recall, precision, F1-measure), namely, macro-average and micro-average. The macro-average weights all the classes equally, regardless of how many documents belong to each. The micro-average weights all the documents equally, thus favouring the performance on common classes. Different classifiers will perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
Compared with micro-averaged metrics, macro-averaged metrics are more strongly influenced by small (rare) classes.
The paper 《一种快速高效的文本分类方法》 ("A Fast and Efficient Text Classification Method") gives several formulas for evaluating text classification performance. For a given category, let a be the number of instances correctly assigned to the category, b the number of instances incorrectly assigned to it, and c the number of instances that belong to the category but were misassigned to other categories. Precision (p), recall (r) and the F-measure are then defined as:

r = a / (a + c), if a + c > 0; otherwise r = 1
p = a / (a + b), if a + b > 0; otherwise p = 1
F_β = (β² + 1) · p · r / (β² · p + r)

The parameter β assigns different weights to precision (p) and recall (r); when β is 1, precision and recall are weighted equally.
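As a quick check of these formulas (the numbers here are invented purely for illustration): if a = 8, b = 2 and c = 4 for some category, then r = 8/12 ≈ 0.667, p = 8/10 = 0.8, and with β = 1, F1 = 2 × 0.8 × 0.667 / (0.8 + 0.667) ≈ 0.727.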
To evaluate an algorithm's performance over an entire dataset, two averaging methods are available, called the macro-average and the micro-average. The macro-average is the arithmetic mean of the per-class performance metrics, while the micro-average is the arithmetic mean of the per-instance (per-document) metrics. For a single instance, precision and recall are identical (either both 1 or both 0), so the micro-averages of precision and recall are equal, and by the F-measure formula the micro-averaged precision, recall and F1 on a given dataset are all the same.
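To make the difference concrete, here is a minimal sketch (the class names and counts are invented for illustration) that computes both averages from per-class (a, b, c) counts as defined above:

# Macro- vs micro-averaged precision/recall/F1 from per-class counts.
# For each class: a = correctly assigned, b = wrongly assigned to it,
# c = belonging to it but assigned elsewhere (as defined above).

def prf(a, b, c, beta=1.0):
    r = a / (a + c) if a + c > 0 else 1.0
    p = a / (a + b) if a + b > 0 else 1.0
    f = (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f

# Hypothetical counts for one common and two rare classes.
# sum(b) == sum(c) here, as in any single-label task, where every
# false positive for one class is a false negative for another.
counts = {"common": (90, 10, 10), "rare1": (1, 9, 5), "rare2": (2, 3, 7)}

# Macro-average: compute per class, then take the plain mean,
# so each class counts equally and rare classes drag the score down.
per_class = [prf(a, b, c) for a, b, c in counts.values()]
macro_p, macro_r, macro_f = (sum(v) / len(v) for v in zip(*per_class))

# Micro-average: pool the counts first, then compute once,
# so each document counts equally and common classes dominate.
micro_p, micro_r, micro_f = prf(*(sum(v) for v in zip(*counts.values())))

print(f"macro: p={macro_p:.3f} r={macro_r:.3f} F1={macro_f:.3f}")
print(f"micro: p={micro_p:.3f} r={micro_r:.3f} F1={micro_f:.3f}")
# micro p, r and F1 all come out equal (0.809), as the text predicts.

With these made-up counts the macro F1 is about 0.44 while the micro F1 is about 0.81, which is exactly the gap the quoted passage warns about: micro-averaging rewards doing well on the common class.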
Coping with the News: the machine learning way
Dr. Rafael A. Calvo [HREF1], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. rafa@ee.usyd.edu.au
Prof. Jae-Moon Lee [HREF1], Hansung University [HREF3], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. jmlee@ee.usyd.edu.au
Abstract
News articles represent some of the most popular and commonly accessed content on the web. This paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles. In this paper, we work with more than 800,000 Reuters news stories and classify them using Naive Bayes and k-Nearest Neighbours approaches. The articles are stored in NewsML format, commonly used for content syndication. The methodology developed would enable a web-based routing system to automatically filter and deliver news to users based on an interest profile.
Introduction
Information overload (in which individuals are faced with an oversupply of content) could become the road rage of the new millennium. In order for this content to become useful information and empower users, rather than frustrate or confuse them, we need novel ways of delivering only what is needed, at the right time and in the right format. News stories are the most important source of up-to-date information about the world around us, and represent some of the most often updated, highest quality content on the web. It is therefore important to develop ways to process news efficiently. In this paper we study machine learning models and apply them to the automatic classification of large numbers of online news articles.
Language technologies have provided excellent tools for information retrieval by exploiting the content being indexed. In the beginnings of the Internet, search engines and classification systems on the web only used information from unstructured text to index and find content (e.g. Altavista); later they used the topology of the net to determine which of the information being indexed was more important (e.g. Google). Meanwhile, the progress achieved in document classification has not been as noticeable, and most web page classification is done manually by thousands of human indexers, in private catalogues or in large-scale ones such as Yahoo! and the Open Directory Project. Instead of using human labour, automatic document classification tools [3,11,12] use statistical models similar to those in information retrieval. These systems can be used to create hierarchical taxonomies of content such as those commonly found on the web, in e-learning systems and even in traditional library information systems.
News story articles are written by reporters from all over the world. These reporters often work for news agencies such as the Associated Press and Reuters. These agencies collect the news, edit it and sell bundles of articles to the periodicals accessed by web users (e.g. news.yahoo, the Sydney Morning Herald, etc.). It is important for both the agencies and the periodicals to have an organized, well-managed stream of news. News stories are normally classified according to taxonomies that are relevant to readers (e.g. "politics", "Iraq" or "Oil"). This classification can be very difficult because it requires human expertise to spot relationships between the taxonomy and the documents. Even the experts do not agree on what should go where, and inter-indexer consistency in classification shows considerable variation [9].
Automatic classification techniques use algorithms that learn from these human classifications, so they can only do as well as the human training data provided. In addition, different algorithms can learn different types of patterns in the data. In order to compare the classification performance of different algorithms, researchers have a set of standardised benchmarks, with a particular dataset and a well-defined task. The most popular classification benchmark during the late nineties was a Reuters collection called Reuters-21578 (based on Reuters-22173), with 21,578 documents that had to be classified into about 100 different categories [4,13]. This benchmark is still used to compare the performance of different algorithms, but the challenge now lies in moving towards larger-scale document classification [8]. In 2002, Reuters released a new research dataset with over 800,000 documents, which we discuss in this paper.
There are several Machine Learning (ML) algorithms that have been used successfully in the past [4,7,11,13]. They include Neural Networks, Naive Bayes, Support Vector Machines (SVM) and k-Nearest Neighbours (kNN). Each of these methods has its advantages and limitations in classification performance and scalability. The choice of algorithm will depend on the application and the amount of data to be used. In web applications, efficiency is of particular importance, since large numbers of users and large amounts of data can make some algorithms unfeasible.
Section 2 of this paper describes the Reuters RCV1 collection used in this project, focusing on the challenges offered by its size and the richness of its XML structure. Section 3 describes the ML algorithms we have used, namely Naive Bayes and k-Nearest Neighbours. Section 4 describes how we have improved a classification framework [12] in order to make it scalable for web applications. Section 5 describes the performance results for the classifier and the improvements in memory and CPU requirements. Section 6 concludes and summarizes our future work in the area.
The Reuters collection, challenges and solutions
The Reuters RCV1 Corpus [9] consists of all English news stories (806,791) published by Reuters between 20/8/1996 and 19/8/1997. The news is stored as files in XML format, using a NewsML Document Type Definition (DTD). NewsML is an open standard developed by the International Press and Telecommunications Council (IPTC). The news is written by approximately 2000 reporters and then classified by Reuters specialists in a number of ways. The classified news articles are then syndicated by websites such as news.yahoo, or by periodicals like the Sydney Morning Herald that may or may not have a website.
Due to seasonal variations, the number of stories per day is not constant. In addition, on weekdays there is an average of 2,880 stories per day, compared to 480 on weekends. Approximately 3.7 GB is required for the storage of the uncompressed XML files.
The NewsML schema contains metadata produced by human indexers about themes and categories. When two humans index a document differently they create inter-indexer variations. These can be measured using a correction rate C = (NC / NE) × 100, where NE is the number of stories indexed by an editor and NC is the number of times the editor has been corrected by a second editor. Untrained editors will normally have a higher C than more experienced ones, but even among experienced editors correction rates of 10% are common. In the RCV1 collection there are correction rates of up to 77%. Since ML algorithms learn from examples (classifications done by humans), correction rates are an important limiting factor on their performance. Performance measures in classification systems are really a measure of how much the system correlates with the human classifiers: it is not possible for a student to be better than the teacher, or at least it is not possible for the teacher to know so.
RCV1 data is stored in XML documents providing the metadata information normally required by news agencies and periodicals that need to deliver the stories to end users. NewsML defines a rich schema, shown simplified in Figure 1: it has entities for title, headline, text, copyright and several types of classification. For our experiments we have used all three available classifications (topic, country and industry) as a single task.
Figure 1: An example of a NewsML document
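To make the structure concrete, here is a minimal sketch of reading the kinds of fields described above (title, headline, text and the three classification schemes) from a NewsML-style document. The element and attribute names below are simplified assumptions for illustration, not the exact RCV1/NewsML schema:

# Sketch: extracting fields from a simplified NewsML-style document.
# Element/attribute names are illustrative assumptions, not the real DTD.
import xml.etree.ElementTree as ET

SAMPLE = """
<newsitem itemid="2286" date="1996-08-20">
  <title>Example title</title>
  <headline>Example headline</headline>
  <text><p>First paragraph of the story.</p></text>
  <metadata>
    <codes class="topic"><code code="C15"/></codes>
    <codes class="country"><code code="AUST"/></codes>
    <codes class="industry"><code code="I81402"/></codes>
  </metadata>
</newsitem>
"""

root = ET.fromstring(SAMPLE)
title = root.findtext("title")
headline = root.findtext("headline")
body = " ".join(p.text or "" for p in root.find("text").iter("p"))

# Collect all three classification schemes as a single label set,
# mirroring how the paper treats the three classifications as one task.
labels = {f'{codes.get("class")}:{code.get("code")}'
          for codes in root.find("metadata").iter("codes")
          for code in codes.iter("code")}

print(title, headline, sorted(labels))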
Web applications will increasingly exploit this type of metadata. As it has
been discussed by researchers studying the concept of the semantic web [2],
the next revolution in the Internet will come when web applications have
access to structured collections of information, and set of inference rules
that can be used to perform automated reasoning. This project aims at
producing such applications.
Language Technologies and document classification
Language technology is an emerging field with researchers from different backgrounds. Mathematicians, linguists and computer scientists are producing theories and software tools that manage content more efficiently. This research includes speech processing, information retrieval and document classification. The models in document classification are similar to those in the other areas, and an extensive literature describes them in detail [4,7,11,14]. For the sake of brevity, in this section we only summarize the basic concepts.
The essence of these classification techniques is that documents and categories can be represented in a high-dimensional vector space. Since classification consists of assigning documents to predefined categories, machine learning techniques have to learn the mapping between the two vector spaces, from documents to categories.
The simplest model to represent a document as a vector is the binary
representation. In this model, a vector has the dimension of the dictionary, and the occurrence of a word puts a 1 in the corresponding element, all
other elements being 0.
The next level of complexity, and the one we have used, is called Term Frequency (TF), since the value of each element is equal to the number of occurrences of the term in the document. If a term appears more often, it is often because it is more important. This is not always the case, since words like articles and prepositions often do not add much to the information value. These terms are called stop-words and are normally eliminated from the document. Another technique to reduce the number of terms is stemming, where words are reduced to their common stem (e.g. "runner" and "runners" are reduced to "run").
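A minimal sketch of these two representations follows; the tiny stop-word list and the deliberately crude suffix-stripping stemmer are simplified stand-ins for illustration, not the preprocessing the paper actually used:

# Sketch: binary and term-frequency (TF) document vectors with
# stop-word removal and a deliberately crude stemmer.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}  # tiny illustrative list

def stem(word: str) -> str:
    # Crude suffix stripping, a stand-in for a real stemmer (e.g. Porter):
    # strip a plural/agent suffix, then undouble a final consonant
    # ("runners" -> "runn" -> "run").
    for suffix in ("ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word

def terms(doc: str) -> list:
    tokens = (w.strip(".,!?\"'").lower() for w in doc.split())
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

doc = "The runner and the runners run on the road."
tf = Counter(terms(doc))          # TF vector (sparse, as a dict)
binary = {t: 1 for t in tf}       # binary vector: presence only

print(tf)      # Counter({'run': 3, 'road': 1})
print(binary)  # {'run': 1, 'road': 1}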
Finally, more elaborate models take into account the idea that terms appearing in many documents do not tell us much about an individual one, so an Inverse Document Frequency (IDF) weight is used. These models require computing statistics across the whole corpus, making them much more computationally expensive. Since our goal in this project was to study scalable techniques that could be used in web applications, we have used the simpler TF vector representation.
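For comparison, one standard variant of this weighting (among several in the literature) is w(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below uses plain whitespace tokenization for brevity; note the extra pass over the whole corpus, which is the cost the paper avoids:

# Sketch: TF-IDF weighting (one common variant). The pass over the
# whole corpus to count document frequencies is the extra cost
# that the plain TF representation avoids.
import math
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]
docs = [Counter(d.split()) for d in corpus]   # plain whitespace TF here

N = len(docs)
df = Counter(t for d in docs for t in d)      # document frequency per term

def tfidf(doc: Counter) -> dict:
    return {t: tf * math.log(N / df[t]) for t, tf in doc.items()}

print(tfidf(docs[0]))  # 'the' gets weight 0: it appears in every document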
Once the documents have been represented as vectors, they can be further reduced using statistical reduction techniques such as χ² or term frequency. These techniques select those terms that have a higher impact on, or correlation with, the classification. We do not discuss them here as several other sources describe them in detail [4,7,11,13].
Machine learning algorithms are trained on a subset of the corpus, and later tested on a different subset to measure their performance. Several ML algorithms have been used successfully in a number of applications [3,11], including Naive Bayes and kNN, both used here.
The success of a classification is often stated with two standard efficacy measures: precision and recall. Precision is the probability that if a random document is classified as part of C, this classification is correct. Recall is the probability that if a random document ought to be classified under C, this classification is actually made. These two quantities are highly and inversely correlated. We can have absolute recall by assigning all documents to all categories, but at the cost of having the worst precision. We can also have high precision by never assigning a document to a category, but this would make the classifier useless.
The trade-off between recall and precision is controlled by setting the classifiers' parameters. Both values should be provided to describe the performance. Another common performance measure is the F1-measure, the harmonic mean of precision and recall: F1 = 2 · p · r / (p + r). When dealing with multiple classes there are two possible ways of averaging these measures, namely, macro-average and micro-average. The macro-average weights all the classes equally, regardless of how many documents belong to each. The micro-average weights all the documents equally, thus favouring the performance on common classes. Different classifiers will perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
Object Buffering and performance
Object-Oriented Application Frameworks (OOAF) are software engineering artefacts that improve reusability of design and implementation [5,6,10].
The classification framework used in this project [HREF4] [12] was designed to provide a consistent way to build document categorization systems. It allows the developer to focus on the quality of the more critical and complex modules by allowing reuse of common, simple base modules. In this project we study how the framework can be used in a news story classification system, and we extend the framework to be more scalable, as required in web applications, by adding an object buffering module.
One of the most serious limitations when training ML algorithms is the amount of data required to achieve satisfactory performance, and the scalability issues that arise when training on such data.
Memory requirements are often the first obstacle when managing large corpora, especially for a text categorization system, where the ML algorithms are trained on very large vector spaces and large amounts of data. Most operating systems deal with this problem by providing virtual memory, but these general-purpose techniques are not always sufficient or appropriate for special applications like ours. In those cases, applications often have an adaptive buffer management tool that allows the application to allocate specific amounts of memory.
Figure 2: Run time object diagram
We have extended our text categorization framework [12] so it can handle much larger corpora. Figure 2 shows the object diagram of the important classes in the run time of the framework. The circles in Figure 2 represent objects and the arrows represent a relationship, meaning a category/document 'HAS-A' feature vector (FV). A document object corresponds exactly to a document in the training or testing sets. A category object corresponds exactly to a category. For RCV1 there are over 800K documents, so the framework needs to manage that number of document objects and the same number of feature vectors. The object buffer can be configured to store as many objects as possible in RAM, requesting the others from the hard disk as they are required. Thus, there is only a constant number of objects in memory, and the system can be optimised for the available resources, independently of the amount of data.
The buffered object module we implemented is described in Figure 3. The module supports a unique persistent Object Identifier (OID) for every object stored in it. The framework then uses this OID instead of the object or its reference. If the framework needs an object stored in the buffered object module, it makes a request to the buffered object manager using the OID. The buffered object manager first searches for the requested object in the object pool. If the object manager finds it, it returns the object's reference; otherwise it requests the object from the buffered file manager, which tries to minimize I/O to the disk. The buffered file manager reads the requested object from the disk and returns it to the buffered object manager. When the buffered object manager receives the object from the buffered file manager, it first stores the received object in the buffer pool and then returns it to the requester in the framework.
Figure 3: Architecture of the buffered object module
In order to optimise memory usage, the buffered object manager uses a constant amount of memory. Our current implementation follows the Least Recently Used (LRU) strategy [1]. In this strategy, common for paging in operating systems, the least recently used (read or written) object (or page) is selected to be taken out of the buffer (paged out). The same rule is also used in other cache systems when they select which cache entry to flush. In future work we plan to add new buffering strategies such as Most Recently Used (MRU) and others. We expect that the best choice of strategy will depend on the Machine Learning algorithm. For example, Naive Bayes might work better with an LRU strategy and kNN with an MRU strategy.
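As a rough illustration of the design described above (the class name, the load_from_disk callback and all other identifiers below are invented for this sketch, not the paper's actual framework), a fixed-capacity, OID-keyed LRU buffer might look like this:

# Sketch: an OID-keyed object buffer with LRU eviction, mirroring the
# buffered object manager / buffered file manager split described above.
# All class and function names here are invented for illustration.
from collections import OrderedDict

class BufferedObjectManager:
    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity              # constant number of objects kept in RAM
        self.load_from_disk = load_from_disk  # stands in for the buffered file manager
        self.pool = OrderedDict()             # OID -> object, ordered by recency

    def get(self, oid):
        if oid in self.pool:
            # Hit: mark as most recently used and return the reference.
            self.pool.move_to_end(oid)
            return self.pool[oid]
        # Miss: ask the file manager for the object, then cache it.
        obj = self.load_from_disk(oid)
        self.pool[oid] = obj
        if len(self.pool) > self.capacity:
            # Evict the least recently used object (the first item).
            self.pool.popitem(last=False)
        return obj

# Usage: feature vectors keyed by document OID, loaded lazily.
manager = BufferedObjectManager(capacity=2, load_from_disk=lambda oid: f"FV({oid})")
manager.get(1); manager.get(2); manager.get(1)
manager.get(3)                 # evicts OID 2, the least recently used
print(list(manager.pool))      # [1, 3]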
With a scalable framework we are able to show the feasibility of automatically classifying large amounts of news stories. Since the news stories in the RCV1 dataset are in NewsML format, the extensions to the framework needed to include more efficient XML parsing. We first tried the Perl and C implementations of the SAX package. Adapting one of the libraries implemented in C and integrating it into the framework produced the best results.
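The appeal of SAX over DOM-style parsing for a corpus this size is that it streams: the document is never fully materialized in memory. Here is a minimal event-driven sketch in Python (using the standard xml.sax module rather than the Perl/C bindings the paper tried, and the same simplified element names as the earlier example):

# Sketch: streaming (SAX-style) extraction of titles from NewsML-like
# input, so a whole document never needs to be held in memory at once.
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False

handler = TitleHandler()
xml.sax.parseString(b"<newsitem><title>Example title</title></newsitem>", handler)
print(handler.titles)  # ['Example title']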
Results
We have tested the extended framework on the Reuters RCV1 data in order
to:
·  Assess the feasibility of automatic document classification for the large-scale management of news stories.
·  Measure the classification performance of the Naive Bayes and kNN classifiers on this new corpus, and find ways to improve it.
·  Measure the improvements over the non-buffered framework in computing time and memory management.

We performed three series of experiments: one to determine how the Naive Bayes and kNN algorithms performed for different amounts of training data, a second to see if using the NewsML structure could improve performance by giving different weights to the title of the story, and finally one to see how they scaled (in computation time) for different amounts of data.
Although the precision of the classification itself should not be modified by