附录1 1.2数据库管理系统的结构这一节我们将简要介绍典型的数据库管理系统的结构。我们还要看看DBMS是如何处理用户的查询和其他数据库操作的。最后我们要考虑几个问题在设计既要容纳海量数据又要实现高速查询的数据库管理系统时将会碰到这些问题。但DBMS的实现技术不是本书的主题我们关心的是如何有效地设计和使用数据库。模式更新查询更新“查询” 处理程序存储管理程序数据元数据事务管理程序图1.1 DBMS的主要组成部分1.2.1DBMS的组成概述图1.1给出了数据库管理系统的主要组成部分。最底部表示存储数据的地方。习惯上用圆盘形来表示存储数据的地方。注意我们在这里标注的不仅有“数据”还有“元数
据”metadata——有关数据结构的信息。比如如果这个DBMS是关系的那么元数据就包括关系名、这些关系的属性名以及属性的数据类型如整型或长度为20的字符串。通常数据库管理系统需要维护数据的索引。索引是一种数据结构能帮助我们迅速到数据项并给出其部分值最常见的索引的例子是对于特定关系的某一属性查该属性为给定值的所有元组。例如存放帐号和结余的关系也许有帐号的索引这样我们就能很快地查到某个给定帐号的结余。索引是数据的一部分而说明哪些属性有索引则是元数据的一部分。索引是如何实现的读者可能在数据结构课程中学过哈希Hash 表是建立索引的一种很有效的方法。的确哈希表在早期的DBMS中得到了广泛应用。而现在最常用的数据结构叫做平衡树balanced-tree简记为B-tree。平衡树是平衡二叉检索树的扩展。二叉树的每个节点最多有两个子节点而平衡树的每个节点可以有许多子节点。鉴于平衡树一般出现在磁盘上而不是内存中人们在设计时就让平衡树的每个节点占据一个磁盘声block。由于
典型系统中磁盘声的大小为2124096字节平衡树的每个块可能有几百个指向子节点的指针。所以平衡树的检索很少到三层以上。磁盘操作的真正开销在于访问的磁盘块的多少。因此作为典型的例子只检测三个磁盘的平衡树检索比二叉树检索效率高得多因为后者往往需要访问不同磁盘块上的许多节点。平衡树和二叉树的这种区别从一个侧面说明了磁盘上的最佳数据结构与在内存中运行的算法的最佳数据结构是不同的。在图1.1 中我们还能看到存储管理程序它的任务是从数据存储器获得想要查询的信息并在接到上层的更新请求时更新相应的信息。DBMS的另一个组成部分是查询处理程序不过这个名字有点不太恰当。因为它不仅负责查询而且负责发出更新数据或元数据的请求。它的任务是接受一个操作请求后到最佳的执行方式然后向存储管理程序发出命令使其执行。事务管理程序负责系统的完整性。它必须保证同时运行的若干个查询不互相冲突保证系统在出现系统故障时不丢失数据。事务管理程序要与查询处理程序互相配合因为它必须知道当前查询将要操作的数据以免出现冲突为了避免冲突的发生也许需要延迟一些查询或操作。事务管理程序也与存储管理程序互相配合因为保护数据模式一般需要一个“日志”文件记录历次数据的更新。如果操作顺序正确的话日志文件将会记载更新的记录从而使系统出现故障时根本没有执行的操作在其后能重新执行。在图1.1的最上方是三种类型的DBMS的输入1、查询。查询就是对数据的询问有两种不同的方式a通过通用的查询接口。比如关系数据库管理系统允许用户键入SQL查询语句然后将查询传给查询处理程序并给出回答。b通过应用程序接口。典型的DBMS 允许程序员通过程序应用调用DBMS来查询数据库。比如使用飞机订票系统的代理可能通过运行应用程序查询数据库了解航班的情况。可能过专门的接口提出查询要求接口中也许包括填城市名和时间之类的对话框。通过这种接口你并不能进行任意
的查询但对于合适的查询这种方式通常比直接也SQL语句更容易。2、更新。更新是指更新数据的操作。财查询一样更新操作也可以通过通用的接口或应用程序口来提出。3、模式更新。模式更新命令一般由被授予了一定权限的人使用有时我们称这些人为数据库管理员他们能够更改数据库模式或者建立新的数据库。比如假设查询和报告系统Inquiry and Reporting System IRS要求银行报告顾客的社会保险号和他们的利息那么银行就需要在存放顾客信息的关系中加入一个新的属性——socialSecurityNo社会保险号。1.2.2存储管理程序在简单的数据库系统中存储管理程序也许就是底层操作系统的文件系统。但为了提高效率DBMS往往直接控制磁盘存储器至少在某些情况下是这样。存储管理程序包括两个部分——缓冲区管理程序和文件管理程序。1、文件管理程序对文件在磁盘上的位置保持跟踪并且负责取出一个或几个数据块而数据块中含有缓冲区管理程序所要求的文件。磁盘通常划分成一个个边疆存储的数据块每个数据块能容纳许多字节从212到214大约4000到16000字节之间。2、缓冲区管理程序控制处理主存。它通过文件管理程序从磁盘取得数据块并选择主存的一个页面存放其中的一块。缓冲区管理程序会把数据块在主存中保留一段时间但当另一个数据块需要使用该页面时就把原数据块写回磁盘。当然如果事务管理程序发出请求缓冲区管理程序也会把数据块写回磁盘参见1.2.4节。
1.2.3 查询处理程序查询处理程序的任务是把很高级的语言表示的查询或数据库操作如SQL查询语句转换成对存储器数据——比如某个关系的特定元组或部分索引——的请求序列。通常查询处理任务最困难的部分是查询优化也就是说选择好的查询计划即对存储器系统选择好的请求序列以回答所要求的查询。
例1.2 假设某银行有一个包含两种关系的数据库1、Customers顾客是一张表给出每个顾客的姓名、社会保险号和地址。2、Accounts帐户是一张表给出每个帐户的帐号、结余和户主的社会保险号。注意每个帐户都有一个主户主其社会保险号将用于税款报告当然一个帐户也可能还有其他的户主但这两种关系里不包含此类信息。假设查询是“查以Sally Jones为户主的所有帐户的结余”。查询处理程序必须到在这两种关系上实施的查询方案从而得到查询结果。回答查询所需的步骤越少查询方案就越好。通常开销大的步骤是存储管理程序把磁盘数据块复制到缓冲池的内存页面或由内存页面写回磁盘。因此在评价查询方案的代价时只考虑这些磁盘操作是合理的。为了回答这项查询我们需要从Customers关系是到Sally Jones的社会保险号这里我们假设只有一个顾客叫Sally Jones虽然事实上可能不止一个。然后再从Accounts关系查具有该社会保险号的每个帐户并打印这些帐户的结余。一种简单但代价很高的方案是检查Customers关系的所有元组直到我们在顾客姓名域中到Sally Jones。平均来讲我们需要查半数的元组才能到目标。鉴于银行有许多的顾客Customers关系会占据许多的磁盘块这种做法代价很高。即便我们到了Sally Jones的社会保险号还远没有结束。我们还得查帐户关系中的元组以到指定的社会保险号对应的元组。由于这样的帐户可能有几个我们不得不查所有的元组。典型的银行可能会有许多帐户因而Accounts关系也必将占据许多磁盘块。检查所有的元组付出的代价就太大了。如果Customers关系中有姓名索引那么查询就容易多了。我们只需要使用索引到含有Sally Jones的元组所在的磁盘块而不是查整个Customers关系。正如我们在1.2.1节的方框内所讲的为了到想要的索引典型的平衡索引需要查三个索引磁盘块。
①我们再访问一个磁盘块就可到Sally Jones所在的元组。当然我们还需要做第二
步在Accounts关系中寻Sally Jones的社会保险号对应的帐户。这一步往往需要访问磁盘很多次。但如果针对Accounts关系存在一个社会保险号索引那么检索该索引我们就能到对应于给定社会保险号的帐户所在的每个磁盘块。为了做到这一点就像我们讨论过的检索Customers关系的索引那样必须用两到三次磁盘访问来检索索引。如果所要的元组分布在不同的磁盘块上也许我们需要逐个访问这些磁盘块。但很可能一个Customers没有那么多的帐户因此这步操作可能只包含几次磁盘访问。如果这两个索引都有的话也许通过6—10次磁盘访问我们就能回答上述查询了。如果只有一个索引或者两个索引都没有而不得不用比较差的查询方案那么由于我们要扫描整个关系而关系的规模又很大的话磁盘访问的次数就可能是几百或几千的数量级。也许例1.2会误导出一个结论——查询优化所做的一切不过是使用现有的索引而已。事实上这方面还有很多的内容。复杂的查询往往允许我们改变操作顺序也许会有很多种可能的查询方案一般其数量是查询规模的指数函数。有时我们能选择使用一个索引但不能使用两个。这方面的研究是数据库管理系统实现的重要方面之一但超出了本书的讨论范围。附录2 1.2 The Architecture of a DBMS In this section we shall sketch the structure of a typical database management system. We shall also look at what the DBMS does to process user queries and other database operations. Finally we shall consider some of the problems that come up in designing a DBMS that can maintain large amounts of data and process a high rate of queries. The technology for implementing a DBMS is not th
数据库管理员英文e subject of this book however we concentrate on how databases are designed and used effectively. Schema Modifications Queries Modifications “Query” Processor Storage Manager Data Metadata Transaction Manager Figure1.1:Major components of a DBMS 1.2.1 Overview of DBMS Components
Figure1.1 shows the essential parts of a DBMS. At the bottom we see a represent at of the place where data is stored. By convention disk-shaped components indicate a place for storage of data. Note that we have indicated that this component contains not only data but also metadata which is information about the structure of the data. For example if this DBMS is relational the metadata includes the names of the relations the names of the attributes of those relations and the data types for those integer of character string of length 20. How Indexes Are Implemented The reader may have learned in a course on data structures that a hash table is a very efficient way to build an index. Early DBMS’s did use hash tables extensively. Today the most common data structure is called a B-tree the “B” stands for “balanced.” A B-tree is a generalization of a balanced binary search tree. However while each node of a binary tree has up to two children the B-tree nodes have a large number of children. Given that B-trees normally appear on disk rather than in main memory the B-tree is designed so that each node occupies a full disk block. Since typical systems use disk blocks on the order of 212 bytes 4096 bytes there can be hundreds of pointers to children in a single block of a B-tre
e. there can be hundreds of pointers to children in a single block of a B-tree. Thus search of a B-tree rarely involves more than three levels. The true cost of disk operations generally is proportional to the number of disk blocks accessed. Thus searches of a B-tree which typically examine only three disk blocks are much more efficient than would be a binary-tree search which typically visits nodes found on many different disk blocks. This distinction between
B-trees and binary search trees is but one of many examples where the most appropriate data structure for data soared on disk is different from the data structures used for algorithms that run in main memory. Often a DBMS maintains indexes for the data. An
index is a data structure that helps us find data items quickly given a part of their value the most common example is an index that will find those topples of a particular relation account numbers and balances might have an index on account-number so that we can find the balance given an account number quickly. Indexes are part of the stored data and a description of which attributes have indexes is part of the metadata. In Fig.1.1 we also see a storage manager whose job it is to obtain requested information from the data storage and to modify the information there when requested to by the levels of system above it. We also see a component that we have called the query processor although that name is somewhat of a misnomer. It handles not only queries but requests for modification of the data
or the metadata. Its job is to find the best way to carry out a requested operation and to issue commands to the storage manager that will carry them out. The transaction manager component is responsible for the integrity of the system. It must assure that several queries running simultaneously do not interfere with each other and that the system will not lose data even if there is data is being operated upon by the current queriesin order to avoid conflicting actions and it may need to delay certain queries or operations so that these confliction of data usually involve storing along of changes to the data. By properly ordering operations the log will contain a record of changes so that after a system failure even those changes that never reached the disk can be executed. At the top of Fig.1.1 we see three types of inputs to the DBMS: 1.Queries. These are questions about the data. They are generated in two different ways: aThrough a generic query interface. For example a relational DBMS allows the user to type SQL queries that are passed to the query processor and answered. bThrough application program interfaces. A typical DBMS allows programmers to write application programs that through calls to the DBMS query the database. For example an agent using an airline reservation system is running an application program that queries the database about flight. Availabilities. The queries are submitted through a specialized interface that might include boxes to be filled in with cities times and so on. One cannot ask arbitrary queries through this interface but it is generally easier to ask an appropriate query through this interface than to write the query directly in SQL. 2. Modifications.
These are operations to modify the data. Like queries they can be issued either through the generic interface or through the interface of an application program. 3. Schema Modifications. These commands are usually issued by authorized personnel sometimes called database administrators who are allowed to change the schema of the database or create a new database. For example when the IRS started requiring banks to report interest payments along with each customer’s Social Security number a bank may have had to add a new attribute social Security No to a relation that stored information about customers. 1.2.2 The Storage Manager In a simple database system the storage manager might be nothing more than the file system of the underlying operating system. However for efficiency purposes DBMS’s normally control storage on the disk directly at least under some circumstances. The storage manager consists of two components the buffer manager and the file manager. 1. The file manager keeps track of the location of files on the disk and obtains the block or blocks containing a file on request from the buffer manager. Recall that disks are generally divided into disk blocks which are regions of contiguous storage containing a large number of bytes perhaps 212 or 214about 4000 to 16000 bytes. 2. The buffer manager handles main memory. It obtains blocks of data from the disk via the file manager and chooses a page of main memory in which to store that block. The buffer
manager may keep a disk block in memory is needed for another block. Pages are also returned to disk when the transaction manager requires is see Section 1.2.4. 1.2.3 The Query Manager The job of the query manager is to turn a query or database manipulation. Which may be expressed at a very high as an SQL query into a sequence of requests for stored data such as specific topples of a relation or parts of an index on a relation. Often the hardest part of the query-processing task is query optimization that is the selection of a good query plan or sequence of requests to the storage system that will answer the query. Example 1.2: suppose that a bank has a database with two relations: 1.Customers is a table giving for each account its account number and address. 2.Accunts is a table giving for each account has a principal owner whose Social Security number is used for tax-reporting purposes there may be other owners of an account but these cannot be known from the two relations given here. Suppose also that the query “find the balances of all accounts of which Sal ly Jones is the principal owner” is asked. The query manager must find a query plan to perform of these relations a plan that will yield the answer to the query. The fewer steps taken to answer the query the better the query plan is In general the costly steps are those in which a disk block is copied from the disk into a page of the buffer pool by the storage manager or a page is written back onto the disk. Thus it is reasonable to count only these disk-block operations in evaluating the cost of a query pl an. In order to answer the query we need to examine the customer’s relation to find the Social Security Num
ber of Sally Joneswe assume there is only one customer with that name although in practice there could be several. We then need to examine the Accounts relation to find every account with that Social Security number and print the balances of those accounts. A simple but expensive plan is to examine all the topples rows of the customer’s relation until we find one with Sally Jones as the customer name. On average we shall have to look at half of the topples before we find the one we want. Since a bank will have many customers the customer’s relation will occupy many di.