Information Retrieval: Data Structures & Algorithms
edited by William B. Frakes and Ricardo Baeza-Yates
FOREWORD
PREFACE
CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL SYSTEMS
CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND ALGORITHMS RELATED TO INFORMATION RETRIEVAL
CHAPTER 3: INVERTED FILES
CHAPTER 4: SIGNATURE FILES
CHAPTER 5: NEW INDICES FOR TEXT: PAT TREES AND PAT ARRAYS
CHAPTER 6: FILE ORGANIZATIONS FOR OPTICAL DISKS
CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS
CHAPTER 8: STEMMING ALGORITHMS
CHAPTER 9: THESAURUS CONSTRUCTION
CHAPTER 10: STRING SEARCHING ALGORITHMS
CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY MODIFICATION TECHNIQUES
CHAPTER 12: BOOLEAN OPERATIONS
CHAPTER 13: HASHING ALGORITHMS
file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDobbs_Books_Algorithms_Collection2ed/books/book5/toc.htm (1 of 2)7/3/2004 4:19:10 PM
CHAPTER 14: RANKING ALGORITHMS
CHAPTER 15: EXTENDED BOOLEAN MODELS
CHAPTER 16: CLUSTERING ALGORITHMS
CHAPTER 17: SPECIAL-PURPOSE HARDWARE FOR INFORMATION RETRIEVAL
CHAPTER 18: PARALLEL INFORMATION RETRIEVAL ALGORITHMS
Information Retrieval: FOREWORD
FOREWORD
Udi Manber
Department of Computer Science, University of Arizona

In the not-so-long-ago past, information retrieval meant going to the town's library and asking the librarian for help. The librarian usually knew all the books in his possession, and could give one a definite, although often negative, answer. As the number of books grew--and with them the number of libraries and librarians--it became impossible for one person or any group of persons to possess so much information. Tools for information retrieval had to be devised.

The most important of these tools is the index--a collection of terms with pointers to places where information about them can be found. The terms can be subject matters, author names, call numbers, etc., but the structure of the index is essentially the same. Indexes are usually placed at the end of a book, or, in another form, implemented as card catalogs in a library. The Sumerian literary catalogue, of c. 2000 B.C., is probably the first list of books ever written. Book indexes appeared in a primitive form in the 16th century, and by the 18th century some were similar to today's indexes.

Given the incredible technological advances of the last 200 years, it is quite surprising that today, for the vast majority of people, an index, or a hierarchy of indexes, is still the only available tool for information retrieval! Furthermore, at least from my experience, many book indexes are not of high quality. Writing a good index is still more a matter of experience and art than a precise science.

Why do most people still use 18th-century technology today? It is not because there are no other methods or no new technology. I believe that the main reason is simple: indexes work. They are extremely simple and effective to use for small to medium-size data. As President Reagan was fond of saying, "If it ain't broke, don't fix it."
We read books in essentially the same way we did in the 18th century, we walk the same way (most people don't use small wheels, for example, for walking, although it is technologically feasible), and some people argue that we teach our students in the same way. There is a great comfort in not having to learn something new to perform an old task. However, with the information explosion just upon us, "it" is about to be broken. We not only have an immensely greater amount of information from which to retrieve, we also have much more complicated needs. Faster computers, larger capacity high-speed data storage devices, and higher bandwidth networks will all come along, but they will not be enough. We will need better techniques for storing, accessing, querying, and manipulating information. It is doubtful that in our lifetime most people will read books, say, from a notebook computer, that people will have rockets attached to their backs, or that teaching will take a radical new form (I dare not even venture what form), but it is likely that information will be retrieved in many new ways, by many more people, and on a grander scale.
I exaggerated, of course, when I said that we are still using ancient technology for information retrieval. The basic concept of indexes--searching by keywords--may be the same, but the implementation is a world apart from the Sumerian clay tablets. And information retrieval of today, aided by computers, is not limited to search by keywords. Numerous techniques have been developed in the last 30 years, many of which are described in this book. There are efficient data structures to store indexes, sophisticated query algorithms to search quickly, data compression methods, and special hardware, to name just a few areas of extraordinary advances. Considerable progress has been made for even seemingly elementary problems, such as how to find a given pattern in a large text with or without preprocessing the text.

Although most people do not yet enjoy the power of computerized search, and those who do cry for better and more powerful methods, we expect major changes in the next 10 years or even sooner. The wonderful mix of issues presented in this collection, from theory to practice, from software to hardware, is sure to be of great help to anyone with an interest in information retrieval.

An editorial in the Australian Library Journal in 1974 states that "the history of cataloging is exceptional in that it is endlessly repetitive. Each generation rethinks and reformulates the same basic problems, reframing them in new contexts and restating them in new terminology." The history of computerized cataloging is still too young to be in a cycle, and the problems it faces may be old in origin but new in scale and complexity. Information retrieval, as is evident from this book, has grown into a broad area of study. I dare to predict that it will prosper. Oliver Wendell Holmes wrote in 1872 that "It is the province of knowledge to speak and it is the privilege of wisdom to listen."
Maybe, just maybe, we will also be able to say in the future that it is the province of knowledge to write and it is the privilege of wisdom to query.
Information Retrieval: PREFACE
PREFACE
Text is the primary way that human knowledge is stored, and after speech, the primary way it is transmitted. Techniques for storing and searching for textual documents are nearly as old as written language itself. Computing, however, has changed the ways text is stored, searched, and retrieved. In traditional library indexing, for example, documents could only be accessed by a small number of index terms such as title, author, and a few subject headings. With automated systems, the number of indexing terms that can be used for an item is virtually limitless.

The subfield of computer science that deals with the automated storage and retrieval of documents is called information retrieval (IR). Automated IR systems were originally developed to help manage the huge scientific literature that has developed since the 1940s, and this is still the most common use of IR systems. IR systems are in widespread use in university, corporate, and public libraries. IR techniques have also been found useful, however, in such disparate areas as office automation and software engineering. Indeed, any field that relies on documents to do its work could potentially benefit from IR techniques.

IR shares concerns with many other computer subdisciplines, such as artificial intelligence, multimedia systems, parallel computing, and human factors. Yet, in our observation, IR is not widely known in the computer science community. It is often confused with DBMS--a field with which it shares concerns and yet from which it is distinct. We hope that this book will make IR techniques more widely known and used.

Data structures and algorithms are fundamental to computer science. Yet, despite a large IR literature, the basic data structures and algorithms of IR have never been collected in a book. This is the need that we are attempting to fill. In discussing IR data structures and algorithms, we attempt to be evaluative as well as descriptive.
We discuss relevant empirical studies that have compared the algorithms and data structures, and some of the most important algorithms are presented in detail, including implementations in C. Our primary audience is software engineers building systems with text processing components. Students of computer science, information science, library science, and other disciplines who are interested in text retrieval technology should also find the book useful. Finally, we hope that information retrieval researchers will use the book as a basis for future research.

Bill Frakes
Ricardo Baeza-Yates
ACKNOWLEDGEMENTS
Many people improved this book with their reviews. The authors of the chapters did considerable reviewing of each other's work. Other reviewers include Jim Kirby, Jim O'Connor, Fred Hills, Gloria Hasslacher, and Ruben Prieto-Diaz. All of them have our thanks. Special thanks to Chris Fox, who tested the code on the disk that accompanies the book; to Steve Wartik for his patient unravelling of many LaTeX puzzles; and to Donna Harman for her helpful suggestions.
Information Retrieval: CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE
CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL SYSTEMS
W. B. Frakes
Software Engineering Guild, Sterling, VA 22170

Abstract

This chapter introduces and defines basic IR concepts, and presents a domain model of IR systems that describes their similarities and differences. The domain model is used to introduce and relate the chapters that follow. The relationship of IR systems to other information systems is discussed, as is the evaluation of IR systems.
1.1 INTRODUCTION
Automated information retrieval (IR) systems were originally developed to help manage the huge scientific literature that has developed since the 1940s. Many university, corporate, and public libraries now use IR systems to provide access to books, journals, and other documents. Commercial IR systems offer databases containing millions of documents in myriad subject areas. Dictionary and encyclopedia databases are now widely available for PCs. IR has been found useful in such disparate areas as office automation and software engineering. Indeed, any discipline that relies on documents to do its work could potentially use and benefit from IR.

This book is about the data structures and algorithms needed to build IR systems. An IR system matches user queries--formal statements of information needs--to documents stored in a database. A document is a data object, usually textual, though it may also contain other types of data such as photographs, graphs, and so on. Often, the documents themselves are not stored directly in the IR system, but are represented in the system by document surrogates. This chapter, for example, is a document and could be stored in its entirety in an IR database. One might instead, however, choose to create a document surrogate for it consisting of the title, author, and abstract. This is typically done for efficiency, that is, to reduce the size of the database and searching time. Document surrogates are also called documents, and in the rest of the book we will use document to denote both documents and document surrogates.

An IR system must support certain basic operations. There must be a way to enter documents into a database, change the documents, and delete them. There must also be some way to search for documents, and present them to a user. As the following chapters illustrate, IR systems vary greatly in the ways they accomplish these tasks. In the next section, the similarities and differences among IR systems are discussed.
1.2 A DOMAIN ANALYSIS OF IR SYSTEMS
This book contains many data structures, algorithms, and techniques. In order to find, understand, and use them effectively, it is necessary to have a conceptual framework for them. Domain analysis--systems analysis for multiple related systems--described in Prieto-Diaz and Arango (1991), is a method for developing such a
framework. Via domain analysis, one attempts to discover and record the similarities and differences among related systems.

The first steps in domain analysis are to identify important concepts and vocabulary in the domain, define them, and organize them with a faceted classification. Table 1.1 is a faceted classification for IR systems, containing important IR concepts and vocabulary. The first row of the table specifies the facets--that is, the attributes that IR systems share. Facets represent the parts of IR systems that will tend to be constant from system to system. For example, all IR systems must have a database structure--they vary in the database structures they have; some have inverted file structures, some have flat file structures, and so on.

A given IR system can be classified by the facets and facet values, called terms, that it has. For example, the CATALOG system (Frakes 1984) discussed in Chapter 8 can be classified as shown in Table 1.2. Terms within a facet are not mutually exclusive, and more than one term from a facet can be used for a given system. Some decisions constrain others. If one chooses a Boolean conceptual model, for example, then one must choose a parse method for queries.

Table 1.1: Faceted Classification of IR Systems (numbers in parentheses indicate chapters)

Conceptual Model: Boolean (1), Extended Boolean (15), Probabilistic (14), String Search (10), Vector Space (14)
File Structure: Flat File (10), Inverted File (3), Signature (4), Pat Trees (5), Graphs (1), Hashing (13)
Query Operations: Feedback (11), Parse (3,7), Boolean (12), Cluster (16)
Term Operations: Stem (8), Weight (14), Stoplist (7), Thesaurus (9), Truncation (10)
Document Operations: Parse (3,7), Display, Sort (1), Field Mask (1), Rank (14), Cluster (16), Assign IDs (3)
Hardware: von Neumann (1), Parallel (18), IR Specific (17), Optical Disk (6), Mag. Disk (1)
Table 1.2: Facets and Terms for CATALOG IR System

File Structure: Inverted file
Query Operations: Parse, Boolean
Term Operations: Stem, Stoplist, Truncation
Hardware: von Neumann, Mag. Disk
Document Operations: Parse, Display, Sort, Field Mask, Assign IDs
Conceptual Model: Boolean
Viewed another way, each facet is a design decision point in developing the architecture for an IR system. The system designer must choose, for each facet, from the alternative terms for that facet. We will now discuss the facets and their terms in greater detail.
1.2.1 Conceptual Models of IR
The most general facet in the previous classification scheme is conceptual model. An IR conceptual model is a general approach to IR systems. Several taxonomies for IR conceptual models have been proposed. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match. The exact match category contains text pattern search and Boolean search techniques. The inexact match category contains such techniques as probabilistic, vector space, and clustering, among others. The problem with these taxonomies is that the categories are not mutually exclusive, and a single system may contain aspects of many of them.

Almost all of the IR systems fielded today are either Boolean IR systems or text pattern search systems. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. The grep family of tools in the UNIX environment, described in Earhart (1986), is a well-known example of text pattern searchers. Data structures and algorithms for text pattern searching are discussed in Chapter 10.

Almost all of the IR systems for searching large document collections are Boolean systems. In a Boolean IR system, documents are represented by sets of keywords, usually stored in an inverted file. An inverted file is a list of keywords and identifiers of the documents in which they occur. Boolean list operations are discussed in Chapter 12. Boolean queries are keywords connected with Boolean logical operators (AND, OR, NOT). While Boolean systems have been criticized (see Belkin and Croft [1987] for a summary), improving their retrieval effectiveness has been
difficult. Some extensions to the Boolean model that may improve IR performance are discussed in Chapter 15.

Researchers have also tried to improve IR performance by using information about the statistical distribution of terms, that is, the frequencies with which terms occur in documents, document collections, or subsets of document collections such as documents considered relevant to a query. Term distributions are exploited within the context of some statistical model such as the vector space model, the probabilistic model, or the clustering model. These are discussed in Belkin and Croft (1987). Using these models and information about term distributions, it is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance. Ranking is useful because of the large document sets that are often retrieved. Ranking algorithms using the vector space model and the probabilistic model are discussed in Chapter 14. Ranking algorithms that use information about previous searches to modify queries are discussed in Chapter 11 on relevance feedback.

In addition to the ranking algorithms discussed in Chapter 14, it is possible to group (cluster) documents based on the terms that they contain and to retrieve from these groups using a ranking methodology. Methods for clustering documents and retrieving from these clusters are discussed in Chapter 16.
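The core of the Boolean approach--an inverted file mapping keywords to sets of document identifiers, with queries evaluated by set operations--can be sketched in a few lines. This is a minimal illustration, not an implementation from the book; the document texts and query terms below are invented.

```python
# A minimal sketch of Boolean retrieval over an inverted file.
documents = {
    1: "data structures for information retrieval",
    2: "boolean retrieval with inverted files",
    3: "signature files and hashing",
}

# Build the inverted file: keyword -> set of document identifiers.
inverted = {}
for doc_id, text in documents.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

def lookup(term):
    """Return the set of identifiers of documents containing the term."""
    return inverted.get(term, set())

# AND is set intersection; OR is set union; NOT would be set difference.
and_result = lookup("retrieval") & lookup("files")
or_result = lookup("signature") | lookup("boolean")

print(sorted(and_result))  # documents containing both terms
print(sorted(or_result))   # documents containing either term
```

Real systems store the inverted file on disk and merge sorted posting lists rather than building in-memory sets, but the query semantics are the same.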
1.2.2 File Structures
A fundamental decision in the design of IR systems is which type of file structure to use for the underlying document database. As can be seen in Table 1.1, the file structures used in IR systems are flat files, inverted files, signature files, PAT trees, and graphs. Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.

Using a flat file approach, one or more documents are stored in a file, usually as ASCII or EBCDIC text. Flat file searching (Chapter 10) is usually done via pattern matching. On UNIX, for example, one can store a document collection one document per file in a UNIX directory, and search it using pattern searching tools such as grep (Earhart 1986) or awk (Aho, Kernighan, and Weinberger 1988).

An inverted file (Chapter 3) is a kind of indexed file. The structure of an inverted file entry is usually keyword, document-ID, field-ID. A keyword is an indexing term that describes the document, document-ID is a unique identifier for a document, and field-ID is a unique name that indicates from which field in the document the keyword came. Some systems also include information about the paragraph and sentence location where the term occurs. Searching is done by looking up query terms in the inverted file.

Signature files (Chapter 4) contain signatures--bit patterns--that represent documents. There are various ways of constructing signatures. Using one common signature method, for example, documents are split into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist; see below) words. Each word in the block is hashed to give a signature--a bit pattern with some of the bits set to 1. The signatures of the words in a block are OR'ed together to create a block signature. The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
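The block-signature scheme just described can be sketched as follows. The signature width, the number of bits set per word, and the use of MD5 as the hash are illustrative choices made for this sketch, not parameters from the chapter.

```python
import hashlib

SIG_BITS = 64       # width of each signature (illustrative choice)
BITS_PER_WORD = 3   # number of bits each word sets (illustrative choice)

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.md5(f"{word}:{i}".encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << bit
    return sig

def block_signature(words):
    """OR together the signatures of the words in a logical block."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig

def may_contain(block_sig, word):
    """A block may contain the word only if all the word's bits are set.
    False positives are possible; false negatives are not."""
    ws = word_signature(word)
    return block_sig & ws == ws

block = block_signature(["inverted", "file", "search"])
print(may_contain(block, "inverted"))  # True: no false negatives
```

Because distinct words can collide on the same bits, a match against the signature only says the block *may* contain the word; the actual text (or an index) must confirm it.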
PAT trees (Chapter 5) are Patricia trees constructed over all sistrings in a text. If a document collection is viewed as a sequentially numbered array of characters, a sistring is a subsequence of characters from the array starting at a given point and extending an arbitrary distance to the right. A Patricia tree is a digital tree where the individual bits of the keys are used to decide branching.
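Chapter 5 treats PAT trees and PAT arrays in detail; as a rough illustration of sistrings, the sketch below builds a sorted array of sistring start positions (a simple PAT array) and binary-searches it for every position where a given pattern begins. The example text is invented, and a real implementation would store only the positions, not the sistrings themselves.

```python
import bisect

text = "information retrieval"

# Each position i starts a sistring: the string extending from text[i]
# to the right. Sorting the sistrings yields a simple PAT array.
suffixes = sorted((text[i:], i) for i in range(len(text)))
keys = [s for s, _ in suffixes]

def occurrences(pattern):
    """Return the text positions where sistrings begin with pattern."""
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_left(keys, pattern + "\uffff")  # end of the prefix range
    return sorted(i for _, i in suffixes[lo:hi])

print(occurrences("tion"))  # start position of each occurrence of "tion"
print(occurrences("r"))     # start position of each 'r'
```

Because the sistrings are sorted, any substring query becomes a prefix query answered by two binary searches, which is the key property the PAT structures of Chapter 5 exploit at scale.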
Graphs, or networks, are ordered collections of nodes connected by arcs. They can be used to represent documents in various ways. For example, a kind of graph called a semantic net can be used to represent the semantic relationships in text often lost in the indexing systems above. Although interesting, graph-based techniques for IR are impractical now because of the amount of manual effort that would be needed to represent a large document collection in this form. Since graph-based approaches are currently impractical, we have not covered them in detail in this book.
1.2.3 Query Operations
Queries are formal statements of information needs put to the IR system by users. The operations on queries are obviously a function of the type of query, and the capabilities of the IR system. One common query operation is parsing (Chapters 3 and 7), that is, breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators (Chapter 12).

In feedback (Chapter 11), information from previous searches is used to modify queries. For example, terms from relevant documents found by a query may be added to the query, and terms from nonrelevant documents deleted. There is some evidence that feedback can significantly improve IR performance.
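As a toy illustration of feedback, a query might be modified as below by adding terms from judged-relevant documents and deleting terms that occur only in judged-nonrelevant ones. Real feedback algorithms (Chapter 11) reweight terms rather than simply adding and deleting them; the documents and query here are invented.

```python
def modify_query(query_terms, relevant_docs, nonrelevant_docs):
    """Naive feedback: add terms from relevant documents and drop
    terms that appear only in nonrelevant ones."""
    terms = set(query_terms)
    relevant_terms = set()
    for doc in relevant_docs:
        relevant_terms |= set(doc.split())
    nonrelevant_terms = set()
    for doc in nonrelevant_docs:
        nonrelevant_terms |= set(doc.split())
    terms |= relevant_terms
    terms -= (nonrelevant_terms - relevant_terms)
    return sorted(terms)

new_query = modify_query(
    ["parallel", "retrieval"],
    relevant_docs=["parallel algorithms for text retrieval"],
    nonrelevant_docs=["parallel circuits"],
)
print(new_query)
```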
1.2.4 Term Operations
Operations on terms in an IR system include stemming (Chapter 8), truncation (Chapter 10), weighting (Chapter 14), and stoplist (Chapter 7) and thesaurus (Chapter 9) operations. Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. Truncation is manual conflation of terms by using wildcard characters in the word, so that the truncated term will match multiple words. For example, a searcher interested in finding documents about truncation might enter the term "truncat?" which would match terms such as truncate, truncated, and truncation. Another way of conflating related terms is with a thesaurus, which lists synonymous terms and sometimes the relationships among them. A stoplist is a list of words considered to have no indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there. In term weighting, indexing or query terms are assigned numerical values usually based on information about the statistical distribution of terms, that is, the frequencies with which terms occur in documents, document collections, or subsets of document collections such as documents considered relevant to a query.
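Two of these term operations--stoplist filtering and truncation matching--can be sketched directly. The stoplist and sample text are illustrative only, and Python's `fnmatch`-style `*` wildcard stands in for the `?` truncation character mentioned above.

```python
from fnmatch import fnmatchcase

stoplist = {"the", "of", "and", "a", "in"}   # a tiny illustrative stoplist

def index_terms(text):
    """Keep only potential indexing terms: words not on the stoplist."""
    return [w for w in text.lower().split() if w not in stoplist]

def truncation_match(pattern, terms):
    """Match a truncated term (e.g. 'truncat*') against index terms."""
    return [t for t in terms if fnmatchcase(t, pattern)]

terms = index_terms("The truncation of terms and the truncated index")
print(terms)
print(truncation_match("truncat*", terms))
```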
1.2.5 Document Operations
Documents are the primary objects in IR systems and there are many operations for them. In many types of IR systems, documents added to a database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms. Once in the database, one sometimes wishes to mask off certain fields for searching and display. For example, the searcher may wish to search only the title and abstract fields of documents for a given query, or may wish to see only the title and author of retrieved documents. One may also wish to sort retrieved documents by some field, for example by author. There are many sorting algorithms, and because of the generality of the subject we have not covered them in this book. A good description of sorting algorithms in C can be found in Sedgewick (1990). Display operations include printing the documents, and
displaying them on a CRT. Using information about term distributions, it is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance (Chapter 14). Term distribution information can also be used to cluster similar documents in a document space (Chapter 16). Another important document operation is display. The user interface of an IR system, as with any other type of information system, is critical to its successful usage. Since user interface algorithms and data structures are not IR specific, we have not covered them in detail here.
1.2.6 Hardware for IR
Hardware affects the design of IR systems because it determines, in part, the operating speed of an IR system--a crucial factor in interactive information systems--and the amounts and types of information that can be stored practically in an IR system. Most IR systems in use today are implemented on von Neumann machines--general purpose computers with a single processor. Most of the discussion of IR techniques in this book assumes a von Neumann machine as an implementation platform. The computing speeds of these machines have improved enormously over the years, yet there are still IR applications for which they may be too slow.

In response to this problem, some researchers have examined alternative hardware for implementing IR systems. There are two approaches--parallel computers and IR-specific hardware. Chapter 18 discusses implementation of an IR system on the Connection Machine--a massively parallel computer with 64,000 processors. Chapter 17 discusses IR-specific hardware--machines designed specifically to handle IR operations. IR-specific hardware has been developed both for text scanning and for common operations like Boolean set combination.

Along with the need for greater speed has come the need for storage media capable of compactly holding the huge document databases that have proliferated. Optical storage technology, capable of holding gigabytes of information on a single disk, has met this need. Chapter 6 discusses data structures and algorithms that allow optical disk technology to be successfully exploited for IR.
1.2.7 Functional View of Paradigm IR System
Figure 1.1 shows the activities associated with a common type of Boolean IR system, chosen because it represents the operational standard for IR systems.
Figure 1.1: Example of Boolean IR system

When building the database, documents are taken one by one, and their text is broken into words. The words from the documents are compared against a stoplist--a list of words thought to have no indexing value. Words from the document not found in the stoplist may next be stemmed. Words may then also be counted, since the frequency of words in documents and in the database as a whole is often used for ranking retrieved documents. Finally, the words and associated information such as the documents, fields within the documents, and counts are put into the database. The database then might consist of pairs of document identifiers and keywords as follows.
2 keyword3 .document1-Field_2 keyword2 . OR. and its fields.Information Retrieval: CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE keyword1 . if field operations are supported. 1. the retrieved set may be ranked in order of probable relevance. In some systems.document3-Field_1. DBMS. the user makes judgments about the relevance of the retrieved documents. Stemming (Chapter 8) is a technique for conflating term variants so that the semantic closeness of words like "engineer. and artificial intelligence (AI) systems? Table 1. and this information is used to modify the query automatically by adding terms from relevant documents and deleting terms from nonrelevant documents. as discussed in Chapter 9. The query is parsed into its constituent terms and Boolean operators. a user enters a query consisting of a set of keywords connected by Boolean operators (AND. One such technique aims to establish a connection between morphologically related terms. These terms are then looked up in the inverted file and the list of document identifiers corresponding to them are combined according to the specified Boolean operators. To search the database. The result of the search is then presented to the user.. or synonym lists. Systems such as this give remarkably good retrieval performance given their simplicity.3 IR AND OTHER TYPES OF INFORMATION SYSTEMS How do IR systems relate to different types of information systems such as database management systems (DBMS)." and "engineering" will be recognized in searching.document1-Field_2. In an IR system.." "engineered.htm (7 of 11)7/3/2004 4:19:21 PM . each document must have a unique identifier. NOT). but their performance is far from perfect. j Such a structure is called an inverted file.document-n-Field_i.document3-Field_3.3: IR. 5 keyword2 .3 summarizes some of the similarities and differences.Books_Algorithms_Collection2ed/books/book5/chap01. Many techniques to improve them have been proposed. 
One difference among IR, DBMS, and AI systems is the amount of usable structure in their data objects. The three kinds of systems can be compared as follows:

                      IR                          DBMS                         AI
Data Object           document                    table                        logical statements
Primary Operation     retrieval (probabilistic)   retrieval (deterministic)    inference
Database Size         small to very large         small to very large          usually small

Documents, being primarily text, in general have less usable structure than the tables of data used by relational DBMS, and structures such as frames and semantic nets used by AI systems. It is possible, of course, to analyze a document manually and store information about its syntax and semantics in a DBMS or an AI system. The barriers for doing this to a large collection of documents are practical rather than theoretical: the work involved in doing knowledge engineering on a set of, say, 50,000 documents would be enormous. Researchers have devoted much effort to constructing hybrid systems using IR, DBMS, AI, and other techniques; see, for example, Tong (1989). The hope is to eventually develop practical systems that combine IR, DBMS, and AI.

Another distinguishing feature of IR systems is that retrieval is probabilistic. That is, one cannot be certain that a retrieved document will meet the information need of the user. In a typical search in an IR system, some relevant documents will be missed and some nonrelevant documents will be retrieved. This may be contrasted with retrieval from, for example, a DBMS, where retrieval is deterministic: queries consist of attribute-value pairs that either match, or do not match, records in the database.

One feature of IR systems shared with many DBMS is that their databases are often very large--sometimes in the gigabyte range. Book library systems, for example, may contain several million records. Commercial on-line retrieval services such as Dialog and BRS provide databases of many gigabytes. The need to search such large collections in real time places severe demands on the systems used to search them. Selection of the best data structures and algorithms to build such systems is often critical. Another feature that IR systems share with DBMS is database volatility. A typical large IR application, such as a book library system or commercial document retrieval service, will change constantly as documents are added, changed, and deleted. This constrains the kinds of data structures and algorithms that can be used for IR.

In summary, a typical IR system must meet the following functional and nonfunctional requirements. It must allow a user to add, delete, and change documents in the database. It must provide a way for users to search for documents by entering queries, and examine the retrieved documents. It must accommodate databases in the megabyte to gigabyte range, and retrieve relevant documents in response to queries interactively--often within 1 to 10 seconds.

1.4 IR SYSTEM EVALUATION

IR systems can be evaluated in terms of many criteria including execution efficiency, storage efficiency, retrieval effectiveness, and the features they offer a user.
The relative importance of these factors must be decided by the designers of the system, and the selection of appropriate data structures and algorithms for implementation will depend on these decisions.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a computation. This can be measured in C based systems by using profiling tools such as prof (Earhart 1986) on UNIX. Execution efficiency has always been a major concern of IR systems since most of them are interactive, and a long retrieval time will interfere with the usefulness of the system. The nonfunctional requirements of IR systems usually specify maximum acceptable times for searching, and for database maintenance operations such as adding and deleting documents.

Storage efficiency is measured by the number of bytes needed to store data. Space overhead, a common measure of storage efficiency, is the ratio of the size of the index files plus the size of the document files over the size of the document files. Space overhead ratios of from 1.5 to 3 are typical for IR systems based on inverted files.

Most IR experimentation has focused on retrieval effectiveness--usually based on document relevance judgments. This has been a problem since relevance judgments are subjective and unreliable. That is, different judges will assign different relevance values to a document retrieved in response to a given query. The seriousness of the problem is the subject of debate, with many IR researchers arguing that the relevance judgment reliability problem is not sufficient to invalidate the experiments that use relevance judgments. A detailed discussion of the issues involved in IR experimentation can be found in Salton and McGill (1983) and Sparck-Jones (1981).

Many measures of retrieval effectiveness have been proposed. The most commonly used are recall and precision. Recall is the ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database. Except for small test collections, this denominator is generally unknown and must be estimated by sampling or some other method. Precision is the ratio of the number of relevant documents retrieved over the total number of documents retrieved. Both recall and precision take on values between 0 and 1.

Since one often wishes to compare IR performance in terms of both recall and precision, methods for evaluating them simultaneously have been developed. One method involves the use of recall-precision graphs--bivariate plots where one axis is recall and the other precision. Figure 1.2 shows an example of such a plot. Such plots can be done for individual queries, or averaged over queries as described in Salton and McGill (1983) and van Rijsbergen (1979). Recall-precision plots show that recall and precision are inversely related. That is, when precision goes up, recall typically goes down and vice-versa.

Figure 1.2: Recall-precision graph

A combined measure of recall and precision, E, has been developed by van Rijsbergen (1979). The evaluation measure E is defined as:

    E = 1 - ((1 + b^2) P R) / (b^2 P + R)

where P = precision, R = recall, and b is a measure of the relative importance, to a user, of recall and precision. Experimenters choose values of b that they hope will reflect the recall and precision interests of the typical user. For example, b levels of 0.5, indicating that a user was twice as interested in precision as recall, and 2, indicating that a user was twice as interested in recall as precision, might be used.
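These measures can be sketched in code as follows; the document identifiers are invented for illustration, and the E measure is computed exactly as defined above.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def e_measure(p, r, b=1.0):
    """van Rijsbergen's E; b > 1 emphasizes recall, b < 1 precision.
    Lower E is better."""
    if p == 0.0 or r == 0.0:
        return 1.0
    return 1.0 - ((1 + b * b) * p * r) / (b * b * p + r)

# A query retrieved documents 1-4; documents 2, 4, 5, 6 are relevant.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5, 6])
print(p, r)             # 0.5 0.5
print(e_measure(p, r))  # 0.5
```

With b = 2 the same precision and recall give a different E, reflecting the heavier weight on recall.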
IR experiments often use test collections, which consist of a document database and a set of queries for the database for which relevance judgments are available. The number of documents in test collections has tended to be small, typically a few hundred to a few thousand documents. Test collections are available on an optical disk (Fox 1990). Table 1.4 summarizes the test collections on this disk.

Table 1.4: IR Test Collections

Collection   Subject                   Documents   Queries
----------------------------------------------------------
ADI          Information Science              82        35
CACM         Computer Science               3200        64
CISI         Library Science                1460        76
CRAN         Aeronautics                    1400       225
LISA         Library Science                6004        35
MED          Medicine                       1033        30
NLM          Medicine                       3078       155
NPL          Electrical Engineering        11429       100
TIME         General Articles                423        83

IR experiments using such small collections have been criticized as not being realistic. Since real IR databases typically contain much larger collections of documents, some in the megabyte to gigabyte range, the generalizability of experiments using small test collections has been questioned.

1.5 SUMMARY

This chapter introduced and defined basic IR concepts, and presented a domain model of IR systems that describes their similarities and differences. A typical IR system must meet the following functional and nonfunctional requirements. It must allow a user to add, delete, and change documents in the database. It must provide a way for users to search for documents by entering queries, and examine the retrieved documents. An IR system will typically need to support large databases, some in the megabyte to gigabyte range, and retrieve relevant documents in response to queries interactively--often within 1 to 10 seconds. We have summarized the various approaches, elaborated in subsequent chapters, taken by IR systems in providing these services. Evaluation techniques for IR systems were also briefly surveyed. The next chapter is an introduction to data structures and algorithms.
REFERENCES

AHO, A., B. KERNIGHAN, and P. WEINBERGER. 1988. The AWK Programming Language. Reading, Mass.: Addison-Wesley.

BELKIN, N., and W. B. CROFT. 1987. "Retrieval Techniques," in Annual Review of Information Science and Technology, ed. M. Williams. New York: Elsevier Science Publishers, 109-145.

EARHART, S., ed. 1986. The UNIX Programming Language. New York: Holt, Rinehart, and Winston.

FALOUTSOS, C. 1985. "Access Methods for Text." Computing Surveys, 17(1), 49-74.

FOX, E. 1990. Virginia Disk One. Blacksburg: Virginia Polytechnic Institute and State University.

FRAKES, W. B. 1984. "Term Conflation for Information Retrieval," in Research and Development in Information Retrieval, ed. C. J. van Rijsbergen. Cambridge: Cambridge University Press.

PRIETO-DIAZ, R., and G. ARANGO, eds. 1991. Domain Analysis: Acquisition of Reusable Information for Software Construction. New York: IEEE Press.

SALTON, G., and M. MCGILL. 1983. An Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SEDGEWICK, R. 1990. Algorithms in C. Reading, Mass.: Addison-Wesley.

SPARCK-JONES, K., ed. 1981. Information Retrieval Experiment. London: Butterworths.

TONG, R., ed. 1989. Special Issue on Knowledge Based Techniques for Information Retrieval, International Journal of Intelligent Systems, 4(3).

VAN RIJSBERGEN, C. J. 1979. Information Retrieval. London: Butterworths.
CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND ALGORITHMS RELATED TO INFORMATION RETRIEVAL

Ricardo A. Baeza-Yates

Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile

Abstract

In this chapter we review the main concepts and data structures used in information retrieval. We distinguish three main classes of algorithms and give examples of their use: retrieval, indexing, and filtering algorithms.

2.1 INTRODUCTION

Information retrieval (IR) is a multidisciplinary field. In this sense, many contributions from theoretical computer science have practical and regular use in IR systems. In this chapter we study data structures and algorithms used in the implementation of IR systems. The presentation level is introductory, and assumes some programming knowledge as well as some theoretical computer science background. We do not include code because it is given in most standard textbooks. For good C or Pascal code we suggest the Handbook of Algorithms and Data Structures of Gonnet and Baeza-Yates (1991).

The first section covers some basic concepts: strings, regular expressions, and finite automata. In section 2.3 we have a look at the three classical foundations of structuring data in IR: search trees, hashing, and digital trees. We give the main performance measures of each structure and the associated trade-offs. In section 2.4 we attempt to classify IR algorithms based on their actions. These are retrieval, indexing, and filtering algorithms.

2.2 BASIC CONCEPTS

We start by reviewing basic concepts related with text: strings, regular expressions (as a general query language), and finite automata (as the basic text processing machine). Regular expressions provide a powerful query language, such that word searching or Boolean expressions are particular cases of it. Finite automata are used for string searching (either by software or hardware), and in different ways of text filtering and processing.
Strings appear everywhere, and the simplest model of text is a single long string.

2.2.1 Strings

We use Σ to denote the alphabet (a set of symbols). We say that the alphabet is finite if there exists a bound in the size of the alphabet, denoted by |Σ|. Otherwise, if we do not know a priori a bound in the alphabet size, we say that the alphabet is arbitrary. A string over an alphabet Σ is a finite length sequence of symbols from Σ. The empty string (ε) is the string with no symbols. If x and y are strings, xy denotes the concatenation of x and y. The length of a string x (|x|) is the number of symbols of x. If w = xyz is a string, then x is a prefix, and z a suffix, of w. Any contiguous sequence of letters y from a string is called a substring. If the letters do not have to be contiguous, we say that y is a subsequence.

A language over an alphabet Σ is a set of strings over Σ. Let L1 and L2 be two languages. The language {xy | x in L1 and y in L2} is called the concatenation of L1 and L2 and is denoted by L1L2. If L is a language, we define L^0 = {ε} and L^i = LL^(i-1) for i ≥ 1. The star or Kleene closure of L, L*, is the union of L^i for all i ≥ 0. The plus or positive closure is defined by L+ = LL*.

2.2.2 Similarity between Strings

When manipulating strings, we need to know how similar a pair of strings are. For this purpose, several similarity measures have been defined. Each similarity model is defined by a distance function d, such that for any strings s1, s2, and s3, it satisfies the following properties:

    d(s1, s1) = 0,    d(s1, s2) ≥ 0,    d(s1, s3) ≤ d(s1, s2) + d(s2, s3)

The two main distance functions are as follows:

The Hamming distance is defined over strings of the same length. The function d is defined as the number of symbols in the same position that are different (number of mismatches). For example, d(text, that) = 2.

The edit distance is defined as the minimal number of symbols that it is necessary to insert, delete, or substitute to transform a string s1 into s2. Clearly, d(s1, s2) ≥ |length(s1) - length(s2)|. For example, d(text, tax) = 2.
2.2.3 Regular Expressions

We use the usual definition of regular expressions (RE for short), defined by the operations of concatenation, union (+) and star or Kleene closure (*) (Hopcroft and Ullman 1979). The regular expressions over Σ and the languages that they denote (regular sets or regular languages) are defined recursively as follows:

The empty set is denoted by a regular expression.

ε (empty string) is a regular expression and denotes the set {ε}.

For each symbol a in Σ, a is a regular expression and denotes the set {a}.

If p and q are regular expressions, then p + q (union), pq (concatenation), and p* (star) are regular expressions that denote L(p) union L(q), L(p)L(q), and L(p)*, respectively. We use L(r) to represent the set of strings in the language denoted by the regular expression r.

To avoid unnecessary parentheses we adopt the convention that the star operator has the highest precedence, then concatenation, then union. All operators are left associative. We also use:

Σ to denote any symbol from Σ (when the ambiguity is clearly resolvable by context).

r? to denote zero or one occurrence of r (that is, r? = ε + r).

[a1 .. am] to denote a range of symbols from Σ. For this we need an order in Σ.

r^k to denote at most k concatenations of r (finite closure).

Examples: All the examples given here arise from the Oxford English Dictionary:

1. All citations to an author with prefix Scot followed by at most 80 arbitrary characters then by works beginning with the prefix Kenilw or Discov:

    <A>Scot Σ^80 <W>(Kenilw + Discov)

where < and > are characters in the OED text that denote tags (A for author, W for work).

2. All first citations accredited to Shakespeare between 1610-11:

    <EQ>(<LQ>)?<Q><D>161(0+1)</D><A>Shak

where EQ stands for the earliest quotation tag, LQ for quotation label, Q for the quotation itself, and D for date.

3. All "bl" tags (lemma in bold) containing a single word consisting of lowercase alphabetical characters only:

    <bl>[a..z]*</bl>

4. All references to author W. Scott:

    <A>((Sir b)? W)? b Scott b? </A>

where b denotes a literal space.

Sometimes, when searching in plain text, we restrict the query to a subset of regular languages, for example, where we only allow single strings as valid queries. In that case, we have the exact string matching problem. In general, we use regular languages as our query domain.
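The notation above maps directly onto the regular expression syntax of tools such as grep or Python's re module: the book's range [a..z] becomes the character class [a-z], union + becomes |, and Σ^80 becomes .{0,80}. A sketch follows, using a small made-up fragment of tagged text (not real OED data):

```python
import re

# Example 3 above, <bl>[a..z]*</bl>, written in Python's re syntax.
text = "<bl>abandon</bl> ... <bl>Abb2</bl> ... <bl>zeal</bl>"
print(re.findall(r"<bl>[a-z]*</bl>", text))
# ['<bl>abandon</bl>', '<bl>zeal</bl>']

# Example 1's union (Kenilw + Discov) becomes (?:Kenilw|Discov), and the
# finite closure of at most 80 arbitrary characters becomes .{0,80}?.
pattern = re.compile(r"<A>Scot.{0,80}?<W>(?:Kenilw|Discov)")
print(bool(pattern.search("<A>Scott, Sir W.</A><W>Kenilworth")))  # True
```

Note that <bl>Abb2</bl> is correctly rejected: an uppercase letter and a digit fall outside the class [a-z].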
2.2.4 Finite Automata

A finite automaton is a mathematical model of a system. The automaton can be in any one of a finite number of states and is driven from state to state by a sequence of discrete inputs. Figure 2.1 depicts an automaton reading its input from a tape.

Figure 2.1: A finite automaton

Formally, a finite automaton (FA) is defined by a 5-tuple (Q, Σ, δ, q0, F) (see Hopcroft and Ullman [1979]), where Q is a finite set of states, Σ is a finite input alphabet, q0 in Q is the initial state, F contained in Q is the set of final states, and δ is the (partial) transition function mapping Q × (Σ + {ε}) to zero or more elements of Q. That is, δ(q, a) describes the next state(s), for each state q and input symbol a, or is undefined.

A finite automaton starts in state q0 reading the input symbols from a tape. In one move, the FA in state q and reading symbol a enters state(s) δ(q, a), and moves the reading head one position to the right. If δ(q, a) belongs to F, we say that the FA has accepted the string written on its input tape up to the last symbol read. If δ(q, a) has a unique value for every q and a, we say that the FA is deterministic (DFA); otherwise we say that it is nondeterministic (NFA). A finite automaton is called partial if the δ function is not defined for all possible symbols of Σ for each state. In that case, there is an implicit error state, not belonging to F, for every undefined transition.

The languages accepted by finite automata (either DFAs or NFAs) are the regular languages. In other words, for any regular expression r there exists a FA that accepts L(r), and given a DFA or NFA, we can express the language that it recognizes as a RE. There is a simple algorithm that, given a regular expression r, constructs a NFA that accepts L(r) in O(|r|) time and space. There are also algorithms to convert a NFA to a NFA without ε transitions (O(|r|^2) states) and to a DFA (O(2^|r|) states in the worst case). A DFA is called minimal if it has the minimum possible number of states. There exists an O(|Σ| n log n) algorithm to minimize a DFA with n states.

DFAs will be used in this book as searching machines. Figure 2.2 shows the DFA that searches for an occurrence of the fourth query of the previous section in a text. The double circled state is the final state of the DFA. All the transitions are shown with the exception of the transition from every state (other than states 2 and 3) to state 1 upon reading a <, and the default transition from all the states to state 0 when there is no transition defined for the read symbol.

Figure 2.2: DFA example for <A>((Sir b)? W)? b Scott b? </A>

Usually, the searching time depends on how the transitions are implemented. If the alphabet is known and finite, using a table we have constant time per transition and thus O(n) searching time. If the alphabet is not known in advance, we can use an ordered table in each state; in this case, the searching time is O(n log m). Another possibility would be to use a hashing table in each state, achieving constant time per transition on average.
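A table-driven searching machine for the exact string matching case can be sketched as follows. This is the standard KMP-style DFA construction (a classic technique, not taken from this chapter): one transition per input symbol, hence O(n) search time, as described above. The text searched is the binary example string used in the trie figures later in the chapter.

```python
def matching_dfa(pattern, alphabet):
    """Build the transition table of a DFA that recognizes texts
    containing `pattern` (classic KMP-style construction)."""
    m = len(pattern)
    delta = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # state reached by the pattern read so far, minus its first symbol
    for j in range(1, m + 1):
        for c in alphabet:
            delta[j][c] = delta[x][c]   # mismatch: fall back as state x would
        if j < m:
            delta[j][pattern[j]] = j + 1  # match: advance
            x = delta[x][pattern[j]]
    return delta

def search(text, pattern, alphabet):
    """Return the position of the first occurrence, or -1.
    Every text symbol must belong to `alphabet`."""
    delta, state = matching_dfa(pattern, alphabet), 0
    for i, c in enumerate(text):
        state = delta[state][c]
        if state == len(pattern):          # reached the final state
            return i - len(pattern) + 1
    return -1

print(search("01100100010111", "0100", "01"))  # 4
```

Each text symbol causes exactly one table lookup, which is the constant-time-per-transition implementation discussed above.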
2.3 DATA STRUCTURES

In this section we cover three basic data structures used to organize data: search trees, digital trees, and hashing. They are used not only for storing text in secondary memory, but also as components in searching algorithms (especially digital trees). We do not describe arrays, because they are a well-known structure that can be used to implement static search tables, bit vectors for set manipulation, and so on. We refer the reader to Gonnet and Baeza-Yates (1991) for search and update algorithms related to the data structures of this section.

These three data structures differ in how a search is performed. In search trees, we use the complete value of a key to direct the search, while in digital trees, the digital (symbol) decomposition is used to direct the search. Hashing, on the other hand, "randomizes" the data order, being able to search faster on average, with the disadvantage that scanning in sequential order is not possible (for example, range searches are expensive).

Some examples of their use in subsequent chapters of this book are:

Search trees: for optical disk files (Chapter 6), prefix B-trees (Chapter 3), stoplists (Chapter 7).

Hashing: hashing itself (Chapter 13), string searching (Chapter 10), associated retrieval, Boolean operations (Chapters 12 and 15), optical disk file structures (Chapter 6), signature files (Chapter 4), stoplists (Chapter 7).

Digital trees: string searching (Chapter 10), suffix trees (Chapter 5), suffix arrays (Chapter 5).

2.3.1 Search Trees

The most well-known search tree is the binary search tree. Each internal node contains a key; the left subtree stores all keys smaller than the parent key, while the right subtree stores all keys larger than the parent key. Trees define a lexicographical order over the data. Binary search trees are adequate for main memory; for secondary memory, multiway search trees are better, because internal nodes are bigger. In particular, we describe a special class of balanced multiway search trees called the B-tree.
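The binary search tree invariant just described can be sketched as follows (the keys here are words, as they would be in a stoplist; an in-order traversal yields them in the lexicographical order the text mentions):

```python
class Node:
    """Binary search tree node: smaller keys go left, larger keys go right."""
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

def in_order(root):
    """In-order traversal yields the keys in lexicographical order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)

root = None
for w in ["text", "data", "tree", "hash", "trie"]:
    root = insert(root, w)
print(list(in_order(root)))  # ['data', 'hash', 'text', 'tree', 'trie']
print(search(root, "hash"), search(root, "index"))  # True False
```

Each comparison uses the complete key value to choose a branch, which is exactly the search-tree behavior contrasted with digital trees above.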
A B-tree of order m is defined as follows: the root has between 2 and 2m keys, while all other internal nodes have between m and 2m keys. If ki is the i-th key of a given internal node, then all keys in the i-th child are smaller than ki, while all the keys in the (i+1)-th child are bigger. All leaves are at the same depth. Usually, a B-tree is used as an index, and all the associated data are stored in the leaves or buckets. This structure is called a B+-tree. An example of a B+-tree of order 2, using bucket size 4, is shown in Figure 2.3. The height of the tree is at most log_{m+1}(n/b) + 2, where n is the number of keys and b is the number of records that can be stored in a leaf.

Figure 2.3: A B+-tree example (Di denotes the primary key i, plus its associated data)

B-trees are mainly used as a primary key access method for large databases in secondary memory. To search a given key, we go down the tree choosing the appropriate branch at each step. The number of disk accesses is equal to the height of the tree.

Updates are done bottom-up. To insert a new record, we search the insertion point. If there is not enough space in the corresponding leaf, we split it, and we promote a key to the previous level. The algorithm is applied recursively, up to the root, if necessary. In that case, the height of the tree increases by one. Deletions are handled in a similar fashion, by merging nodes. Splits provide a minimal storage utilization of 50 percent. On average, the expected storage utilization is ln 2 ≈ .69 (Yao 1979, Baeza-Yates 1989).

To improve storage utilization, several overflow techniques exist. Some of them are:

B*-trees: in case of overflow, we first see if neighboring nodes have space. In that case, a subset of the keys is shifted, avoiding a split. With this technique, 66 percent minimal storage utilization is provided. The main disadvantage is that updates are more expensive (Bayer and McCreight 1972, Knuth 1973).

Partial expansions: buckets of different sizes are used. If an overflow occurs, a bucket is expanded (if possible), or split. Using two bucket sizes of relative ratio 2/3, 66 percent minimal and 81 percent average storage utilization is achieved (Lomet 1987, Baeza-Yates and Larson 1989). This technique does not deteriorate update time.

Adaptive splits: two bucket sizes of relative ratio 1/2 are used. Splits are not symmetric (balanced), and they depend on the insertion point. This technique achieves 77 percent average storage utilization and is robust against nonuniform distributions (low variance) (Baeza-Yates 1990).

A special kind of B-tree, the prefix B-tree (Bayer and Unterauer 1977), supports variable length keys efficiently, as is the case with words. This kind of B-tree is discussed in detail in Chapter 3.
2.3.2 Hashing

A hashing function h(x) maps a key x to an integer in a given range (for example, 0 to m - 1). Hashing functions are designed to produce values uniformly distributed in the given range. For a good discussion about choosing hashing functions, see Ullman (1972), Knuth (1973), and Knott (1975). The hashing value is also called a signature.

A hashing function is used to map a set of keys to slots in a hashing table. If the hashing function gives the same slot for two different keys, we say that we have a collision. Hashing techniques mainly differ in how collisions are handled. There are two classes of collision resolution schemas: open addressing and overflow addressing.

In open addressing (Peterson 1957), the collided key is "rehashed" into the table, by computing a new index value. The most used technique in this class is double hashing, which uses a second hashing function (Bell and Kaman 1970, Guibas and Szemeredi 1978). The main limitation of this technique is that when the table becomes full, some kind of reorganization must be done. Figure 2.4 shows a hashing table of size 13, and the insertion of a key using the hashing function h(x) = x mod 13 (this is only an example, and we do not recommend using this hashing function!).

Figure 2.4: Insertion of a new key using double hashing

In overflow addressing (Williams 1959, Knuth 1973), the collided key is stored in an overflow area, such that all key values with the same hashing value are linked together. The main problem of this schema is that a search may degenerate to a linear search.

Searches follow the insertion path until the given key is found, or not (unsuccessful case). The average search time is constant, for nonfull tables. Because hashing "randomizes" the location of keys, a sequential scan in lexicographical order is not possible. Thus, ordered scanning or range searches are very expensive. More details on hashing can be found in Chapter 13.
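Open addressing with double hashing can be sketched as follows, using the table of size 13 and the h(x) = x mod 13 of Figure 2.4 (which, as the text warns, is for illustration only). The second hashing function here is likewise only an illustrative choice:

```python
class DoubleHashTable:
    """Open addressing: on a collision, probe at a stride given by a
    second hashing function. m = 13 is prime, so any nonzero stride
    eventually visits every slot."""
    def __init__(self, m=13):
        self.m = m
        self.slots = [None] * m

    def _h1(self, key):
        return key % self.m

    def _h2(self, key):
        return 1 + (key % (self.m - 2))  # never 0, so probing always advances

    def insert(self, key):
        i, stride = self._h1(key), self._h2(key)
        for _ in range(self.m):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i
            i = (i + stride) % self.m    # rehash: try the next probe slot
        raise OverflowError("table full; reorganization needed")

    def search(self, key):
        i, stride = self._h1(key), self._h2(key)
        for _ in range(self.m):
            if self.slots[i] is None:
                return False             # unsuccessful search
            if self.slots[i] == key:
                return True
            i = (i + stride) % self.m
        return False

t = DoubleHashTable()
for k in (26, 13, 39):   # all three collide in slot 0 under h(x) = x mod 13
    t.insert(k)
print(t.search(39), t.search(7))  # True False
```

The search retraces the insertion path, stopping at an empty slot in the unsuccessful case, as described above.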
Hashing schemes have also been used for secondary memory. The main difference is that tables have to grow dynamically as the number of keys increases. The main techniques are extendible hashing, which uses hashing on two levels: a directory and a bucket level (Fagin et al. 1979), and linear hashing, which uses an overflow area and grows in a predetermined way (Litwin 1980, Larson 1980, Larson and Kajla 1984). To improve search time on B-trees, and to allow range searches in hashing schemes, several hybrid methods have been devised. Among them, we have to mention the bounded disorder method (Litwin and Lomet 1987), where B+-tree buckets are organized as hashing tables.

For the case of textual databases, a special technique called signature files (Faloutsos 1987) is used most frequently. This technique is covered in detail in Chapter 4 of this book.

2.3.3 Digital Trees

Efficient prefix searching can be done using indices. One of the best indices for prefix searching is a binary digital tree, or binary trie, constructed from a set of substrings of the text. This data structure is used in several algorithms.

Tries are recursive tree structures that use the digital decomposition of strings to represent a set of strings and to direct the searching. Tries were invented by de la Briandais (1959), and the name was suggested by Fredkin (1960), from information retrieval. If the alphabet is ordered, we have a lexicographically ordered tree. The root of the trie uses the first character, the children of the root use the second character, and so on. If the remaining subtrie contains only one string, that string's identity is stored in an external node.

Figure 2.5 shows a binary trie (binary alphabet) for the string "01100100010111 . . ." after inserting all the substrings that start from positions 1 through 8. (In this case, the substring's identity is represented by its starting position in the text.)

Figure 2.5: Binary trie (external node label indicates position in the text) for the first eight suffixes in "01100100010111 . . ."

The height of a trie is the number of nodes in the longest path from the root to an external node, and the length of any path from the root to an external node is bounded by the height of the trie. On average, the height of a trie is logarithmic for any square-integrable probability distribution (Devroye 1982); the same holds for a random uniform distribution (Regnier 1981). The average number of internal nodes inspected during a (un)successful search in a binary trie with n strings is log2 n + O(1). The average number of internal nodes is proportional to n (Knuth 1973).
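A trie for prefix searching over the suffixes of the example text can be sketched as follows. Unlike the trie of Figure 2.5, which stores a suffix as soon as it is alone in its subtrie, this simplified sketch inserts each suffix to its full length; positions are 1-based, as in the figure:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one child per symbol ('0' or '1' here)
        self.position = None  # starting position of a suffix ending here

def insert(root, suffix, pos):
    """Insert one suffix of the text, one symbol per trie level."""
    node = root
    for c in suffix:
        node = node.children.setdefault(c, TrieNode())
    node.position = pos

def prefix_search(root, prefix):
    """All starting positions of suffixes beginning with `prefix`."""
    node = root
    for c in prefix:                 # digital decomposition directs the search
        if c not in node.children:
            return []
        node = node.children[c]
    found, stack = [], [node]        # collect every position in this subtrie
    while stack:
        n = stack.pop()
        if n.position is not None:
            found.append(n.position)
        stack.extend(n.children.values())
    return sorted(found)

text = "01100100010111"
root = TrieNode()
for pos in range(1, 9):  # suffixes starting at positions 1..8, as in Figure 2.5
    insert(root, text[pos - 1:], pos)
print(prefix_search(root, "100"))  # [3, 6]
```

A prefix search inspects one node per symbol of the prefix, which is the logarithmic average search behavior quoted above.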
A Patricia tree (Morrison 1968) is a trie with the additional constraint that single-descendant nodes are eliminated. This name is an acronym for "Practical Algorithm To Retrieve Information Coded In Alphanumerical." A counter is kept in each node to indicate which is the next bit to inspect. Figure 2.6 shows the Patricia tree corresponding to the binary trie in Figure 2.5. The expected height of a Patricia tree is logarithmic (and at most the height of the binary trie); more precisely, it is log2 n + o(log2 n) (Pittel 1986).

Figure 2.6: Patricia tree (internal node label indicates bit number)

A trie built using the substrings (suffixes) of a string is also called a suffix tree (McCreight [1976] or Aho et al. [1974]). A variation of these are called position trees (Weiner 1973). Similarly, a Patricia tree built over the suffixes of a string is called a compact suffix tree. For n strings, such an index has n external nodes (the n positions of the text) and n - 1 internal nodes. Each internal node consists of a pair of pointers plus some counters. Thus, the space required is O(n). It is possible to build the index in time proportional to n times the height of the tree.

2.4 ALGORITHMS

It is hard to classify IR algorithms, and to draw a line between each type of application. However, we can identify three main types of algorithms, which are described below.
There are other algorithms used in IR that do not fall within our description, for example, user interface algorithms. The reason that they cannot be considered as IR algorithms is because they are inherent to any computer application.

2.4.1 Retrieval Algorithms

The main class of algorithms in IR is retrieval algorithms, that is, algorithms to extract information from a textual database. We can distinguish two types of retrieval algorithms, according to how much extra memory we need:

Sequential scanning of the text: extra memory is in the worst case a function of the query size, and not of the database size. On the other hand, the running time is at least proportional to the size of the text; for example, string searching (Chapter 10).

Indexed text: an "index" of the text is available, and can be used to speed up the search. The index size is usually proportional to the database size, and the search time is sublinear on the size of the text; for example, inverted files (Chapter 3) and signature files (Chapter 4).

Formally, we can describe a generic searching problem as follows: given a string t (the text), a regular expression q (the query), and information (optionally) obtained by preprocessing the pattern and/or the text, the problem consists of finding whether t belongs to Σ* q Σ* (q for short) and obtaining some or all of the following information:

1. The location where an occurrence (or specifically the first, the longest, etc.) of q exists. Formally, find a position m ≥ 0 such that t belongs to Σ^m q Σ*. For example, the first occurrence is defined as the least m that fulfills this condition.

2. The number of occurrences of the pattern in the text. Formally, the number of all possible values of m in the previous category.

3. All the locations where the pattern occurs (the set of all possible values of m).

In general, the complexities of these problems are different. We assume that ε is not a member of L(q); otherwise, the answer is trivial. Note that string matching is a particular case where q is a string. Algorithms to solve this problem are discussed in Chapter 10.

The efficiency of retrieval algorithms is very important, because we expect them to solve on-line queries with a short answer time. This need has triggered the implementation of retrieval algorithms in many different ways: by hardware, by parallel machines, and so on. These cases are explained in detail in Chapter 17 (algorithms by hardware) and Chapter 18 (parallel algorithms).

2.4.2 Filtering Algorithms

This class of algorithms is such that the text is the input and a processed or filtered version of the text is the output. This is a typical transformation in IR, for example, to reduce the size of a text, and/or standardize it to simplify searching.
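The three kinds of answers to the generic searching problem can be sketched as follows. The function name is hypothetical, and the query is given as a Python regular expression standing in for q; note that re.finditer reports non-overlapping occurrences only, a simplification of the formal definition:

```python
import re

def query_answers(text, q):
    """For the generic searching problem: the first position m where q
    occurs (or None), the number of occurrences, and all positions m."""
    positions = [mt.start() for mt in re.finditer(q, text)]
    first = positions[0] if positions else None
    return first, len(positions), positions

text = "the cat and the hat"
print(query_answers(text, "the"))     # (0, 2, [0, 12])
print(query_answers(text, "[ch]at"))  # (4, 2, [4, 16])
```

This is the sequential-scanning flavor of retrieval: no index, and running time proportional to the size of the text.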
The most common filtering/processing operations are:

Common words removed using a list of stopwords. This operation is discussed in Chapter 7.

Uppercase letters transformed to lowercase letters.

Special symbols removed and sequences of multiple spaces reduced to one space.

Numbers and dates transformed to a standard format (Gonnet 1987).

Spelling variants transformed using Soundex-like methods (Knuth 1973).

Word stemming (removing suffixes and/or prefixes). This is the topic of Chapter 8.

Automatic keyword extraction.

Word ranking.

Unfortunately, these filtering operations may also have some disadvantages. For example, after filtering it is not possible to search for common words, special symbols, or uppercase letters, nor to distinguish text fragments that have been mapped to the same internal form. Usually, the text is filtered before indexing, and any query must therefore be filtered as is the text, before consulting the database. Figure 2.7 shows the complete process for the text.

Figure 2.7: Text preprocessing

2.4.3 Indexing Algorithms

The usual meaning of indexing is to build a data structure that will allow quick searching of the text, as we mentioned previously. There are many classes of indices, based on different retrieval approaches. For example, we have inverted files (Chapter 3), signature files (Chapter 4), and tries (Chapter 5), as we have seen in the previous section. When every position of the text is indexed, such an index has n external nodes (the n positions of the text) and n - 1 internal nodes. Almost all types of indices are based on some kind of tree or hashing. Perhaps the main exceptions are clustered data structures (this kind of indexing is called clustering), which are covered in Chapter 16, and the Direct Acyclic Word Graph (DAWG) of the text, which represents all possible subwords of the text using a linear amount of space (Blumer et al. 1985) and is based on finite automata theory. The preprocessing time needed to build the index is amortized by using it in searches.
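The filtering operations listed earlier in this section can be sketched as a small pipeline. This is our own illustration, not code from the book; the tiny stoplist and the regular expression are illustrative assumptions:

```python
import re

STOPWORDS = {"the", "of", "a", "to"}   # a tiny illustrative stoplist

def filter_text(text: str) -> str:
    """Apply typical IR filtering steps: fold case, drop special symbols,
    collapse runs of whitespace, and remove common (stop) words."""
    text = text.lower()                        # uppercase -> lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special symbols
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                     # single spaces only
```

Note that a query must pass through the same filter before it is matched against the filtered text.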
For example, if building the index requires O(n log n) time, we would expect to query the database at least O(n) times to amortize the preprocessing cost. In that case, we add O(log n) preprocessing time to the total query time (which may also be logarithmic). Many special indices, and their building algorithms (some of them for parallel machines), are covered in this book.

REFERENCES

AHO, A., HOPCROFT, J., and ULLMAN, J. 1974. The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley.

BAEZA-YATES, R. 1989. "Expected Behaviour of B+-Trees under Random Insertions." Acta Informatica, 26(5), 439-72. Also as Research Report CS-86-67, Dept. of Computer Science, University of Waterloo, 1986.

BAEZA-YATES, R. 1990. "An Adaptive Overflow Technique for the B-tree," in Extending Data Base Technology Conference (EDBT 90), eds. F. Bancilhon, C. Thanos and D. Tsichritzis, Venice. Springer Verlag Lecture Notes in Computer Science 416, pp. 16-28.

BAEZA-YATES, R., and LARSON, P.-A. 1989. "Performance of B+-trees with Partial Expansions." IEEE Trans. on Knowledge and Data Engineering, 1, 248-57. Also as Research Report CS-87-04, University of Waterloo, 1987.

BAYER, R., and MCCREIGHT, E. 1972. "Organization and Maintenance of Large Ordered Indexes." Acta Informatica, 1(3), 173-89.

BAYER, R., and UNTERAUER, K. 1977. "Prefix B-trees." ACM TODS, 2(1), 11-26.

BELL, J., and KAMAN, C. 1970. "The Linear Quotient Hash Code." CACM, 13(11), 675-77.

BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., and SEIFERAS, J. 1985. "The Smallest Automaton Recognizing the Subwords of a Text." Theoretical Computer Science, 40, 31-55.

DE LA BRIANDAIS, R. 1959. "File Searching Using Variable Length Keys," in AFIPS Western JCC, San Francisco, Calif., pp. 295-98.

DEVROYE, L. 1982. "A Note on the Average Depth of Tries." Computing, 28, 367-71.

FAGIN, R., NIEVERGELT, J., PIPPENGER, N., and STRONG, H. 1979. "Extendible Hashing--a Fast Access Method for Dynamic Files." ACM TODS, 4(3), 315-44.

FALOUTSOS, C. 1987. Signature Files: An Integrated Access Method for Text and Attributes, Suitable for Optical Disk Storage. Technical Report CS-TR-1867, University of Maryland.

FREDKIN, E. 1960. "Trie Memory." CACM, 3, 490-99.

GONNET, G. 1987. "Extracting Information from a Text Database: An Example with Dates and Numerical Data," in Third Annual Conference of the UW Centre for the New Oxford English Dictionary, Waterloo, Canada, pp. 85-89.

GONNET, G., and BAEZA-YATES, R. 1991. Handbook of Algorithms and Data Structures--In Pascal and C (2nd ed.). Wokingham, U.K.: Addison-Wesley.

GUIBAS, L., and SZEMEREDI, E. 1978. "The Analysis of Double Hashing." JCSS, 16(2), 226-74.

HOPCROFT, J., and ULLMAN, J. 1979. Introduction to Automata Theory, Languages and Computation. Reading, Mass.: Addison-Wesley.

KNOTT, G. 1975. "Hashing Functions." Computer Journal, 18(3), 265-78.

KNUTH, D. 1973. The Art of Computer Programming, vol. 3: Sorting and Searching. Reading, Mass.: Addison-Wesley.

LARSON, P.-A. 1980. "Linear Hashing with Partial Expansions," in VLDB, vol. 6, Montreal, Canada, pp. 224-32.

LARSON, P.-A., and KAJLA, A. 1984. "File Organization: Implementation of a Method Guaranteeing Retrieval in One Access." CACM, 27(7), 670-77.

LITWIN, W. 1980. "Linear Hashing: A New Tool for File and Table Addressing," in VLDB, vol. 6, Montreal, Canada, pp. 212-23.

LITWIN, W., and LOMET, D. 1987. "A New Method for Fast Data Searches with Keys." IEEE Software, 4(2), 16-24.

LOMET, D. 1987. "Partial Expansions for File Organizations with an Index." ACM TODS, 12, 65-84. Also as tech report TR-86-06, Wang Institute.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." JACM, 23, 262-72.

MORRISON, D. 1968. "PATRICIA--Practical Algorithm to Retrieve Information Coded in Alphanumeric." JACM, 15(4), 514-34.

PETERSON, W. 1957. "Addressing for Random-Access Storage." IBM J. Res. Development, 1(4), 130-46.

PITTEL, B. 1986. "Paths in a Random Digital Tree: Limiting Distributions." Adv. Appl. Prob., 18, 139-55.

REGNIER, M. 1981. "On the Average Height of Trees in Digital Search and Dynamic Hashing." Inf. Proc. Letters, 13, 64-66.

ULLMAN, J. 1972. "A Note on the Efficiency of Hashing Functions." JACM, 19(3), 569-75.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in FOCS, vol. 14, pp. 1-11.

WILLIAMS, F. 1959. "Handling Identifiers as Internal Symbols in Language Processors." CACM, 2(6), 21-24.

YAO, A. 1979. "The Complexity of Pattern Matching for a Random String." SIAM J. Computing, 8, 368-87.
CHAPTER 3: INVERTED FILES

Donna Harman, National Institute of Standards and Technology, Gaithersburg, MD 20899

Edward Fox, Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0106

R. Baeza-Yates, Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile

W. Lee, Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0106

Abstract

This chapter presents a survey of the various structures (techniques) that can be used in building inverted files, and gives the details for producing an inverted file using sorted arrays. The chapter ends with two modifications to this basic method that are effective for large data collections.

3.1 INTRODUCTION

Three of the most commonly used file structures for information retrieval can be classified as lexicographical indices (indices that are sorted), clustered file structures, and indices based on hashing. One type of lexicographical index, the inverted file, is presented in this chapter, with a second type of lexicographical index, the Patricia (PAT) tree, discussed in Chapter 5. Clustered file structures are covered in Chapter 16, and indices based on hashing are covered in Chapter 13 and Chapter 4 (signature files).

The concept of the inverted file type of index is as follows. Assume a set of documents. Each document is assigned a list of keywords or attributes, with optional relevance weights associated with each keyword (attribute). An inverted file is then the sorted list (or index) of keywords (attributes), with each keyword having links to the documents containing that keyword (see Figure 3.1). This is the kind of index found in most commercial library systems. The use of an inverted file improves search efficiency by several orders of magnitude, a necessity for very large text files. The penalty paid for this efficiency is the need to store a data structure that ranges from 10 percent to 100 percent or more of the size of the text itself, and a need to update that index as the data set changes.

Figure 3.1: An inverted file implemented using a sorted array

Usually there are some restrictions imposed on these indices and consequently on later searches. Examples of these restrictions are:

a controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed, and hence are not searchable.

a list of stopwords (articles, prepositions, etc.) that for reasons of volume or precision and recall will not be included in the index, and hence are not searchable.

a set of rules that decide the beginning of a word or a piece of text that is indexable. These rules deal with the treatment of spaces, punctuation marks, or some standard prefixes, and may have significant impact on what terms are indexed.

a list of character sequences to be indexed (or not indexed). In large text databases, not all character sequences are indexed; for example, character sequences consisting of all numerics are often not indexed.

It should be noted that the restrictions that determine what is to be indexed are critical to later search effectiveness, and therefore these rules should be carefully constructed and evaluated. This problem is further discussed in Chapter 7.

A search in an inverted file is the composition of two searching algorithms: a search for a keyword (attribute), which returns an index, and then a possible search on that index for a particular attribute value. The result of a search on an inverted file is a set of records (or pointers to records).

This chapter is organized as follows. The next section presents a survey of the various implementation structures for inverted files. The third section covers the complete implementation of an algorithm for building an inverted file that is stored as a sorted array, and the fourth section shows two variations on this implementation: one that uses no sorting (and hence needs little working storage) and one that increases efficiency by making extensive use of primary memory. The final section summarizes the chapter.
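To make the concept concrete, the following small sketch (our own illustration, not code from this chapter; the function name and toy documents are assumptions) builds such a sorted keyword-to-documents index in memory:

```python
def build_inverted_file(docs):
    """Given {doc_id: text}, build a sorted list of keywords, each keyword
    linked to the documents containing it -- the inverted file concept."""
    index = {}
    for doc_id in sorted(docs):               # documents in order
        for word in set(docs[doc_id].split()):  # each keyword once per doc
            index.setdefault(word, []).append(doc_id)
    return dict(sorted(index.items()))        # keywords in sorted order
```

A search for a keyword then returns the set of documents (record pointers) directly, e.g. `build_inverted_file({1: "cold days", 2: "hot and cold"})["cold"]` yields `[1, 2]`.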
3.2 STRUCTURES USED IN INVERTED FILES

There are several structures that can be used in implementing inverted files: sorted arrays, B-trees, tries, and various hashing structures, or combinations of these structures. The first three of these structures are sorted (lexicographically) indices and can efficiently support range queries, such as all documents having keywords that start with "comput." Only these three structures will be further discussed in this chapter. (For more on hashing methods, see Chapters 4 and 13.)

3.2.1 The Sorted Array

An inverted file implemented as a sorted array structure stores the list of keywords in a sorted array, including the number of documents associated with each keyword and a link to the documents containing that keyword. This array is commonly searched using a standard binary search, although large secondary-storage-based systems will often adapt the array (and its search) to the characteristics of their secondary storage. The main disadvantage of this approach is that updating the index (for example, appending a new keyword) is expensive. On the other hand, sorted arrays are easy to implement and are reasonably fast. (For this reason, the details of creating a sorted array inverted file are given in section 3.3.)

3.2.2 B-trees

Another implementation structure for an inverted file is a B-tree. More details of B-trees can be found in Chapter 2, and also in a recent paper (Cutting and Pedersen 1990) on efficient inverted files for dynamic data (data that is heavily updated). A special case of the B-tree, the prefix B-tree, uses prefixes of words as primary keys in a B-tree index (Bayer and Unterauer 1977) and is particularly suitable for storage of textual indices. Each internal node has a variable number of keys. Each key is the shortest word (in length) that distinguishes the keys stored in the next level. The key does not need to be a prefix of an actual term in the index. The last level or leaf level stores the keywords themselves, along with their associated data (see Figure 3.2). Because the internal node keys and their lengths depend on the set of keywords, the order (size) of each node of the prefix B-tree is variable. Updates are done similarly to those for a B-tree to maintain a balanced tree. The prefix B-tree method breaks down if there are many words with the same (long) prefix. In this case, common prefixes should be further divided to avoid wasting space.

Figure 3.2: A prefix B-tree

Compared with sorted arrays, B-trees use more space. However, updates are much easier and the search time is generally faster, especially if secondary storage is used for the inverted file (instead of memory). The implementation of inverted files using B-trees is more complex than using sorted arrays, and therefore readers are referred to Knuth (1973) and Cutting and Pedersen (1990) for details of implementation of B-trees, and to Bayer and Unterauer (1977) for details of implementation of prefix B-trees. An additional source for tested and optimized code for B-trees and tries is Gonnet and Baeza-Yates (1991).

3.2.3 Tries

Inverted files can also be implemented using a trie structure (see Chapter 2 for more on tries). This structure uses the digital decomposition of the set of keywords to represent those keywords. A special trie structure, the Patricia (PAT) tree, is especially useful in information retrieval and is described in detail in Chapter 5.

3.3 BUILDING AN INVERTED FILE USING A SORTED ARRAY

The production of sorted array inverted files can be divided into two or three sequential steps, as shown in Figure 3.3. First, the input text must be parsed into a list of words along with their location in the text. This is usually the most time consuming and storage consuming operation in indexing. Second, this list must then be inverted, from a list of terms in location order to a list of terms ordered for use in searching (sorted into alphabetical order, with a list of all locations attached to each term). An optional third step is the postprocessing of these inverted files, such as for adding term weights, or for reorganizing or compressing the files.

Figure 3.3: Overall schematic of sorted array inverted file creation

Creating the initial word list requires several different operations. First, the individual words must be recognized from the text. Each word is then checked against a stoplist of common words, and if it can be considered a noncommon word, may be passed through a stemming algorithm. The resultant stem is then recorded in the word-within-location list. The parsing operation and the use of a stoplist are described in Chapter 7, and the stemming operation is described in Chapter 8.

The word list resulting from the parsing operation (typically stored as a disk file) is then inverted. This is usually done by sorting on the word (or stem), with duplicates retained (see Figure 3.4). Even with the use of high-speed sorting utilities, this sort can be time consuming for large data sets (on the order of n log n). One way to handle this problem is to break the data sets into smaller pieces, process each piece, and then correctly merge the results. Methods that do not use sorting are given in section 3.4.
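The parse-then-sort inversion can be sketched as follows (our illustration, not the chapter's code; record numbers stand in for full text locations, and duplicates are retained as described above):

```python
def parse_and_invert(records):
    """Step 1: parse each record into (word, record-number) pairs in text
    order.  Step 2: sort by word, keeping duplicates -- the inversion itself,
    turning a location-ordered list into an alphabetically ordered one."""
    pairs = [(w.lower(), rec) for rec in sorted(records)
             for w in records[rec].split()]
    pairs.sort()   # O(n log n) -- the costly step for large data sets
    return pairs
```

For large collections this sort is exactly the step that is broken into pieces and merged, or avoided entirely by the methods of section 3.4.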
After sorting, the duplicates are merged to produce within-document frequency statistics. (A system not using within-document frequencies can just sort with duplicates removed.) Note that although only record numbers are shown as locations in Figure 3.4, typically inverted files store field locations and possibly even word locations. These additional locations are needed for field and proximity searching in Boolean operations and cause higher inverted file storage overhead than if only record locations were needed. Inverted files for ranking retrieval systems (see Chapter 14) usually store only record locations and term weights or frequencies.

Figure 3.4: Inversion of word list

Although an inverted file could be used directly by the search routine, it is usually processed into an improved final format. This format is based on the search methods and the (optional) weighting methods used; the actual form depends on the details of the search routine and on the hardware being used. A common search technique is to use a binary search routine on the file to locate the query words. This implies that the file to be searched should be as short as possible, and for this reason the single file shown containing the terms, locations, and (possibly) frequencies is usually split into two pieces. The first piece is the dictionary containing the term, statistics about that term such as number of postings, and a pointer to the location of the postings file for that term. The second piece is the postings file itself, which contains the record numbers (plus other necessary location information) and the (optional) weights for all occurrences of the term. In this manner, the dictionary used in the binary search has only one "line" per unique term. Figure 3.5 illustrates the conceptual form of the necessary files.

Figure 3.5: Dictionary and postings file from the last example

Work using large data sets (Harman and Candela 1990) showed that for a file of 2,653 records, there were 5,123 unique terms with an average of 14 postings/term and a maximum of over 2,000 postings for a term. A larger data set of 38,304 records had dictionaries on the order of 250,000 lines (250,000 unique terms, including some numbers) and an average of 88 postings per record. From these numbers it is clear that efficient storage structures for both the binary search and the reading of the postings are critical.

3.4 MODIFICATIONS TO THE BASIC TECHNIQUE

Two different techniques are presented as improvements on the basic inverted file creation discussed in section 3.3. The first technique is for working with very large data sets using secondary storage. The second technique uses multiple memory loads for inverting files.
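The dictionary-and-postings layout described above can be sketched as follows (our illustration, not the chapter's code; postings here are (record number, frequency) pairs, and the offset plays the role of the pointer into the postings file):

```python
from bisect import bisect_left

def split_index(postings_by_term):
    """Split an inverted file into a dictionary -- one line per unique term,
    with its posting count and an offset into the postings file -- and one
    sequential postings file of (record number, frequency) entries."""
    terms, dictionary, postings = [], [], []
    for term in sorted(postings_by_term):
        plist = postings_by_term[term]
        terms.append(term)
        dictionary.append((term, len(plist), len(postings)))  # offset = start
        postings.extend(plist)
    return terms, dictionary, postings

def lookup(term, terms, dictionary, postings):
    """Binary-search the short dictionary, then read only that term's slice."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    _, count, offset = dictionary[i]
    return postings[offset:offset + count]
```

Keeping the binary-searched file to one short line per unique term is what makes the dictionary cheap to search even when the postings run to millions of entries.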
3.4.1 Producing an Inverted File for Large Data Sets without Sorting

Indexing large data sets using the basic inverted file method presents several problems. Most computers cannot sort the very large disk files needed to hold the initial word list within a reasonable time frame, and do not have the amount of storage necessary to hold a sorted and an unsorted version of that word list, plus the intermediate files involved in the internal sort. Whereas the data set could be broken into smaller pieces for processing, and the resulting files properly merged, the following technique may be considerably faster. (For another approach to sorting large amounts of data, see Chapter 5.)

The new indexing method (Harman and Candela 1990) is a two-step process that does not need the middle sorting step. The first step produces the initial inverted file, and the second step adds the term weights to that file and reorganizes the file for maximum efficiency (see Figure 3.6).

Figure 3.6: Flowchart of new indexing method

The creation of the initial inverted file avoids the use of an explicit sort by using a right-threaded binary tree (Knuth 1973). The data contained in each binary tree node is the current number of term postings and the storage location of the postings list for that term. As each term is identified by the text parsing program, it is looked up in the binary tree, and either is added to the tree, along with its related data, or causes tree data to be updated. The postings are stored as multiple linked lists, one variable-length linked list for each term, with the lists stored in one large file. Each element in the linked postings file consists of a record number (the location of a given term), the term frequency in that record, and a pointer to the next element in the linked list for that given term. By storing the postings in a single file, no storage is wasted, and the files are easily accessed by following the links. Note that both the binary tree and the linked postings list are capable of further growth. This is important in indexing large data sets where data is usually processed from multiple separate files over a short period of time.

The use of the binary tree and linked postings list could be considered as an updatable inverted file. However, see the earlier discussion of B-trees for better ways of producing updatable inverted files. Although these structures are not as efficient to search, this method could be used for creating and storing supplemental indices for use between updates to the primary index. For small data sets, this technique carries a significant overhead and therefore should not be used.

The binary tree and linked postings lists are saved for use by the term weighting routine (step two). This routine walks the binary tree and the linked postings list to create an alphabetical term list (dictionary) and a sequentially stored postings file. To do this, each term is consecutively read from the binary tree (this automatically puts the list in alphabetical order), along with its related data. A new sequentially stored postings file is allocated, with two elements per posting. The linked postings list is then traversed, with the frequencies being used to calculate the term weights (if desired). The last step writes the record numbers and corresponding term weights to the newly created sequential postings file. As the location of both the head and tail of each linked list is stored in the binary tree, the entire list does not need to be read for each addition, but only once for use in creating the final postings file (step two). These sequentially stored postings files could not be created in step one because the number of postings is unknown at that point in processing.

The final index files therefore consist of the same dictionary and sequential postings file as for the basic inverted file described in section 3.3. As this index was built for a ranking retrieval system (see Chapter 14), each posting contains both a record id number and the term's weight in that record. The small size of the final index is caused by storing only the record identification number as location.

Table 3.1 gives some statistics showing the differences between an older indexing scheme and the new indexing scheme. The old indexing scheme refers to the indexing method discussed in section 3.3, in which records are parsed into a list of words within record locations, the list is inverted by sorting, and finally the term weights are added. As the size of the database increases, the processing time has an n log n relationship to the size of the database.

Table 3.1: Indexing Statistics

        Text Size     Indexing Time   Working Storage   Index Storage
        (megabytes)   (hours)         (megabytes)       (megabytes)
---------------------------------------------------------------------
old     1.6           0.25            8                 0.4
new     1.6           0.50            0.70              0.4
old     50            4.4             132               4
new     50            10.5            6                 4
old     359           --              --                52
new     359           137             70                52
old     806           --              --                112
new     806           313             163               112

Note that the size of the final index is small: only 8 percent of the input text size for the 50 megabyte database, and around 14 percent of the input text size for the larger databases. This size remains constant when using the new indexing method, as the format of the final indexing files is unchanged. The working storage (the storage needed to build the index) for the new indexing method is not much larger than the size of the final index itself, and substantially smaller than the size of the input text. By contrast, the amount of working storage needed by the older indexing method would have been approximately 933 megabytes for the 359 megabyte database, and over 2 gigabytes for the 806 megabyte database, an amount of storage beyond the capacity of many environments. The new method takes more time for the very small (1.6 megabyte) database because of its additional processing overhead. The older method contains a sort (not optimal), which is n log n (best case) to n squared (worst case).
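The two-step method of this section can be sketched as follows. This is a loose illustration, not the published code: a Python dict stands in for the right-threaded binary tree, a list of (record, frequency, next) triples stands in for the single linked postings file, and weights are left as raw frequencies:

```python
def index_without_sorting(records):
    """Step one: grow a term tree (dict here) whose nodes hold the posting
    count plus head and tail pointers into one linked postings file.
    Step two: walk terms in order and copy each linked list into a
    sequential postings file, yielding the usual dictionary/postings pair."""
    links = []   # linked postings file: [record, frequency, next-index]
    tree = {}    # term -> [posting count, head index, tail index]
    for rec in sorted(records):
        for term in records[rec]:
            if term not in tree:
                links.append([rec, 1, -1])
                tree[term] = [1, len(links) - 1, len(links) - 1]
            else:
                count, head, tail = tree[term]
                if links[tail][0] == rec:
                    links[tail][1] += 1            # same record: bump frequency
                else:
                    links.append([rec, 1, -1])
                    links[tail][2] = len(links) - 1  # extend via tail pointer
                    tree[term] = [count + 1, head, len(links) - 1]
    # Step two: traversal of the tree is already alphabetical here.
    dictionary, postings = [], []
    for term in sorted(tree):
        node = tree[term][1]
        start = len(postings)
        while node != -1:
            postings.append((links[node][0], links[node][1]))
            node = links[node][2]
        dictionary.append((term, len(postings) - start, start))
    return dictionary, postings
```

Storing the tail pointer is what lets a new posting be appended without re-reading the whole linked list, just as described above.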
This makes processing of the very large databases likely to have taken longer using the older method.

3.4.2 A Fast Inversion Algorithm

The second technique to produce a sorted array inverted file is a fast inversion algorithm called FAST-INV (Copyright © Edward A. Fox, Whay C. Lee, Virginia Tech). The following summary of this technique is adapted from a technical report by Fox and Lee (1991). This technique takes advantage of two principles: the large primary memories available on today's computers and the inherent order of the input data.

The first principle is important since personal computers with more than 1 megabyte of primary memory are common, and mainframes may have more than 100 megabytes of memory. Even if databases are on the order of 1 gigabyte, if they can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.

The second principle is crucial since with large files it is very expensive to use polynomial or even n log n sorting algorithms. These costs are further compounded if memory is not used, since then the cost is for disk operations.

The FAST-INV algorithm follows these two principles, using primary memory in a close to optimal fashion, and processing the data in three passes. The overall scheme can be seen in Figure 3.7.

Figure 3.7: Overall scheme of FAST-INV

The input to FAST-INV is a document vector file containing the concept vectors for each document in the collection to be indexed. A sample document vector file can be seen in Figure 3.8. The document numbers appear in the left-hand column and the concept numbers of the words in each document appear in the right-hand column. This is similar to the initial word list shown in Figure 3.4 for the basic method, except that the words are represented by concept numbers, one concept number for each unique word in the collection (i.e., 250,000 unique words implies 250,000 unique concept numbers). Note however that the document vector file is in sorted order, so that concept numbers are sorted within document numbers, and document numbers are sorted within the collection. This is necessary for FAST-INV to work correctly.

Figure 3.8: Sample document vector

Preparation
In order to better explain the FAST-INV algorithm, some definitions are needed:

HCN = highest concept number in dictionary
L = number of document/concept (or concept/document) pairs in the collection
M = available primary memory size, in bytes

It is assumed that M >> HCN, but that M < L, so several primary memory loads will be needed to process the document data. Therefore, it is appropriate to see if this data can be somehow transformed beforehand into j parts such that:

L / j < M, so that each part will fit into primary memory
HCN / j concepts, approximately, are associated with each part

This allows each of the j parts to be read into primary memory, inverted there, and the output to be simply appended to the (partially built) final inverted file.

Specifically, preparation involves the following:

1. Allocate an array, con_entries_cnt, of size HCN, initialized to zero.

2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#].

3. Use the just constructed con_entries_cnt to create a disk version of CONPTR.

4. Initialize the load table.

5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

After one pass through the input, the CONPTR file has been built and the load table needed in later steps of the algorithm has been constructed. Because entries in the document vector file will already be grouped by concept number, with those concepts in ascending order, these two files can be built in primary memory. Note that the test for room in a given load enforces the constraint that data for a load will fit into available memory. Specifically:

Let LL = length of current load (i.e., number of concept/weight pairs)
S = spread of concept numbers in the current load (i.e., end concept - start concept + 1)
8 bytes = space needed for each concept/weight pair
4 bytes = space needed for each concept to store a count of postings for it

Then the constraint that must be met for another concept to be added to the current load is

8 * LL + 4 * S < M

Splitting document vector file

The load table indicates the range of concepts that should be processed for each primary memory load. There are two approaches to handling the multiplicity of loads. One approach, which is currently used, is to make a pass through the document vector file to obtain the input for each load. This has the advantage of not requiring additional storage space (though that can be obviated through the use of magnetic tapes), but has the disadvantage of requiring expensive disk I/O.

The second approach is to build a new copy of the document vector collection, with the desired separation into loads. This can easily be done using the load table, since the sizes of each load are known. As each document vector is read, it is separated into parts for each range of concepts in the load table, and those parts are appended to the end of the corresponding section of the output document collection file. With I/O buffering, the expense of this operation is proportional to the size of the files, and essentially costs the same as copying the file.

Inverting each load

When a load is to be processed, the appropriate section of the CONPTR file is needed. An output array of size equal to the input document vector file subset is needed. As each document vector is processed, the CONPTR data allows the input to be directly mapped to the output, in one pass through the input, without any sorting. At the end of the input load the newly constructed output is appended to the inverted file.
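The load-table construction and the sort-free inversion of one load can be sketched as follows (our illustration under the 8-byte-per-pair and 4-byte-per-concept assumptions above, not the FAST-INV source; weights are omitted and plain (doc, con) pairs are used):

```python
def make_load_table(con_entries_cnt, memory):
    """Walk concepts in ascending order, starting a new load whenever adding
    the next concept would violate 8*LL + 4*S < memory.  Returns a list of
    (start concept, end concept) ranges -- the load table."""
    loads, cur = [], None
    for con in sorted(con_entries_cnt):
        cnt = con_entries_cnt[con]
        if cur is not None:
            ll = cur[2] + cnt              # length if this concept is added
            spread = con - cur[0] + 1      # spread of concept numbers
            if 8 * ll + 4 * spread < memory:
                cur[1], cur[2] = con, ll   # room: extend the current load
                continue
            loads.append((cur[0], cur[1]))
        cur = [con, con, cnt]              # start a new load
    if cur is not None:
        loads.append((cur[0], cur[1]))
    return loads

def invert_load(pairs, start, end):
    """Invert one load without sorting: counts give each concept its offset
    (the CONPTR idea), then every (doc, con) pair is placed directly."""
    counts = {}
    for doc, con in pairs:
        if start <= con <= end:
            counts[con] = counts.get(con, 0) + 1
    offsets, total = {}, 0
    for con in sorted(counts):             # offsets in concept order
        offsets[con] = total
        total += counts[con]
    out = [None] * total
    for doc, con in pairs:                 # document order kept per concept
        if start <= con <= end:
            out[offsets[con]] = (con, doc)
            offsets[con] += 1
    return out
```

Each placement is a single array write at a precomputed offset, which is why the inversion of a load is linear in its size rather than n log n.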
the offset (previously recorded in CONPTR) for a given concept is used to place the corresponding document/weight entry.start concept + 1 ) 8 bytes = space needed for each concept/weight pair 4 bytes = space needed for each concept to store count of postings for it Then the constraint that must be met for another concept to be added to the current load is 8 * LL + 4 * S < M Splitting document vector file The load table indicates the range of concepts that should be processed for each primary memory load.htm (10 of 12)7/3/2004 4:19:30 PM .
The appropriate section of the CONPTR file is used so that inversion is simply a copying of data to the correct place. Three loads will be needed. and HCN is 14. it can be seen that FAST-INV is a linear algorithm in the size of the input file.. can be file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. Table 3. including the various loading experiments made during testing. There are three parts.684 document/8. for concepts in ranges 1-4. There are 10 distinct concepts. rather than a sort.9: A FAST-INV example The second phase of processing uses the load table to split the input document vector files and create the document vector loads files. it should be obvious that FAST-INV should perform well in comparison to other inversion methods. The final phase of processing involves inverting each part of the document vectors loads file. and 12-14. Processing in primary memory is limited to scans through the input. corresponding to the three loads. with appropriate (inexpensive) computation required on each entry. Thus. It can be seen that the document vectors in each load are shortened since only those concepts in the allowable range for the load are included.)[:sec. a variety of tests were made. and appending the result to the inverted file.2 summarizes the results for indexing the 12. using primary memory. 5-11.ooks_Algorithms_Collection2ed/books/book5/chap03.Information Retrieval: CHAPTER 3: INVERTED FILES load table. Figure 3. To demonstrate this fact. such as those used in other systems such as SIRE and SMART. The input disk files must be read three times.] for SIRE.68 megabyte INSPEC collection. Table 3.htm (11 of 12)7/3/2004 4:19:30 PM ..2: Real Time Requirements (min. and written twice (using the second splitting scheme). Performance results Based on the discussion above. 
Method     Comments                            Indexing   Inversion
-------------------------------------------------------------------
SIRE       Dictionary built during inversion      35          72
SMART      Dictionary built during indexing       49          11
FAST-INV   Dictionary built during indexing       49         1:14

More details of these results, including the various loading experiments made during testing, can be found in the full technical report (Fox and Lee 1991).
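The first phase of FAST-INV, building the load table from the per-concept posting counts, can be sketched as follows. This is illustrative Python, not the authors' code; the function and variable names are invented here, and the greedy loop simply extends the current load while the constraint 8 * LL + 4 * S < M still holds.

```python
# Sketch of FAST-INV load-table construction (illustrative, not the authors'
# code). postings_per_concept plays the role of the counts kept in CONPTR.
def build_load_table(postings_per_concept, M):
    """postings_per_concept: dict {concept_number: postings count};
    M: available primary memory in bytes.  Returns (start, end) concept ranges."""
    loads = []
    concepts = sorted(postings_per_concept)
    start = concepts[0]
    LL = 0                             # concept/weight pairs in current load
    for c in concepts:
        new_LL = LL + postings_per_concept[c]
        S = c - start + 1              # spread: end concept - start concept + 1
        if LL > 0 and 8 * new_LL + 4 * S >= M:
            loads.append((start, c - 1))   # adding c would violate the constraint
            start, LL = c, postings_per_concept[c]
        else:
            LL = new_LL
    loads.append((start, concepts[-1]))
    return loads
```

With uniform posting counts the loads come out evenly sized; with skewed counts (as in the INSPEC-style example above) the ranges vary, as in the 1-4, 5-11, 12-14 split.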
3.5 SUMMARY

This chapter has presented a survey of inverted file structures and detailed the actual building of a basic sorted array type of inverted file. Two modifications that are effective for indexing large data collections were also presented.

REFERENCES

BAYER, R., and K. UNTERAUER. 1977. "Prefix B-Trees." ACM Transactions on Database Systems, 2(1), 11-26.

CUTTING, D., and J. PEDERSEN. 1990. "Optimizations for Dynamic Inverted Index Maintenance." Paper presented at the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium.

FOX, E., and W. LEE. 1991. "FAST-INV: A Fast Algorithm for Building Large Inverted Files." Technical Report TR-91-10, VPI&SU Department of Computer Science, Blacksburg, Va. 24061-0106, March 1991.

GONNET, G. H., and R. BAEZA-YATES. 1991. Handbook of Algorithms and Data Structures (2nd ed.). Reading, Mass.: Addison-Wesley.

HARMAN, D., and G. CANDELA. 1990. "Retrieving Records from a Gigabyte of Text on a Minicomputer using Statistical Ranking." Journal of the American Society for Information Science, 41(8), 581-89.

KNUTH, DONALD E. 1973. The Art of Computer Programming, Vol. 3. Reading, Mass.: Addison-Wesley.
CHAPTER 4: SIGNATURE FILES

Christos Faloutsos
University of Maryland, College Park and UMIACS

Abstract

This chapter presents a survey and discussion on signature-based text retrieval methods. It describes the main idea behind the signature approach and its advantages over other text retrieval methods, it provides a classification of the signature methods that have appeared in the literature, it describes the main representatives of each class, together with their relative advantages and drawbacks, and it gives a list of applications as well as commercial or university prototypes that use the signature approach.

4.1 INTRODUCTION

Text retrieval methods have attracted much interest recently (Christodoulakis et al. 1986; Gonnet 1982; Gonnet and Tompa 1987; Stanfill and Kahle 1986; Tsichritzis and Christodoulakis 1983). There are numerous applications involving storage and retrieval of textual data:

- Electronic office filing (Tsichritzis and Christodoulakis 1983; Christodoulakis et al. 1986).

- Computerized libraries. For example, the U.S. Library of Congress has been pursuing the "Optical Disk Pilot Program" (Price 1984; Nofel 1986), where the goal is to digitize the documents and store them on an optical disk. A similar project is carried out at the National Library of Medicine (Thoma et al. 1985).

- Automated law (Hollaar 1979) and patent offices (Hollaar et al. 1983). The U.S. Patent and Trademark Office has been examining electronic storage and retrieval of the recent patents on a system of 200 optical disks.

- Electronic encyclopedias (Ewing et al. 1986; Gonnet and Tompa 1987).

- Electronic storage and retrieval of articles from newspapers and magazines.

- Consumers' databases, which contain descriptions of products in natural language.
- Searching image databases, where the images are manually annotated (Christodoulakis et al. 1986).

- Indexing of software components to enhance reusability (Standish 1984).

- Searching databases with descriptions of DNA molecules (Lipman and Pearson 1985).

- A similar approach could be used to search a database with animations, if scripted animation is used (Lewis 1989).

The main operational characteristics of all the above applications are the following two:

1. Text databases are traditionally large.

2. Text databases have archival nature: there are insertions in them, but almost never deletions and updates.

Text retrieval methods form the following large classes (Faloutsos 1985): full text scanning, inversion, and signature files, on which we shall focus next. Signature files are based on the idea of the inexact filter: They provide a quick test, which discards many of the nonqualifying items. The qualifying items definitely pass the test; some additional items ("false hits" or "false drops") may also pass it accidentally.

The signature file approach works as follows: The documents are stored sequentially in the "text file." Their "signatures" (hash-coded bit patterns) are stored in the "signature file." When a query arrives, the signature file is scanned and many nonqualifying documents are discarded. The rest are either checked (so that the "false drops" are discarded) or they are returned to the user as they are.

A brief, qualitative comparison of the signature-based methods versus their competitors is as follows: The signature-based methods are much faster than full text scanning (1 or 2 orders of magnitude faster, depending on the individual method). On the other hand, signature files may be slow for large databases, precisely because their response time is linear in the number of items N in the database. Compared to inversion, they require a modest space overhead (typically 10-15% [Christodoulakis and Faloutsos 1984], as opposed to the 50-300% that inversion requires [Haskin 1981]); moreover, they can handle insertions more easily than inversion, because they need "append-only" operations--no reorganization or rewriting of any portion of the signatures. Methods requiring "append-only" insertions have the following advantages: (a) increased concurrency during insertions (the readers may continue consulting the old portion of the index structure while an insertion takes place); (b) these methods work well on Write-Once-Read-Many (WORM) optical disks, which constitute an excellent archival medium (Fujitani 1984; Christodoulakis 1987). Thus, signature files have been used in the following environments:
1. PC-based, medium-size databases
2. WORMs
3. parallel machines
4. distributed text databases

This chapter is organized as follows: In section 4.2 we present the basic concepts in signature files and superimposed coding. In section 4.3 we discuss methods based on compression. In section 4.4 we discuss methods based on vertical partitioning of the signature file. In section 4.5 we discuss methods that use both vertical partitioning and compression. In section 4.6 we present methods that are based on horizontal partitioning of the signature file. In section 4.7 we give the conclusions.

4.2 BASIC CONCEPTS

Signature files typically use superimposed coding (Mooers 1949) to create the signature of a document. A brief description of the method follows; more details are in Faloutsos (1985). For performance reasons, which will be explained later, each document is divided into "logical blocks," that is, pieces of text that contain a constant number D of distinct, noncommon words. (To improve the space overhead, a stoplist of common words is maintained.) Each such word yields a "word signature," which is a bit pattern of size F, with m bits set to "1" and the rest "0" (see Figure 4.1). F and m are design parameters. The m bit positions to be set to "1" by each word are decided by hash functions. The word signatures are OR'ed together to form the block signature. Block signatures are concatenated to form the document signature. Searching for a word is handled by creating the signature of the word and by examining each block signature for "1"s in those bit positions where the signature of the search word has a "1".

Word     Signature
-----------------------------------
free     001 000 110 010
text     000 010 101 001
-----------------------------------
block signature   001 010 111 011
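The scheme just described can be sketched in a few lines of Python. This is an illustrative sketch, not the book's code: the use of MD5 as the hash function and the helper names are assumptions made here.

```python
# Minimal sketch of superimposed coding (illustrative; hash choice is an
# assumption).  Each word sets m bit positions in an F-bit block signature.
import hashlib

F, m = 12, 4   # design parameters, as in Figure 4.1

def word_signature(word):
    """Set m (not necessarily distinct) bit positions chosen by hashing."""
    sig = 0
    for i in range(m):
        digest = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << int.from_bytes(digest, "big") % F
    return sig

def block_signature(words):
    sig = 0
    for w in words:                  # OR the word signatures together
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, word):
    """The signature test: every "1" bit of the query word must be set."""
    ws = word_signature(word)
    return block_sig & ws == ws
```

Blocks that do contain the word always pass (no false dismissals); an unrelated word occasionally passes as well, which is exactly the false drop discussed next.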
Figure 4.1: Illustration of the superimposed coding method. It is assumed that each logical block consists of D = 2 words only. The signature size F is 12 bits; m = 4 bits per word.

An important concept in signature files is the false drop probability Fd. Intuitively, it gives the probability that the signature test will fail, creating a "false alarm" (or "false hit" or "false drop"). Notice that the signature test never gives a false dismissal.

DEFINITION: The false drop probability, Fd, is the probability that a block signature seems to qualify, given that the block does not actually qualify. Expressed mathematically:

Fd = Prob{signature qualifies | block does not}

The signature file is an F x N binary matrix. Previous analysis showed that, for a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix contains "1"s with probability 50 percent. Under such an optimal design, we have (Stiassny 1960):

Fd = 2^(-m)
F ln2 = m D

This is the reason that documents have to be divided into logical blocks: Without logical blocks, a long document would have a signature full of "1"s, and it would always create a false drop. To avoid unnecessary complications, for the rest of the discussion we assume that all the documents span exactly one logical block.

In order to allow searching for parts of words, the following method has been suggested (Faloutsos and Christodoulakis 1984): Each word is divided into successive, overlapping triplets (e.g., "fr", "fre", "ree", "ee" for the word "free"). Each such triplet is hashed to a bit position by applying a hashing function on a numerical encoding of the triplet, for example, considering the triplet as a base-26 number. In the case of a word that has l triplets, with l > m, the word is allowed to set l (nondistinct) bits. If l < m, the additional bits are set using a random number generator, initialized with a numerical encoding of the word.

Table 4.1: Symbols and definitions

Symbol    Definition
--------------------------------------------------------
F         signature size in bits
m         number of bits per word
D         number of distinct noncommon words per document
Fd        false drop probability

The most straightforward way to store the signature matrix is to store the rows sequentially. For the rest of this work, the above method will be called SSF, for Sequential Signature File. Figure 4.2 illustrates the file structure used: In addition to the text file and the signature file, we need the so-called "pointer file," with pointers to the beginnings of the logical blocks (or, alternatively, to the beginnings of the documents).

Figure 4.2: File structure for SSF

Although SSF has been used as is, it may be slow for large databases. Many methods have been suggested that try to improve the response time of SSF, trading off space or insertion simplicity for speed. The main ideas behind all these methods are the following:

1. Compression: If the signature matrix is deliberately sparse, it can be compressed.

2. Vertical partitioning: Storing the signature matrix columnwise improves the response time at the expense of insertion time.

3. Horizontal partitioning: Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.

The methods we shall examine form the classes shown in Figure 4.3. For each of these classes we shall describe the main idea, the main representatives, and the available performance results, discussing mainly the storage overhead, the response time on single word queries, and the performance on insertion, as well as whether the insertion maintains the "append-only" property.

Sequential storage of the signature matrix
    without compression
        sequential signature files (SSF)
    with compression
        bit-block compression (BC)
        variable bit-block compression (VBC)

Vertical partitioning
    without compression
        bit-sliced signature files (BSSF, B'SSF)
        frame-sliced (FSSF)
        generalized frame-sliced (GFSSF)
    with compression
        compressed bit slices (CBS)
        doubly compressed bit slices (DCBS)
        no-false-drop method (NFD)

Horizontal partitioning
    data independent partitioning
        Gustafson's method
        partitioned signature files
    data dependent partitioning
        2-level signature files
        S-trees

Figure 4.3: Classification of the signature-based methods

4.3 COMPRESSION

In this section we examine a family of methods suggested in Faloutsos and Christodoulakis (1987). These methods create sparse document signatures on purpose, and then compress them before storing them sequentially. Analysis in that paper showed that, whenever compression is applied, the best value for m is 1; also, it was shown that the resulting methods achieve better false drop probability than SSF for the same space overhead.

The idea in these methods is that we use a (large) bit vector of B bits and we hash each word into one (or perhaps more, say n) bit position(s), which are set to "1" (see Figure 4.4). The resulting bit vector will be sparse and therefore it can be compressed.

Word          Signature
---------------------------------------------
data          0000 0000 0000 0010 0000
base          0000 0001 0000 0000 0000
management    0000 1000 0000 0000 0000
system        0000 0000 0000 0000 1000
---------------------------------------------
block signature   0000 1001 0000 0010 1000

Figure 4.4: Illustration of the compression-based methods, with B = 20 and n = 1 bit per word

The spacewise best compression method is based on run-length encoding (McIlroy 1982), using the approach of "infinite Huffman codes" (Golomb 1966; Gallager and van Voorhis 1975). However, searching becomes slow: To determine whether a bit is "1" in the sparse vector, the encoded lengths of all the preceding intervals (runs) have to be decoded and summed (see Figure 4.5).
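The run-length idea, and why probing a compressed slice is slow, can be sketched as follows. This is illustrative Python, not the paper's code; the encoding of the run lengths themselves (the [x] codes of Figure 4.5) is omitted and the runs are kept as plain integers.

```python
# Sketch of run-length compression of a sparse bit vector (illustrative).
# Store, for each "1", the number of "0"s that precede it.
def compress(bits):
    runs, gap = [], 0
    for b in bits:
        if b == 0:
            gap += 1
        else:
            runs.append(gap)       # zeros preceding this "1"
            gap = 0
    return runs

def bit_is_set(runs, pos):
    """Testing one bit requires summing all preceding run lengths."""
    at = -1
    for gap in runs:
        at += gap + 1              # position of the next "1"
        if at == pos:
            return True
        if at > pos:
            return False
    return False
```

The sparse vector of Figure 4.4 (0000 1001 0000 0010 1000) compresses to the run lengths [4, 2, 6, 1]; note that testing the last bit forces a scan over all four runs, which is the slowness the BC method addresses next.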
Figure 4.5: Compression using run-length encoding. The notation [x] stands for the encoded value of number x.

4.3.1 Bit-block Compression (BC)

This method accelerates the search by sacrificing some space, compared to the run-length encoding technique. The compression method is based on bit-blocks, and was called BC (for bit-Block Compression). To speed up the searching, the sparse vector is divided into groups of consecutive bits (bit-blocks), and each bit-block is encoded individually. The size of the bit-blocks is chosen according to Eq. (A2.2) in Faloutsos and Christodoulakis (1987) to achieve good compression. For each bit-block we create a signature, which is of variable length and consists of at most three parts (see Figure 4.6):

Part I. It is one bit long, and it indicates whether there are any "1"s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

Part II. It indicates the number s of "1"s in the bit-block. It consists of s - 1 "1"s and a terminating zero. This is not the optimal way to record the number of "1"s; however, this representation is simple and it seems to give results close to the optimal [see Eq. (4.1.5) in Faloutsos and Christodoulakis (1987)].

Part III. It contains the offsets of the "1"s from the beginning of the bit-block (log b bits for each "1", where b is the bit-block size).

Figure 4.6 illustrates how the BC method compresses the sparse vector of Figure 4.4:

sparse vector    0000 1001 0000 0010 1000
-------------------------------------------
Part I              0    1    0    1    1
Part II                  10        0    0
Part III                 0011      10   00

Figure 4.6: Illustration of the BC method with bit-block size b = 4

Figure 4.7 illustrates the way to store the parts of a document signature: the first parts of all the bit-block signatures are stored consecutively, then the second parts, and so on:

0 1 0 1 1 | 10 0 0 | 00 11 10 00

Figure 4.7: BC method--Storing the signature by concatenating the parts. Vertical lines indicate the part boundaries.

4.3.2 Variable Bit-block Compression (VBC)

The BC method was slightly modified to become insensitive to changes in the number of words D per block; the modified method was called VBC (for Variable bit-Block Compression). This is desirable because the need to split messages into logical blocks is eliminated, thus simplifying the resolution of complex queries: There is no need to "remember" whether some of the terms of the query have appeared in one of the previous logical blocks of the message under inspection.

The idea is to use a different value for the bit-block size bopt for each message, according to the number W of bits set to "1" in the sparse vector. Analysis in Faloutsos and Christodoulakis (1987) shows how to choose the optimal size bopt of the bit-blocks for a document with W (distinct) words; arithmetic examples in the same paper indicate the advantages of the modified method.

Figure 4.8 illustrates an example layout of the signatures in the VBC method. The size of the sparse vector B is the same for all messages. The upper row corresponds to a small message with small W, the lower row to a message with large W. The upper row has a larger value of bopt, and thus fewer bit-blocks, a shorter Part I (the size of Part I is the number of bit-blocks), a shorter Part II (its size is W), and fewer but larger offsets in Part III (the size of each offset is log bopt bits).

Figure 4.8: An example layout of the message signatures in the VBC method

4.3.3 Performance

With respect to space overhead, the two methods (BC and VBC) require less space than SSF for the same false drop probability. With respect to insertions, the two methods are almost as easy as SSF; they require a few additional CPU cycles to do the compression. Their response time is slightly less than that of SSF, due to the decreased I/O requirements. The required main-memory operations are more complicated (decompression, etc.), but they are probably not the bottleneck; VBC achieves significant savings even on main-memory operations.
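The three-part bit-block encoding of Figure 4.6 can be sketched as follows (illustrative Python, not the paper's code; the choice of block size b is taken as given):

```python
# Sketch of BC (bit-Block Compression): encode a sparse bit vector into the
# three concatenated parts of Figure 4.6 (illustrative code).
def bc_encode(bits, b):
    part1, part2, part3 = "", "", ""
    offbits = (b - 1).bit_length()           # log b bits per offset
    for i in range(0, len(bits), b):
        block = bits[i:i + b]
        ones = [j for j, v in enumerate(block) if v]
        if not ones:
            part1 += "0"                     # Part I: empty bit-block
            continue
        part1 += "1"                         # Part I: nonempty bit-block
        part2 += "1" * (len(ones) - 1) + "0" # Part II: s-1 ones, terminating zero
        part3 += "".join(format(j, f"0{offbits}b") for j in ones)  # Part III
    return part1, part2, part3
```

For the sparse vector 0000 1001 0000 0010 1000 with b = 4 this yields Part I = 01011, Part II = 1000, and Part III = 00111000, matching Figure 4.6.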
Figure 4.9: Comparison of Fd of BC (dotted line) against SSF (solid line), as a function of the space overhead Ov. Analytical results, from Faloutsos and Christodoulakis (1987).

As Figure 4.9 shows, the BC method achieves 50 percent savings in false drops for documents with D = 40 vocabulary words each.

4.4 VERTICAL PARTITIONING

The idea behind vertical partitioning is to avoid bringing useless portions of the document signature into main memory; this can be achieved by storing the signature file in a bit-sliced form (Roberts 1979; Faloutsos and Chan 1988) or in a "frame-sliced" form (Lin and Faloutsos 1988).

4.4.1 Bit-Sliced Signature Files (BSSF)

The bit-sliced design is illustrated in Figure 4.10.

Figure 4.10: Transposed bit matrix

To allow insertions, we propose using F different files, one per bit position, which will be referred to as "bit-files." The method will be called BSSF, for "Bit-Sliced Signature Files." Figure 4.11 illustrates the proposed file structure.

Figure 4.11: File structure for Bit-Sliced Signature Files. The text file is omitted.

Searching for a single word requires the retrieval of m bit vectors (instead of all of the F bit vectors), which are subsequently ANDed together. The resulting bit vector has N bits, with "1"s at the positions of the qualifying logical blocks.

An insertion of a new logical block requires F disk accesses, one for each bit-file, but no rewriting! Thus, the method is applicable on WORM optical disks. As mentioned in the introduction, commercial optical disks do not allow a single bit to be written; thus, we have to use a magnetic disk that will hold the last page of each bit-file; when these pages become full, we will dump them on the optical disk. The size of each bit file can be predicted (Faloutsos and Chan 1988), and the appropriate space can be preallocated on the disk. The following data are taken from product specifications:

Micropolis 1350 Series (Winchester)
    Seek time (avg) = 28 ms
    Latency time (avg) = 8.33 ms
    Transfer rate = 10 Mbits/sec
    Sector size = 512 bytes
Alcatel Thomson Gigadisk GM 1001 (WORM)
    Seek time within +/- 100 tracks = 35 ms
    Seek time beyond current band = 200 ms
    Latency (avg) = 27 ms
    Transfer rate = 3.83 Mbits/sec
    Sector size = 1K bytes

4.4.2 B'SSF, a Faster Version of BSSF

The traditional design of BSSF suggests choosing the optimal value of m such that the document signatures contain "1"s with probability 50 percent. A typical value of m is of the order of 10 (Christodoulakis and Faloutsos 1984), which implies 10 random disk accesses on a single word query. In Lin and Faloutsos (1988) it is suggested to use a smaller than optimal value of m; thus, the number of random disk accesses decreases. The drawback is that, in order to maintain the same false drop probability, the document signatures have to be longer.

4.4.3 Frame-Sliced Signature File (FSSF)

The idea behind this method (Lin and Faloutsos 1988) is to force each word to hash into bit positions that are close to each other in the document signature. Then, these bits are stored together and can be retrieved with few random disk accesses. The main motivation for this organization is that random disk accesses are more expensive than sequential ones, since they involve movement of the disk arm.

More specifically, the method works as follows: The document signature (F bits long) is divided into k frames of s consecutive bits each. For each word in the document, one of the k frames is chosen by a hash function; then, using another hash function, the word sets m bits (not necessarily distinct) in that frame. F, s, k, m are design parameters. Figure 4.12 gives an example for this method.

Word    Signature
---------------------------
free    000000 110010
text    010110 000000
---------------------------
doc. signature   010110 110010

Figure 4.12: D = 2 words, F = 12, s = 6, k = 2, m = 3. The word "free" is hashed into the second frame and sets 3 bits there; the word "text" is hashed into the first frame and also sets 3 bits there.
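The frame-sliced word signature can be sketched as follows. This is illustrative Python, not the paper's code; the two hash functions are stand-ins built on MD5, and the parameter defaults mirror Figure 4.12.

```python
# Sketch of an FSSF word signature (illustrative; hash choices are
# assumptions): one frame is chosen per word, and m bits are set inside it.
import hashlib

def fssf_word_signature(word, F=12, k=2, m=3):
    s = F // k                                    # frame size in bits
    def h(seed):
        raw = hashlib.md5(f"{seed}|{word}".encode()).digest()
        return int.from_bytes(raw, "big")
    frame = h("frame") % k                        # the single chosen frame
    sig = 0
    for i in range(m):                            # m not-necessarily-distinct bits
        sig |= 1 << (frame * s + h(i) % s)
    return sig, frame
```

Because all of a word's bits land in one frame, a single-word query needs only that frame from disk, which is the point of the design.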
The signature matrix is stored framewise: each frame will be kept in consecutive disk blocks. This method will be referred to as Frame-Sliced Signature File (FSSF). Only one frame has to be retrieved for a single word query; that is, only one random disk access is required. At most n frames have to be scanned for an n-word query. Insertion will be much faster than in BSSF, since we need only access k frames instead of F bit-slices.

4.4.4 The Generalized Frame-Sliced Signature File (GFSSF)

In FSSF, each word selects only one frame and sets m bit positions in that frame. A more general approach is to select n distinct frames and set m bits (not necessarily distinct) in each frame to generate the word signature. The document signature is the OR-ing of all the word signatures of all the words in that document. This method is called the Generalized Frame-Sliced Signature File (GFSSF; Lin and Faloutsos 1988).

Notice that BSSF, B'SSF, FSSF, and SSF are actually special cases of GFSSF:

- When k = F and n = m, it reduces to the BSSF or B'SSF method.
- When n = 1, it reduces to the FSSF method.
- When k = 1, it becomes the SSF method (the document signature is broken down to one frame only).

4.4.5 Performance

Since GFSSF is a generalized model, we expect that a careful choice of the parameters will give a method that is better (whatever the criterion is) than any of its special cases. Analysis in the above paper gives formulas for the false drop probability and the expected response time for GFSSF and the rest of the methods. Figure 4.13 plots the theoretically expected performance of GFSSF, BSSF, B'SSF, and FSSF; notice that GFSSF is faster than BSSF, B'SSF, and FSSF, which are all its special cases. It is assumed that the transfer time for a page is Ttrans = 1 msec and the combined seek and latency time is Tseek = 40 msec.
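The generalization can be made concrete with a small sketch. This is illustrative Python, not the paper's code; a seeded random generator stands in for the hash functions, so the special cases above can be exercised by varying k, n, and m.

```python
# Sketch of a GFSSF word signature (illustrative): n distinct frames are
# chosen, and m (not necessarily distinct) bits are set inside each frame.
import random

def gfssf_word_signature(word, F, k, n, m):
    s = F // k                            # frame size in bits
    rng = random.Random(word)             # deterministic stand-in for hashing
    sig = 0
    for frame in rng.sample(range(k), n): # n distinct frames
        for _ in range(m):                # m bits within each chosen frame
            sig |= 1 << (frame * s + rng.randrange(s))
    return sig

# Special cases: k = F, n = m  -> BSSF/B'SSF;  n = 1 -> FSSF;  k = 1 -> SSF.
```

With k = F and m = 1 each chosen frame is a single bit, so the signature has exactly n distinct "1"s (the BSSF shape); with n = 1 all bits fall inside one frame (the FSSF shape).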
Figure 4.13: Response time vs. space overhead: a comparison between BSSF, B'SSF, FSSF and GFSSF

Experiments on a 2.8 Mb database showed good agreement between the theoretical and the experimental values for the false drop probability; the maximum relative error was 16 percent. The system was implemented in C under UNIX, and the experiments were conducted on a SUN 3/50 with a local disk, when the load was light. The text file was 2.8 Mb long, with average document size 1 Kb; each document contained D = 58 distinct noncommon words. The average response time ("real time") was 420 ms for FSSF with s = 63, m = 8, Ov = 18 percent, and 480 ms for GFSSF with s = 15, m = 3, n = 3, Ov = 18 percent.

4.5 VERTICAL PARTITIONING AND COMPRESSION

The idea in all the methods in this class (Faloutsos and Chan 1988) is to create a very sparse signature matrix, to store it in a bit-sliced form, and to compress each bit slice by storing the positions of the "1"s in the slice. The methods in this class are closely related to inversion with a hash table.

4.5.1 Compressed Bit Slices (CBS)

Although the bit-sliced method is much faster than SSF on retrieval, there may be room for two improvements:

1. On searching, each search word requires the retrieval of m bit files, exactly because each word signature has m bits set to "1". The search time could be improved if m was forced to be "1".

2. The insertion of a logical block requires too many disk accesses (namely F, which is typically 600-1,000).

If we force m = 1, then F has to be increased, in order to maintain the same false drop probability (see the formulas in Faloutsos and Chan [1988]). For the next three methods, we shall use S to denote the size of a signature, to highlight the similarity of these methods to inversion using hash tables. The corresponding bit matrix and bit files will be sparse, and they can be compressed. Figure 4.14 illustrates a sparse bit matrix.

Figure 4.14: Sparse bit matrix

The easiest way to compress each bit file is to store the positions of the "1"s. However, the size of each bit file is unpredictable now, subject to statistical variations. Therefore, we store each bit slice in buckets of size Bp, which is a design parameter; as a bit file grows, more buckets are allocated to it on demand. These buckets are linked together with pointers. Instead of storing the position of each "1" in a (compressed) bit file, we can store a pointer to the document in the text file. The set of all the compressed bit files will be called "level 1" or the "postings file," to agree with the terminology of inverted files (Salton and McGill 1983). The postings file consists of postings buckets, of size Bp bytes; each such bucket contains pointers to the documents in the text file, as well as an extra pointer, to point to an overflow postings bucket, if necessary. Thus, the file structure consists of a hash table (the directory), the postings file, and the text file.

Figure 4.15: Illustration of CBS

Searching is done by hashing a word to obtain the postings bucket address. This bucket, as well as its overflow buckets, will be retrieved, to obtain the pointers to the relevant documents. Figure 4.15 gives an example, assuming that the word "base" hashes to the 30th position (h("base") = 30), and that it appears in the document starting at the 1145th byte of the text file. Notice the following:
To reduce the false drops, the compressed bit files will contain pointers to the appropriate documents (or logical blocks). Instead of storing the position of each "1" in a (compressed) bit file, we can store a pointer to the document in the text file. There is no need to split documents into logical blocks any more. The set of all the compressed bit files will be called "level 1" or "postings file," to agree with the terminology of inverted files (Salton and McGill 1983).

Figure 4.14: Sparse bit matrix

The file structure we propose consists of a hash table, a postings file, and the text file, as in Figure 4.15. The postings file consists of postings buckets, of size Bp bytes (Bp is a design parameter). Each such bucket contains pointers to the documents in the text file, as well as an extra pointer, to point to an overflow postings bucket, if necessary. Figure 4.15 illustrates the proposed file structure, and gives an example, assuming that the word "base" hashes to the 30-th position (h("base") = 30), and that it appears in the document starting at the 1145-th byte of the text file.

Figure 4.15: Illustration of CBS

Searching is done by hashing a word to obtain the postings bucket address. This bucket, as well as its overflow buckets, if necessary, will be retrieved, to obtain the pointers to the relevant documents. The method is similar to hashing. The differences are the following:

a. The directory (hash table) is sparse. Traditional hashing schemes require loads of 80-90 percent; here, the hash table should be sparse, so there will be few collisions.

b. The actual word is stored nowhere. Thus, we save space and maintain a simple file structure.

4.5.2 Doubly Compressed Bit Slices (DCBS)

The motivation behind this method is to try to compress the sparse directory of CBS. The method is similar to CBS. It uses a hashing function h1().
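As a concrete illustration, the CBS lookup path (hash directory, postings chain, document pointers) can be sketched in a few lines. The function names and the toy hash below are hypothetical, and Python dictionaries stand in for the on-disk directory and postings buckets:

```python
# Sketch of a CBS-style lookup: a sparse hash directory whose slots point
# to postings lists of document offsets in the text file. The word itself
# is stored nowhere, so two words hashing to the same slot are synonyms
# and cause false drops. All names here are illustrative.

DIRECTORY_SIZE = 64        # S: kept large relative to the vocabulary (sparse)

def h(word):
    """Toy hash: map a word to a directory slot in 0..S-1."""
    return sum(word.encode()) % DIRECTORY_SIZE

def cbs_insert(directory, word, doc_offset):
    # Store a pointer to the document (its byte offset in the text file).
    directory.setdefault(h(word), []).append(doc_offset)

def cbs_search(directory, word):
    # Follow the slot's postings chain; every offset is a candidate
    # (actually or falsely) qualifying document.
    return directory.get(h(word), [])

directory = {}
cbs_insert(directory, "base", 1145)    # "base" starts at byte 1145
cbs_insert(directory, "data", 2300)
print(cbs_search(directory, "base"))
```

Note that, as in the text, the search returns candidate documents only; a false drop occurs exactly when a different word hashes to the same slot.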
The function h1() returns values in the range (0, (S-1)) and determines the slot in the directory. The difference is that DCBS makes an effort to distinguish among synonyms, by using a second hashing function h2(), which returns bit strings that are h bits long. These hash codes are stored in the "intermediate file," which consists of buckets of Bi bytes (a design parameter). Each such bucket contains records of the form (hashcode, ptr). The pointer ptr is the head of a linked list of postings buckets. Figure 4.16 illustrates an example, where the word "base" appears in the document that starts at the 1145th byte of the text file. The example also assumes that h = 3 bits, h1("base") = 30, and h2("base") = (011)2.

Figure 4.16: Illustration of DCBS

Searching for the word "base" is handled as follows:

Step 1: h1("base") = 30. The 30-th pointer of the directory will be followed. The corresponding chain of intermediate buckets will be examined.

Step 2: h2("base") = (011)2. The records in the above intermediate buckets will be examined. If a matching hash code is found (at most one will exist!), the corresponding pointer is followed, to retrieve the chain of postings buckets.

Step 3: The pointers of the above postings buckets will be followed, to retrieve the qualifying (actually or falsely) documents.

Insertion is omitted for brevity.

4.5.3 No False Drops Method (NFD)

This method avoids false drops completely, without storing the actual words in the index. The idea is to modify the intermediate file of the DCBS, and store a pointer to the word in the text file. Specifically, each record of the intermediate file will have the format (hashcode, ptr, ptr-to-word), where ptr-to-word is a pointer to the word in the text file. This way each word can be completely distinguished from its synonyms, using only h bits for the hash code and p (=4 bytes, usually) for the ptr-to-word. See Figure 4.17 for an illustration.

Figure 4.17: Illustration of NFD

The advantages of storing ptr-to-word instead of storing the actual word are two: (1) space is saved (a word from the dictionary is 8 characters long (Peterson 1980), compared with p = 3 bytes per pointer); and (2) the records of the intermediate file have fixed length, so there is no need for a word delimiter and there is no danger of a word crossing bucket boundaries. Searching is done in a similar way with DCBS. The only difference is that, whenever a matching hash code is found in Step 2, the corresponding ptr-to-word is followed, to avoid synonyms completely.

4.5.4 Performance

In Faloutsos and Chan (1988), an analytical model is developed for the performance of each of the above methods. Experiments on the 2.8 Mb database showed that the model is accurate. Figure 4.18 plots the theoretical performance of the methods (search time as a function of the overhead). Squares correspond to the CBS method, circles to DCBS, and triangles to NFD. The final conclusion is that these methods are fast, requiring few disk accesses; they introduce 20-25 percent space overhead; and they still require append-only operations on insertion.

Figure 4.18: Total disk accesses on successful search versus space overhead. Analytical results for the 2.8 Mb database.

4.6 HORIZONTAL PARTITIONING

The motivation behind all these methods is to avoid the sequential scanning of the signature file (or its bit-slices), in order to achieve better than O(N) search time. Thus, they group the signatures into sets, partitioning the signature matrix horizontally. The grouping criterion can be decided beforehand, in the form of a hashing function h(S), where S is a document signature (the data independent case). Alternatively, the groups can be determined on the fly, using a hierarchical structure (e.g., a B-tree: the data dependent case).

4.6.1 Data Independent Case

Gustafson's method. The earliest approach was proposed by Gustafson (1971). Suppose that we have records with, say, six attributes each. For example, records can be documents and attributes can be keywords describing the document. Consider a hashing function h that hashes a keyword w to a number h(w) in the range 0-15. The signature of a keyword is a string of 16 bits, all of which are zero except for the bit at position h(w).
The record signature is created by superimposing the corresponding keyword signatures. If k < 6 bits are set in a record signature, an additional 6 - k bits are set by some random method. Thus, there are comb(16, 6) = 8,008 possible distinct record signatures (where C(m, n) denotes the combinations of m choose n items). Using a hash table with 8,008 slots, we can map each record signature to one such slot as follows: Let p1 < p2 < . . . < p6 be the positions where the "1"s occur in the record signature. Then the function C(p1, 1) + C(p2, 2) + . . . + C(p6, 6) maps each distinct record signature to a number in the range 0-8,007.

The interesting point of the method is that the extent of the search decreases quickly (almost exponentially) with the number of terms in the (conjunctive) query. Single word queries touch C(15, 5) = 3,003 slots of the hash table, two-word queries touch C(14, 4) = 1,001 slots, and so on.
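The mapping from a record signature to a hash-table slot can be checked directly. The sketch below implements the combinatorial function described above and verifies, by brute force over all C(16, 6) signatures, that it is a bijection onto the slots 0-8,007 (the function name is ours, not Gustafson's):

```python
# Sketch of Gustafson's slot mapping. A record signature has exactly six
# '1' bits among 16; with their positions p1 < p2 < ... < p6 (0-based),
# the slot number is C(p1,1) + C(p2,2) + ... + C(p6,6).
from itertools import combinations
from math import comb

def gustafson_slot(positions):
    assert len(positions) == 6
    return sum(comb(p, i) for i, p in enumerate(sorted(positions), start=1))

# Brute force over all C(16,6) = 8,008 signatures: the mapping is a
# bijection onto the range 0..8,007, so the hash table has no wasted slots.
slots = {gustafson_slot(c) for c in combinations(range(16), 6)}
print(len(slots), min(slots), max(slots))   # 8008 0 8007
```

This is the classical combinatorial number system; the lowest signature (bits 0-5 set) maps to slot 0 and the highest (bits 10-15 set) to slot 8,007.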
Although elegant, Gustafson's method suffers from some practical problems:

1. Its performance deteriorates as the file grows.

2. If the number of keywords per document is large, then either we must have a huge hash table or usual queries (involving 3-4 keywords) will touch a large portion of the database.

3. Queries other than conjunctive ones are handled with difficulty.

4.6.2 Data Dependent Case

Two-level signature files. Sacks-Davis and his colleagues (1983, 1987) suggested using two levels of signatures. Their documents are bibliographic records of variable length. The first level of signatures consists of document signatures that are stored sequentially, as in the SSF method. The second level consists of "block signatures": each such signature corresponds to one block (group) of bibliographic records, and is created by superimposing the signatures of all the words in this block, ignoring the record boundaries.

Partitioned signature files. Lee and Leng (1989) proposed a family of methods that can be applied for longer documents. They suggested using a portion of a document signature as a signature key to partition the signature file. For example, we can choose the first 20 bits of a signature as its key, and all signatures with the same key will be grouped into a so-called "module." When a query signature arrives, we will first examine its signature key and look for the corresponding modules, then scan all the signatures within those modules that have been selected, depending on the number of bits being specified in the query signature. They report 15 to 85 percent speedups over SSF.
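A minimal sketch of the partitioned-signature idea, with a 4-bit key instead of the 20-bit key of the example (all names and sizes here are illustrative): a module is scanned only when its key covers every "1" bit of the query key, mirroring the usual superimposed-coding match rule.

```python
# Sketch of a partitioned signature file: the first KEY_BITS bits of each
# 16-bit signature form its key, and signatures sharing a key are grouped
# into a "module". A query scans a module only if the module key contains
# all '1' bits of the query key; within a module, the normal signature
# test (sig & query == query) selects candidates.

SIG_BITS = 16
KEY_BITS = 4               # the chapter's example uses the first 20 bits

def key_of(sig):
    return sig >> (SIG_BITS - KEY_BITS)

def insert(modules, sig):
    modules.setdefault(key_of(sig), []).append(sig)

def search(modules, query_sig):
    qkey = key_of(query_sig)
    hits = []
    for key, sigs in modules.items():
        if key & qkey == qkey:             # module may contain matches
            hits += [s for s in sigs if s & query_sig == query_sig]
    return hits

modules = {}
for s in (0b1010_0110_0000_1010, 0b0010_0000_1000_0010, 0b1110_0001_0000_1111):
    insert(modules, s)
print(search(modules, 0b0100_0000_0000_0001))  # only the third signature qualifies
```

The more bits the query specifies inside the key region, the fewer modules survive the key test, which is the source of the reported speedups over plain sequential scanning.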
Returning to the two-level method: the second level is stored in a bit-sliced form. Searching is performed by scanning the block signatures first, and then concentrating on those portions of the first-level signature file that seem promising. Each level has its own hashing functions that map words to bit positions.

A subtle problem arises when multiterm conjunctive queries are asked (e.g., "data and retrieval"): a block may result in an unsuccessful block match, because it may contain the desired terms, but not within the same record. The authors propose a variety of clever ways to minimize these block matches. Analysis on a database with N = 10^6 records (with 128 bytes per record on the average) reported response times as low as 0.3 seconds for single word queries, when 1 record matched the query. The BSSF method required 1-5 seconds for the same situation.

S-tree. Deppisch (1986) proposed a B-tree like structure to facilitate fast access to the records (which are signatures) in a signature file. The leaf of an S-tree consists of k "similar" (i.e., with small Hamming distance) document signatures, along with the document identifiers. The OR-ing of these k document signatures forms the "key" of an entry in an upper level node, which serves as a directory for the leaves. Recursively, we construct directories on lower level directories until we reach the root. The S-tree is kept balanced in a similar manner as B-trees: when a leaf node overflows, it is split in two groups of "similar" signatures; the father node is changed appropriately to reflect the new situation. Splits may propagate upward until reaching the root. The insertion requires a few disk accesses (proportional to the height of the tree at worst), but the append-only property is lost. Another problem is that higher level nodes may contain keys that have many 1's and thus become useless; also, the response time on queries is difficult to estimate analytically.

4.7 DISCUSSION

Signature files provide a space-time trade-off between full text scanning, which is slow but requires no overheads, and inversion, which is fast but requires expensive insertions and needs significant space overhead. Thus, signature-based methods have been applied in the following environments:

1. Medium-size databases. "Hyperties" (Lee et al. 1988), a hypertext package that is commercially available on IBM PCs, uses the SSF method, due to its simplicity, low space overhead, and satisfactory search time (3-10 seconds, for queries with low selectivities) on databases of 200 Kb (after compression). Thanks to the low overhead, the whole package (software, data, and indices) fits in floppy diskettes. Another commercial product (Kimbrell 1988) is also reported to use signature files for text retrieval.

2. Databases with low query frequencies. In this case, it is not worth building and maintaining a B-tree inverted index; signature files provide a low-cost alternative that avoids scanning the full text. This situation is probably true for message files in Office Automation, and several prototype systems use the signature approach, like the Office Filing Project (Tsichritzis et al. 1983) at the University of Toronto, MINOS (Christodoulakis et al. 1986) at the University of Waterloo, and the MULTOS project (Croft and Savino 1988), funded by the ESPRIT project of the European Community.

3. Due to the append-only insertion, signature-based methods can be used on WORM optical disks.

4. Due to the linear searching, signature files can easily benefit from parallelism. Stanfill and Kahle (1986) used signature files on the Connection Machine (Hillis 1985). Keeping the whole signature file in core, they can provide fast responses against databases that are 10 times larger than the available main memory storage.

5. In distributed environments, keeping copies of the signature files of all the sites can save remote-logins, for a modest space overhead.

Chang and Schek (1989) recommend signature files as a simple and flexible access method for records and "long fields" (= text and complex objects) in IBM's STARBURST extensible DBMS.

REFERENCES

CHANG, W., and SCHEK, H. 1989. "A Signature Access Method for the Starburst Database System." Proc. VLDB Conference, Amsterdam, Netherlands, Aug. 22-25, pp. 145-53.

CHRISTODOULAKIS, S. 1987. "Analysis of Retrieval Performance for Records and Objects Using Optical Disk Technology." ACM TODS, 12(2), 137-69.

CHRISTODOULAKIS, S., and FALOUTSOS, C. 1984. "Design Considerations for a Message File Server." IEEE Trans. on Software Engineering, SE-10(2), 201-10.

CHRISTODOULAKIS, S., THEODORIDOU, M., HO, F., PAPA, M., and PATHRIA, A. 1986. "Multimedia Document Presentation, Information Extraction and Document Formation in MINOS: A Model and a System." ACM TOOIS, 4(4).

CHRISTODOULAKIS, S., et al. 1986. "The Multimedia Object Presentation Manager in MINOS: A Symmetric Approach." Proc. ACM SIGMOD.
CROFT, W. B., and SAVINO, P. 1988. "Implementing Ranking Strategies Using Text Signatures." ACM Trans. on Office Information Systems (TOOIS), 6(1), 42-62.

DEPPISCH, U. 1986. "S-tree: A Dynamic Balanced Signature Index for Office Retrieval." Proc. of ACM "Research and Development in Information Retrieval," Pisa, Italy, September 8-10, pp. 77-87.

EWING, J., MEHRABANZAD, S., SHECK, S., OSTROFF, D., and SHNEIDERMAN, B. 1986. "An Experimental Comparison of a Mouse and Arrow-jump Keys for an Interactive Encyclopedia." Int. Journal of Man-Machine Studies, 24(1), 29-45.

FALOUTSOS, C. 1985. "Access Methods for Text." ACM Computing Surveys, 17(1), 49-74.

FALOUTSOS, C., and CHAN, R. 1988. "Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison." Proc. 14th International Conf. on VLDB, Long Beach, Calif., August, pp. 280-93.

FALOUTSOS, C., and CHRISTODOULAKIS, S. 1984. "Signature Files: An Access Method for Documents and its Analytical Performance Evaluation." ACM Trans. on Office Information Systems (TOOIS), 2(4), 267-88.

FALOUTSOS, C., and CHRISTODOULAKIS, S. 1987. "Description and Performance Analysis of Signature File Methods." ACM TOOIS, 5(3), 237-57.

FUJITANI, L. 1984. "Laser Optical Disk: The Coming Revolution in On-Line Storage." CACM, 27(6), 546-54.

GALLAGER, R. G., and VAN VOORHIS, D. C. 1975. "Optimal Source Codes for Geometrically Distributed Integer Alphabets." IEEE Trans. on Information Theory, IT-21, 228-30.

GOLOMB, S. W. 1966. "Run Length Encodings." IEEE Trans. on Information Theory, IT-12, 399-401.

GONNET, G. H. 1982. "Unstructured Data Bases." Tech Report CS-82-09, University of Waterloo.

GONNET, G. H., and TOMPA, F. W. 1987. "Mind Your Grammar: A New Approach to Modelling Text." Proc. of the Thirteenth Int. Conf. on Very Large Data Bases, Brighton, England, September 1-4, pp. 339-346.

GUSTAFSON, R. A. 1971. "Elements of the Randomized Combinatorial File Structure." Proc. of the Symposium on Information Storage and Retrieval, pp. 163-74.

HASKIN, R. L. 1981. "Special-Purpose Processors for Text Retrieval." Database Engineering, 4(1).

HILLIS, D. 1985. The Connection Machine. Cambridge, Mass.: MIT Press.

HOLLAAR, L. A. 1979. "Text Retrieval Computers." IEEE Computer Magazine, 12(3), 40-50.

HOLLAAR, L. A., SMITH, K. F., CHOW, W. H., EMRATH, P., and HASKIN, R. L. 1983. "Architecture and Operation of a Large, Full-Text Information-Retrieval System," in Advanced Database Machine Architecture, D. Hsiao, ed., Englewood Cliffs, N.J.: Prentice Hall, pp. 256-99.

KIMBRELL, R. E. 1988. "Searching for Text? Send an N-gram!" Byte, 13(5), 297-312.

LEE, D. L., and LENG, C.-W. 1989. "Partitioned Signature File: Designs and Performance Evaluation." ACM Trans. on Information Systems (TOIS), 7(2), 158-80.

LEE, R., PLAISANT, C., and SHNEIDERMAN, B. 1988. "Incorporating String Search in a Hypertext System: User Interface and Physical Design Issues." Working paper, Dept. of Computer Science, University of Maryland, College Park.

LIN, Z., and FALOUTSOS, C. 1988. "Frame Sliced Signature Files." CS-TR-2146 and UMIACS-TR-88-88, Dept. of Computer Science, University of Maryland, College Park.

LIPMAN, D. J., and PEARSON, W. R. 1985. "Rapid and Sensitive Protein Similarity Searches." Science, 227 (March 22), 1435-41. American Association for the Advancement of Science.

MCILROY, M. D. 1982. "Development of a Spelling List." IEEE Trans. on Communications, COM-30(1), 91-99.

MOOERS, C. 1949. "Application of Random Codes to the Gathering of Statistical Information." Bulletin 31, Zator Co., Cambridge, Mass. Based on M.S. thesis, MIT, January 1948.

NOFEL, P. J. 1986. "40 Million Hits on Optical Disk." Modern Office Technology, March, 84-88.

PETERSON, J. L. 1980. "Computer Programs for Detecting and Correcting Spelling Errors." CACM, 23(12), 676-87.

PRICE, J. 1984. "The Optical Disk Pilot Project at the Library of Congress." Videodisc and Optical Disk, 4(6), 424-32.

ROBERTS, C. S. 1979. "Partial-Match Retrieval via the Method of Superimposed Codes." Proc. IEEE, 67(12), 1624-42.

SACKS-DAVIS, R., and RAMAMOHANARAO, K. 1983. "A Two Level Superimposed Coding Scheme for Partial Match Retrieval." Information Systems, 8(4), 273-80.

SACKS-DAVIS, R., KENT, A., and RAMAMOHANARAO, K. 1987. "Multikey Access Methods Based on Superimposed Coding Techniques." ACM Trans. on Database Systems (TODS), 12(4), 655-96.

SALTON, G., and MCGILL, M. J. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

STANDISH, T. A. 1984. "An Essay on Software Reuse." IEEE Trans. on Software Engineering, SE-10(5), 494-97.

STANFILL, C., and KAHLE, B. 1986. "Parallel Free-Text Search on the Connection Machine System." CACM, 29(12), 1229-39.

STIASSNY, S. 1960. "Mathematical Analysis of Various Superimposed Coding Methods." American Documentation, 11(2), 155-69.

THOMA, G. R., SUTHASINEKUL, S., WALKER, F., COOKSON, J., and RASHIDIAN, M. 1985. "A Prototype System for the Electronic Storage and Retrieval of Document Images." ACM TOOIS, 3(3).

TSICHRITZIS, D., and CHRISTODOULAKIS, S. 1983. "Message Files." ACM TOOIS, 1(1), 88-98.

TSICHRITZIS, D., CHRISTODOULAKIS, S., ECONOMOPOULOS, P., FALOUTSOS, C., LEE, A., VANDENBROEK, J., and WOO, C. 1983. "A Multimedia Office Filing System." Proc. 9th International Conference on VLDB, Florence, Italy, October-November.
CHAPTER 5: NEW INDICES FOR TEXT: PAT TREES AND PAT ARRAYS

Gaston H. Gonnet
Dept. of Computer Science, ETH, Zurich, Switzerland

Ricardo A. Baeza-Yates
Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile

Tim Snider
Centre for the New OED and Text Research, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Abstract

We survey new indices for text, with emphasis on PAT arrays (also called suffix arrays). A PAT array is an index based on a new model of text that does not use the concept of word and does not need to know the structure of the text.

5.1 INTRODUCTION

Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing. In this chapter we discuss two new lexicographical indices for text, called PAT trees and PAT arrays. Our aim is to build an index for the text of size similar to or smaller than the text.

The traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated to each keyword. This model is oriented to library applications, which it serves quite well. For more general applications, it has some problems, namely:

A basic structure is assumed (documents and words). This may be reasonable for many applications, but not for others. Each document is considered a database record, and each keyword a value or a secondary key; because the number of keywords is variable, common database techniques are not useful in this context. Typical database queries are on equality or on ranges.
PATTM (Gonnet 1987. We prefer a different model. This structure was originally described by Gonnet in the paper "Unstructured Data Bases" by Gonnet (1983). the string that starts at that position and extends arbitrarily far to the right. all words except for those deemed to be too common (called stopwords) are indexed. We see the text as one long string. PAT arrays were independently discovered by Gonnet (1987) and Manber and Myers (1990). common database techniques are not useful in this context. instead of indexing a set of keywords.htm (2 of 16)7/3/2004 4:19:40 PM . how to do some text searches and algorithms to build two of its possible implementations. each document is considered a database record.Books_Algorithms_Collection2ed/books/book5/chap05. This model is simpler and does not restrict the query domain... it can be used. Fawcett 1989).2 THE PAT TREE STRUCTURE The PAT tree is a data structure that allows very efficient searching with preprocessing. Typical data-base queries are on equality or on ranges. or automatically by a computer. The main advantages of this model are: No structure of the text is needed. 5. used with the Oxford English Dictionary (OED). Manber and Myers' motivation was searching in large genetic databases.Information Retrieval: CHAPTER 5: NEW INDICES FOR TEXT: PAT TREES AND Keywords must be extracted from the text (this is called "indexing"). This section describes the PAT data structure. In the traditional text model. on any substring of the text. PAT arrays are an efficient implementation of PAT trees. We will explain how to build and how to search PAT arrays. It is not difficult to see that any two strings not at the same position are different." This paper describes PAT trees and PAT arrays. Gonnet used them for the implementation of a fast text searching system. and support a query language more powerful than do traditional structures based on keywords and Boolean operations. although if there is one. 
whether it is done by a person. Because the number of keywords is variable. Each position in the text corresponds to a semi-infinite string (sistring). almost any searching structure can be used to support this view of text. or to the end of the text. Queries are restricted to keywords. Furthermore. For some indices. and each keyword a value or a secondary key. No keywords are used. They seldom consider "approximate text searching. In 1985 it was implemented and later used in conjunction with the file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. The queries are based on prefixes of sistrings. that is. This task is not trivial and error prone.
The name of the implementation, the PAT system, has become well known in its own right as a software package for very fast string searching.

5.2.1 Semi-infinite Strings

In what follows, we will use a very simple model of text. Our text, or text database, will consist of a single (possibly very long) array of characters, numbered sequentially from one onward. Whether the text is already presented as such, or whether it can be viewed as such, is not relevant. To apply our algorithms it is sufficient to be able to view the entire text as an array of characters.

A semi-infinite string is a subsequence of characters from this array, taken from a given starting point but going on as necessary to the right. In case the semi-infinite string (sistring) is used beyond the end of the actual text, special null characters will be considered to be added at its end, these characters being different than any other in the text. The name semi-infinite is taken from the analogy with geometry, where we have semi-infinite lines: lines with one origin, but infinite in one direction. Sistrings are uniquely identified by the position where they start; for a given, fixed text, this is simply given by an integer.

Example:

Text:        Once upon a time, in a far away land . . .
sistring 1:  Once upon a time . . .
sistring 2:  nce upon a time . . .
sistring 8:  on a time, in a . . .
sistring 11: a time, in a far . . .
sistring 22: a far away land . . .

Sistrings can be defined formally as an abstract data type, and as such present a very useful and important model of text. For the purpose of this section, the most important operation on sistrings is the lexicographical comparison of sistrings, and it will be the only one defined. This comparison is the one resulting from comparing two sistrings' contents (not their positions). Note that unless we are comparing a sistring to itself, the comparison of two sistrings cannot yield equal: sooner or later, by inspecting enough characters, we will have to find a character where they differ, even if we have to start comparing the fictitious null characters at the end of the text. For example, the above sistrings will compare as follows:

22 < 11 < 2 < 8 < 1

Of the first 22 sistrings (using ASCII ordering), the lowest sistring is "a far away . . ." and the highest is "upon a time . . .".

5.2.2 PAT Tree

A PAT tree is a Patricia tree (Morrison 1968; Knuth 1973; Flajolet and Sedgewick 1986; Gonnet 1988) constructed over all the possible sistrings of a text. A Patricia tree is a digital tree where the individual bits of the keys are used to decide on the branching: a zero bit will cause a branch to the left subtree, and a one bit will cause a branch to the right subtree. Hence Patricia trees are binary digital trees. In addition, Patricia trees have in each internal node an indication of which bit of the query is to be used for branching. This may be given by an absolute bit position, or by a count of the number of bits to skip. This allows internal nodes with single descendants to be eliminated, and thus all internal nodes of the tree produce a useful branching, that is, both subtrees are non-null. Patricia trees are very similar to compact suffix trees or compact position trees (Aho et al. 1974).

Patricia trees store key values at external nodes; the internal nodes have no key information, just the skip counter and the pointers to the subtrees. The external nodes in a PAT tree are sistrings, that is, they contain a reference to a sistring. For a text of size n, there are n external nodes in the PAT tree and n - 1 internal nodes. This makes the tree O(n) in size, with a relatively small asymptotic constant. Later we will want to store some additional information (the size of the subtree, and which is the taller subtree) with each internal node, but this information will always be of a constant size.

Figure 5.1 shows an example of a PAT tree over a sequence of bits (normally it would be over a sequence of characters); we show the Patricia tree for the text "01100100010111 . . ." after the first 8 sistrings have been inserted. External nodes are indicated by squares, and they contain a reference to a sistring; internal nodes are indicated by a circle and contain a displacement. In this case we have used, in each internal node, the total displacement of the bit to be inspected, rather than the skip value, just for the purpose of making the example easier to understand. Notice that to reach the external node for the query 00101 we first inspect bit 1 (it is a zero, we go left), then bit 2 (it is zero, we go left), then bit 3 (it is a one, we go right), and then bit 5 (it is a one, we go right). Because we may skip the inspection of some bits (in this case bit 4), once we reach our desired node we have to make one final comparison with one of the sistrings stored in an external node of the current subtree, to ensure that all the skipped bits coincide. If they do not coincide, then the key is not in the tree.

Figure 5.1: PAT tree when the sistrings 1 through 8 have been inserted
5.2.3 Indexing Points

So far we have assumed that every position in the text is indexed, that is, the Patricia tree has n external nodes, one for each position in the text. For some types of search this is desirable, but in some other cases not all points are necessary or even desirable to index. For example, if we are interested in word and phrase searches, then only those sistrings that are at the beginning of words (about 20% of the total for common English text) are necessary. The decision of how many sistrings to include in the tree is application dependent, and will be a trade-off between size of the index and search requirements.

5.3 ALGORITHMS ON THE PAT TREE

In this section we will describe some of the algorithms for text searching when we have a PAT tree of our text.

5.3.1 Prefix Searching

Notice that, by construction, every subtree of the PAT tree has all the sistrings with a given prefix. Then prefix searching in a PAT tree consists of searching the prefix in the tree up to the point where we exhaust the prefix, or up to the point where we reach an external node. At this point we need to verify whether we could have skipped bits. This is done with a single comparison of any of the sistrings in the subtree (considering an external node as a subtree of size one). If this comparison is successful, then all the sistrings in the subtree (which share the common prefix) are the answer; otherwise there are no sistrings in the answer.

It is important to notice that the search ends when the prefix is exhausted or when we reach an external node, and at that point all the answer is available (regardless of its size) in a single subtree. For random Patricia trees, the height is O(log n) (Pittel 1985; Apostolico and Szpankowski 1987), and consequently with PAT trees we can do arbitrary prefix searching in O(log n) time, independent of the size of the answer. In practice, the length of the query is less than O(log n); thus the searching time is proportional to the query length. By keeping the size of each subtree in each internal node, we can trivially find the size of any matched subtree. (Knowing the size of the answer is very appealing for information retrieval purposes.) Figure 5.2 shows the search for the prefix "10100" and its answer.

Figure 5.2: Prefix searching
4 Longest Repetition Searching The longest repetition of a text is defined as the match between two different positions of a text where this match is the longest (the most number of characters) in the entire text. More precisely. we search each end of the defining intervals and then collect all the subtrees between (and including) them. the longest repetition can be found while building the tree and it is a constant. In this case.Books_Algorithms_Collection2ed/books/book5/chap05. the tallest internal node gives a pair of sistrings that match for the greatest number of characters. 5. . By keeping such a bit.3. and checking if the distance between positions (and order. we can find one of the longest repetitions starting with an arbitrary prefix in O(log n) time.htm (6 of 16)7/3/2004 4:19:40 PM .3 Range Searching Searching for all the strings within a certain range of values (lexicographical range) can be done equally efficiently. sort by position the smaller of the two answers. which will indicate on which side we have the tallest subtree. but for a subtree. It should be noticed that only O(height) subtrees will be in the answer even in the worst case (the worst case is 2 height . tallest means not only the shape of the tree but has to consider the skipped bits also. This means searching for the longest repetition among all the strings that share a common prefix. It is also interesting and possible to search for the longest repetition not just for the entire tree/text. the longest repetition will be given by the tallest internal node in the PAT tree. we can return the answer and the size of the answer in time O(log n) independent of the actual size of the answer. that is. This can be done in O(height) time by keeping one bit of information at each internal node. range searching is defined as searching for all strings that lexicographically compare between two given strings. For example." but not "abacus" or "acrimonious." To do range searching on a PAT tree. 
5.3.3 Range Searching

Searching for all the strings within a certain range of values (lexicographical range) can be done equally efficiently. More precisely, range searching is defined as searching for all strings that lexicographically compare between two given strings. For example, the range "abc" .. "acc" will contain strings like "abracadabra," "acacia," and "aboriginal," but not "abacus" or "acrimonious."

To do range searching on a PAT tree, we search each end of the defining interval and then collect all the subtrees between (and including) them. It should be noticed that only O(height) subtrees will be in the answer even in the worst case (the worst case is 2 × height - 1), and hence only O(log n) time is necessary in total. As before, we can return the answer and the size of the answer in time O(log n), independent of the actual size of the answer.

5.3.4 Longest Repetition Searching

The longest repetition of a text is defined as the match between two different positions of a text where this match is the longest (the most number of characters) in the entire text. For a given text, the longest repetition will be given by the tallest internal node in the PAT tree; that is, the tallest internal node gives a pair of sistrings that match for the greatest number of characters. Here, tallest means not only the shape of the tree: the skipped bits have to be considered as well. For a given text, the longest repetition can be found while building the tree, and it is a constant; that is, it will not change unless we change the tree (that is, the text).

It is also interesting, and possible, to search for the longest repetition not just for the entire tree/text, but for a subtree; that is, to search for the longest repetition among all the strings that share a common prefix. This can be done in O(height) time by keeping one bit of information at each internal node, which indicates on which side we have the tallest subtree. By keeping such a bit, we can find one of the longest repetitions starting with an arbitrary prefix in O(log n) time. If we want to search for all of the longest repetitions, we need two bits per internal node (to indicate equal heights as well), and the search becomes logarithmic in height and linear in the number of matches.
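In the array view of a PAT tree (introduced later in this chapter), the pair of sistrings separated by the tallest internal node are lexicographically adjacent, so the longest repetition is the longest common prefix of two adjacent entries. A minimal Python sketch, assuming the array is built by naive sorting (fine for illustration, far too slow for large texts):

```python
def longest_repetition(text):
    """Longest repeated substring: the maximal common prefix of two
    lexicographically adjacent sistrings (the 'tallest internal node')."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # PAT array by naive sorting
    best = ""
    for a, b in zip(sa, sa[1:]):
        s, t = text[a:], text[b:]
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:    # common-prefix length
            k += 1
        if k > len(best):
            best = s[:k]
    return best

print(longest_repetition("abracadabra"))  # "abra", matching positions 0 and 7
```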
5.3.5 "Most Significant" or "Most Frequent" Searching

This type of search has great practical interest, but is slightly difficult to describe. By "most significant" or "most frequent" we mean the most frequently occurring strings within the text database. For example, finding the "most frequent" trigram means finding a sequence of three letters that appears most often within our text. In terms of the PAT tree, the number of occurrences of a trigram is given by the size of the subtree at a distance of 3 characters from the root. So finding the most frequent trigram is equivalent to finding the largest subtree at a distance of 3 characters from the root. This can be achieved by a simple traversal of the PAT tree which is at most O(n/average size of answer), but is usually much faster. Techniques similar to alpha-beta pruning can be used to improve this search.

Searching for the "most common" word is slightly more difficult, but uses a similar algorithm. A word could be defined as any sequence of characters delimited by a blank space; in this case the traversal is only done in a subtree (the subtree of all sistrings starting with a blank) and does not have a constant depth: it traverses the tree to the places where each second blank appears. We may also apply this algorithm within any arbitrary subtree, that is, to all the strings that follow some given prefix. In all cases, finding the most frequent string with a certain property requires a subtree selection and then a tree traversal which is at most O(n/k), where k is the average size of each group of strings with the given property, but is typically much smaller.
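The trigram search can be emulated on the array view: sistrings sharing a k-character prefix occupy a contiguous run of the sorted order, so the largest subtree at depth k corresponds to the longest run of equal k-grams. A small Python sketch (the function name is ours, and sorting the k-grams directly stands in for walking the tree):

```python
def most_frequent_ngram(text, k):
    """Most frequent k-gram and its count, via the longest run of equal
    k-character prefixes in sorted order (the largest depth-k subtree)."""
    grams = sorted(text[i:i + k] for i in range(len(text) - k + 1))
    best, best_count, run_start = None, 0, 0
    for i in range(1, len(grams) + 1):
        if i == len(grams) or grams[i] != grams[run_start]:  # a run of equal grams ends
            if i - run_start > best_count:
                best, best_count = grams[run_start], i - run_start
            run_start = i
    return best, best_count

print(most_frequent_ngram("aababab", 2))  # ('ab', 3)
```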
5.3.6 Regular Expression Searching

The main steps of the algorithm due to Baeza-Yates and Gonnet (1989) are:

a. Convert the regular expression passed as a query into a minimized deterministic finite automaton (DFA). This may take exponential space/time with respect to the query size, but is independent of the size of the text (Hopcroft and Ullman 1979).

b. Convert the character DFA into a binary DFA, using any suitable binary encoding of the input alphabet; each state will then have at most two outgoing transitions, one labeled 0 and one labeled 1.

c. Next, eliminate outgoing transitions from final states (see the justification in step (e)). This may induce further minimization.
d. Simulate the binary DFA on the binary digital trie of all sistrings of the text, using the same binary encoding as in step (b). That is, associate the root of the tree with the initial state and, for any internal node associated with state i, associate its left descendant with state j if i goes to j on a bit 0, and its right descendant with state k if i goes to k on a bit 1 (see Figure 5.3).

e. For every node of the index associated with a final state, accept the whole subtree and halt the search in that subtree. (For this reason, we do not need outgoing transitions in final states.)

f. On reaching an external node, run the remainder of the automaton on the single string determined by this external node.

Figure 5.3: Simulating the automaton on a binary digital tree

For random text, it is possible to prove that for any regular expression the average searching time is sublinear (because of step e), of the form O(log^m(n) n^a), where a < 1 and m >= 0 (an integer) depend on the incidence matrix of the simulated DFA (that is, they depend on the regular expression). For details, see Baeza-Yates and Gonnet (1989).

5.4 BUILDING PAT TREES AS PATRICIA TREES

The implementation of PAT trees using conventional Patricia trees is the obvious one, except for some implementation details which cannot be overlooked, as they would increase the size of the index or its accessing time quite dramatically. It is easy to see that the internal nodes will be between 3 and 4 words in size. Each external node could be one word, and consequently we are talking about between 4n and 5n words for the index, or about 18n characters for indexing n characters. Since this type of index is typically built over very large text databases, this is most likely unacceptable from the practical point of view, and an economic implementation is mandatory.

There are two problems. First, the size of the index should be reduced. Second, the organization of the tree should be such that we can gain from the reading of large records (external files will use physical records which are certainly larger than one internal node).
The main ideas to solve (or alleviate) both problems are (a) bucketing of external nodes, and (b) mapping the tree onto the disk using supernodes.

Collecting more than one external node into a bucket is an old idea in data processing. A bucket replaces any subtree with size less than a certain constant (say b) and hence saves up to b - 1 internal nodes. The external nodes inside a bucket do not have any structure associated with them, so any search that reaches the bucket has to be done on all the members of the bucket; this increases the number of comparisons for each search, in the worst case by b. Unfortunately, it is not possible to have all buckets full: assuming a random distribution of keys, on average the number of keys per bucket is b ln 2. In other words, by bucketing we have about n/(b ln 2) internal nodes for random strings, instead of n - 1. Hence, buckets save a significant number of internal nodes: the overhead of the internal nodes, which are the largest part of the index, can be cut down by a factor of b ln 2. We have then a very simple trade-off between time (a factor of b) and space (a factor of b ln 2).

The second idea is to map the tree onto the disk using supernodes. The main idea is simple: we allocate as much of the tree as possible in a disk page, as long as we preserve a unique entry point for every page. In effect, every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages (notice that we need to count disk page accesses only, as each disk page has a single entry point). With these constraints, disk pages will contain on the order of 1,000 internal/external nodes. This means that, on average, each disk page will contain about 10 steps of a root-to-leaf path, or in other words, that the total number of accesses is about a tenth of the height of the tree. Since it is very easy to keep the root page of the tree in memory, a typical prefix search can be accomplished with 2-3 disk accesses to read the index (about 30 to 40 tree levels) and one additional final access to verify the skipped bits (buckets may require additional reading of strings). This implementation is the most efficient in terms of disk accesses for this type of search.

The pointers in internal nodes will address either a disk page or another node inside the same page, and consequently can be substantially smaller than full pointers (typically about half a word is enough). This further reduces the storage cost of internal nodes. Of course, not all disk pages will be 100 percent full; a bottom-up greedy construction guarantees at least 50 percent occupation, and actual experiments indicate an occupation close to 80 percent. Organizing the tree in supernodes thus has advantages from the point of view of the number of accesses as well as in space.

5.5 PAT TREES REPRESENTED AS ARRAYS
The previous implementation has a parameter: b, the external node bucket size. When a search reaches a bucket, we have to scan all the external nodes in the bucket to determine which, if any, satisfy the search; if the bucket is too large, these costs become prohibitive. However, if the external nodes in the bucket are kept in the same relative order as they would be in the tree, then we do not need to do a sequential search: we could do an indirect binary search (i.e., compare the sistrings referred to by the external nodes) to find the nodes that satisfy the search. In this way, the cost of searching a bucket becomes 2 log b - 1 comparisons instead of b. Although this is not significant for small buckets, it is a crucial observation that allows us to develop another implementation of PAT trees: simply let the bucket grow well beyond normal bucket sizes, including the option of letting it be equal to n. In this case, the whole index degenerates into a single array of external nodes ordered lexicographically by sistrings. Figure 5.4 shows this structure. Actually, this idea was independently discovered by Manber and Myers (1990), who called the structures suffix arrays.

Figure 5.4: A PAT array

There is a straightforward argument that shows that these arrays contain most of the information we had in the Patricia tree, at the cost of a factor of log2 n. The argument simply says that for any interval in the array which contains all the external nodes that would be in a subtree, in log2 n comparisons in the worst case we can divide the interval according to the next bit which is different. The sizes of the subtrees are trivially obtained from the limits of any portion of the array, so the only information that is missing is the longest-repetition bit, which is not possible to represent without an additional structure. Consequently, any operation on a Patricia tree can be simulated in O(log n) accesses.

5.5.1 Searching PAT Trees as Arrays

It turns out that it is not necessary to simulate the Patricia tree for prefix and range searching. Both can be implemented by doing an indirect binary search over the array, with the results of the comparisons being less than, equal (or included, in the case of range searching), and greater than. In this way, prefix searching and range searching become more uniform, and we obtain an algorithm which is O(log n) instead of O(log2 n) for these operations. Actually, the searching takes at most 2 log2 n - 1 comparisons and 4 log2 n disk accesses in the worst case. With some additions, the "most frequent" searching can also be improved, but its discussion becomes too technical and falls outside the scope of this section. The same can be said about "longest repetition," which requires additional supporting data structures. For regular expression searching, the time increases by an O(log n) factor. Details about proximity searching are given by Manber and Baeza-Yates (1991).
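The indirect binary search for prefix searching can be sketched as follows. The function name is ours and the array is built by naive sorting for illustration; note that sistrings are always compared through the text, never materialized, and the interval length directly gives the subtree size.

```python
def prefix_search(text, pat, prefix):
    """Positions of all sistrings beginning with prefix.

    pat is the PAT array: sistring start positions in lexicographic order.
    Two indirect binary searches delimit the answer interval.
    """
    k = len(prefix)
    lo, hi = 0, len(pat)
    while lo < hi:                                # first sistring >= prefix
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + k] < prefix:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(pat)
    while lo < hi:                                # first sistring > prefix
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + k] <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return pat[left:lo]                           # interval length = subtree size

text = "abracadabra"
pat = sorted(range(len(text)), key=lambda i: text[i:])
print(sorted(prefix_search(text, pat, "abra")))   # [0, 7]
```

Range searching is the same idea with one binary search per endpoint of the range.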
In summary, prefix searching and range searching can be done in time O(log2 n) with a storage cost that is exactly one word (a sistring pointer) per index point (sistring). This represents a significant economy in space at the cost of a modest deterioration in access time.

5.5.2 Building PAT Trees as Arrays

A historical note is worth entering at this point. Most of the research on this structure was done within the Centre for the New Oxford English Dictionary at the University of Waterloo, which was in charge of researching and implementing different aspects of the computerization of the OED from 1985. At about this time, there was considerable interest in indexing the OED to provide fast searching of its 600Mb of text. As it turned out, the dictionary had about n = 119,000,000 index points, and our computer systems would give us about 30 random accesses to disk per second. Standard building of a Patricia tree in any of its forms would have required about n log2(n) t, where n is the number of index points and t is the time for a random access to disk. That is, the total disk time would be about 119,000,000 × 27/30 seconds, that is, 29,750 hours, or about 3.4 years. Even if we were using an algorithm that used a single random access per entry point (a very minimal requirement!), we would need 119,000,000/(30 × 60 × 60 × 24) = 45.9 days, still not acceptable in practical terms. Clearly we need better algorithms than the "optimal" algorithm here; it is clear from these numbers that we have to investigate algorithms that do not do random access to the text. Note that here we are talking about main memory: if paging were used to simulate a larger memory, the random access patterns over the text would certainly cause severe memory thrashing, and we would still be building the index for the OED.

We would like to acknowledge this indirect contribution by the OED: without this real test case, we would never have realized how difficult the problem was and how much more work had to be done to solve it in a reasonable time. To conclude this note, we would say that we continue to search for better building algorithms, although we can presently build the index for the OED during a weekend.

This subsection will be divided in two. First we will present the building operations which can be done efficiently, and second, two of the most prominent algorithms for large index building, which work on different principles.

Building PAT arrays in memory

If a portion of the text is small enough to fit in main memory together with its PAT array, then this process can be done very efficiently, as it is equivalent to string sorting. Quicksort is an appropriate algorithm for this building phase, since it has an almost sequential pattern of access over the sorted file. For maximal results we can put all of the index array on external storage and apply external quicksort (see Gonnet and Baeza-Yates, 2nd ed., section 4.5 [1984]) indirectly over the text. In this case it is possible to build an index for any text which, together with the program, can fit in main memory.
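Building the array in memory really is just string sorting. A toy Python version (a production builder would sort pointers over the text without materializing suffix slices, to avoid quadratic space):

```python
def build_pat_array(text):
    """PAT (suffix) array: sistring start positions in lexicographic order.

    Equivalent to sorting the strings text[i:] and keeping only the i's.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

sa = build_pat_array("abracadabra")
print(sa)  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```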
With today's memory sizes, this is not a case to ignore, and the behavior of this method is not only acceptable but exceptionally good. It is the algorithm of choice for small files, and also a building block for other algorithms.

Merging small against large PAT arrays

A second case that can be solved efficiently is that of merging two indices (to produce a single one) when the text plus twice the index of one of them fits in main memory. This algorithm is not trivial and deserves a short explanation.

The text of the small file, together with a PAT array for the small file (of size n1) plus an integer array of size n1 + 1, are kept in main memory. The integer array is used to count how many sistrings of the big file fall between each pair of index points in the small file (see Figure 5.5). To do this counting, the large file is read sequentially, and each sistring is searched for in the PAT array of the small file until it is located between a pair of points in the index; the corresponding counter is incremented. This step will require O(n2 log n1) comparisons and O(n2) characters to be read sequentially. Once the counting is finished, the merging takes place by reading the PAT array of the large file and inserting the PAT array of the small file guided by the counts (see Figure 5.6). This will require a sequential reading of n1 + n2 words. In total, this algorithm performs a linear amount of sequential input and output and O(n2 log n1) internal work.

Figure 5.5: Small index in main memory

Figure 5.6: Merging the small and the large index

Given these simple and efficient building blocks, we can design a general index building algorithm. First we split the text file into pieces, the first piece being as large as possible so that its index can be built in main memory; the remaining pieces are as large as possible to allow merging via the previous algorithm (small against large). Then we build indices for all these parts and merge each part.

An improvement may be made by noticing that index points at the front of the text will be merged many times into new points being added at the end. We take advantage of this by constructing partial indices on blocks of text one half the size of memory. These indices may be merged with each other, the entire merge taking place in memory. The merged index is not created at this point, since it would fill memory and could not be merged with any further index. Instead, a vector of counters is kept, indicating how many entries of the first index fall between each pair of adjacent entries in the second, and the counters are written to a file. When the nth block of text is being indexed, the n - 1 previous indices are merged with it, and the counters are accumulated with each merge. When all of the blocks have been indexed and merged, the files of counters are used as instructions to merge all the partial indices.
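The counting merge can be sketched over two already-sorted lists standing in for the small and large indices (this abstraction and the names are ours; the real algorithm compares sistrings through the two texts). Phase 1 counts how many large entries fall before each small entry; phase 2 interleaves both sequentially, guided by the counts.

```python
import bisect

def merge_by_counting(small, large):
    """Merge two sorted sequences: 'small' is held 'in memory',
    'large' is read sequentially, first to count, then to interleave."""
    counts = [0] * (len(small) + 1)
    for x in large:                       # counting pass: n2 log n1 comparisons
        counts[bisect.bisect_left(small, x)] += 1
    merged, j = [], 0
    for i, s in enumerate(small):         # merge pass: purely sequential
        merged.extend(large[j:j + counts[i]])
        j += counts[i]
        merged.append(s)
    merged.extend(large[j:])              # large entries after the last small one
    return merged

print(merge_by_counting([2, 5, 9], [1, 3, 4, 8, 10]))  # [1, 2, 3, 4, 5, 8, 9, 10]
```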
The number of parts is O(n / m), where m is the total amount of available memory; overall the algorithm requires O(n2 / m) sequential accesses and O(n2 log n / m) time. Given the sizes of present-day memories and the difference in accessing time for random versus sequential access, the above quadratic algorithm would beat a linear algorithm even for databases like the Oxford English Dictionary!

An interesting property of the last version of the algorithm is that each of the merges of partial indices is independent; thus, all of the O((n / m)2) merges may be run in parallel on separate processors. After all merges are complete, the counters for each partial index must be accumulated and the final index merge may be done.

Another practical advantage of this algorithm is that the final index merge is done without reference to the text. In situations where the text, the index building program, the sum of the partial indices, and the final index may each be one or more gigabytes, this is an important consideration: the text may be removed from the system, and reloaded after removal of the partial indices, at the time that the final merge is being done.

5.5.3 Delayed Reading Paradigm

The fact that comparing sistrings through random access to disk is so expensive is a consequence of two phenomena. First, the reading itself requires some (relatively slow) physical movement. Second, the amount of text actually used is a small fraction of the text made available by an I/O operation. Here we will describe a programming technique that tries to alleviate the above problem without altering the underlying algorithms. The technique is simply to suspend execution of the program every time a random input is required, store these requests in a request pool, and, when convenient or necessary, satisfy all requests in the best ordering possible. This technique works on algorithms that do not block on a particular I/O request but can continue with other branches of execution, that is, algorithms with a sufficiently high degree of parallelism.

More explicitly, we have the following modules:

a. the index building program

b. a list of blocked requests

c. a list of satisfied requests

The c list is processed in a certain order. This ordering is given inversely by the likelihood of a request to generate more requests: the requests likely to generate fewer requests are processed first. Whenever the c list is exhausted, or the available memory for requests is exhausted, the b list is sorted for the best I/O performance and all the I/O is performed.
5.5.4 Merging Large Files

We have seen an algorithm for merging a large file against a small file, small meaning that it fits, with its index, in memory. We may also need to merge two or more large indices, whether they are existing large indices for separate text files, being merged to allow simultaneous searching, or whether they are partial indices for a large text file, each constructed in memory and being merged to produce a final index. In fact, the algorithm described here will work for the general case of merging indices.

The sistring pointers in a PAT array are ordered lexicographically according to the text that they reference, and so will be more or less randomly ordered by location in the text. Thus, processing them sequentially one at a time necessitates random access to the text. If n is 100,000,000 and we can do 50 random disk accesses per second, then this approach will take 2,000,000 seconds, or about 23 days, to merge the files.

An improvement can be made by reducing the number of random disk accesses and increasing the amount of sequential disk I/O. If we can read a sufficiently large block of pointers at one time and indirectly sort them by location in the text, then we can read a large number of keys from the text in a sequential pass. The keys are then written out to temporary disk space. These keys are then merged by making a sequential pass over all the temporary files, using a heap to organize the current keys for each file being merged. If we have n keys and m files, this gives us O(n log m) comparisons. The larger the number of keys read in a sequential pass, the greater the improvement in performance; the greatest improvement is achieved when the entire memory is allocated to this key reading for each merge file in turn. However, there are two constraints affecting this algorithm: the size of memory and the amount of temporary disk space available. They must be balanced by the relationship

memory × number of files = temporary disk space

A second problem inherent with sistrings is that we do not know how many characters will be necessary to compare the keys; this depends on the text being indexed. In the case of the OED, the longest comparison necessary was 600 characters, and clearly it would be wasteful to read (and write) 600 characters for each key being put in a temporary file. Instead, we used 48 characters, a number that allowed about 97 percent of all comparisons to be resolved; the others were flagged as unresolved in the final index and fixed in a later pass. With 48 characters per key it was possible to read 600,000 keys into a 32Mb memory. For a 600Mb text, this means that on average we are reading 1 key per kilobyte; each reading of the file needs between 30 and 45 minutes, and for 120,000,000 index points it takes 200 passes. Even so, this algorithm is not as effective as the large against small indexing described earlier.
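The heap-organized sequential merge of the temporary key files can be sketched with sorted in-memory runs standing in for the files (that simplification and the names are ours). With n keys and m runs, this performs the O(n log m) comparisons quoted above.

```python
import heapq

def kway_merge(runs):
    """Merge m sorted runs with a heap holding the current key of each run."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, j = heapq.heappop(heap)   # smallest current key across all runs
        out.append(key)
        if j + 1 < len(runs[i]):          # advance within that run, if possible
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(kway_merge([["ant", "dog"], ["bee", "cow"], ["eel"]]))
# ['ant', 'bee', 'cow', 'dog', 'eel']
```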
Since the partial indices were constructed in memory, we know that the entire text fits in memory; when we are reading keys, the entire text may be loaded into memory, and there are some further improvements that can be made. One improvement comes from noticing that lexicographically adjacent keys will have common prefixes of some length; we can use this "stemming" to reduce the data written out for each key. Keys may then be written out to fill all available temporary disk space, without the constraint that the keys to be written must fit in memory, increasing the number of keys that can be merged in each pass.

5.6 SUMMARY

We finish by comparing PAT arrays with two other kinds of indices: signature files and inverted files.

Signature files use hashing techniques to produce an index of between 10 percent and 20 percent of the text size. The storage overhead is small; however, there are two problems. First, the search time on the index is linear, which is a drawback for large texts. Second, we may find some answers that do not match the query, so some kind of filtering must be done (taking time proportional to the size of the enlarged answer). Typically, false matches are not frequent, but they do occur.

On the other hand, inverted files need a storage overhead varying from 30 percent to 100 percent (depending on the data structure and the use of stopwords), and the search time for word searches is logarithmic. Similar performance can be achieved by PAT arrays. The big plus of PAT arrays is their potential use in other kinds of searches that are either difficult or very inefficient over inverted files. That is the case with searching for phrases (especially those containing frequently occurring words), regular expression searching, approximate string searching, longest repetitions, most frequent searching, and so on. On the other hand, the full use of this kind of index is still an open problem.

REFERENCES

AHO, A., J. HOPCROFT, and J. ULLMAN. 1974. The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley.

APOSTOLICO, A., and W. SZPANKOWSKI. 1987. "Self-alignments in Words and their Applications." Technical Report CSD-TR-732, Department of Computer Science, Purdue University, West Lafayette, Ind., 47907.

BAEZA-YATES, R., and G. GONNET. 1989. "Efficient Text Searching of Regular Expressions," in ICALP'89, eds. G. Ausiello, M. Dezani-Ciancaglini, and S. Ronchi Della Rocca, Lecture Notes in Computer Science 372, pp. 46-62. Stresa, Italy: Springer-Verlag. Also as UW Centre for the New OED Report, OED-89-01, University of Waterloo.

FAWCETT, H. 1989. A Text Searching System: PAT 3.1, User's Guide. Centre for the New Oxford English Dictionary, University of Waterloo.

FLAJOLET, P., and R. SEDGEWICK. 1986. "Digital Search Trees Revisited." SIAM J. Computing, 15, 748-67.

GONNET, G. 1983. "Unstructured Data Bases or Very Efficient Text Searching," in ACM PODS, vol. 2, pp. 117-24. Atlanta, Ga.

GONNET, G. 1984. Handbook of Algorithms and Data Structures. London: Addison-Wesley.

GONNET, G. 1987. "PAT 3.1: An Efficient Text Searching System. User's Manual." UW Centre for the New OED, University of Waterloo.

GONNET, G. 1988. "Efficient Searching of Text and Pictures (extended abstract)." Technical Report OED-88-02, UW Centre for the New OED, University of Waterloo.

HOPCROFT, J., and J. ULLMAN. 1979. Introduction to Automata Theory. Reading, Mass.: Addison-Wesley.

KNUTH, D. 1973. The Art of Computer Programming: Sorting and Searching, vol. 3. Reading, Mass.: Addison-Wesley.

MANBER, U., and R. BAEZA-YATES. 1991. "An Algorithm for String Matching with a Sequence of Don't Cares." Information Processing Letters, 37, 133-36, April.

MANBER, U., and G. MYERS. 1990. "Suffix Arrays: A New Method for On-line String Searches," in 1st ACM-SIAM Symposium on Discrete Algorithms, pp. 319-27. San Francisco.

MORRISON, D. 1968. "PATRICIA--Practical Algorithm to Retrieve Information Coded in Alphanumeric." JACM, 15, 514-34.

PITTEL, B. 1985. "Asymptotical Growth of a Class of Random Trees." The Annals of Probability, 13, 414-27.
CHAPTER 6: FILE ORGANIZATIONS FOR OPTICAL DISKS

Daniel Alexander Ford and Stavros Christodoulakis

Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Abstract

Optical disk technology is a new and promising secondary storage technology. Optical disks have immense capacities and very fast retrieval performance; they are also rugged and have very long storage lifetimes. These characteristics are making them a serious threat to the traditional dominance of magnetic disks. It is important to understand and study this young and significant technology, and to design retrieval structures that best utilize its characteristics. This chapter first presents a tutorial on optical disk technology and a discussion of important technical issues of file system design for optical disks. It then proceeds to discuss six file systems that have been developed for optical disks, including the Write-Once B-Tree of Easton (1986), the Time Split B-Tree of Lomet and Salzberg (1989), the Compact Disk File System of Garfinkel (1986), the Optical File Cabinet of Gait (1988), and the Buffered Hashing of Christodoulakis and Ford (1989a) and BIM trees of Christodoulakis and Ford (1989b).

6.1 INTRODUCTION

In this section we discuss file structures for optical disks. We first present an overview of optical disk technology, explaining where it came from and how it works. We then discuss some technical issues affecting the implementation of file structures on some common forms of optical disk technology. Later, we discuss in detail a variety of different structures that have been developed, many of which are quite subtle.

6.2 OVERVIEW OF OPTICAL DISK TECHNOLOGY

The foundation for all current optical disk technology was formed in the late 1960s and early 1970s by early video disk research. In mid-1971, Philips began conducting experiments in recording video signals on a flat glass plate using a spiral track of optically detectable depressions. This system was refined until it could store 30 minutes of color video and sound; it was first demonstrated on September 5, 1972, and was called Video Long Play (VLP). At about this time, other companies also began research efforts, including Thomson-CSF, 3-M, KODAK and Harris, among others.
The development of small inexpensive semiconductor laser diodes in 1975 stimulated development further, and the use of plastic disk platters, which were cheaper and more easily replicated than glass platters, was pioneered. Eventually, the technological base for the development of the optical disks in use today emerged from the research efforts in the fields of optics, disk material technology, tracking and focus control servo systems, and lasers.

Features and benefits

Optical disks have many features and benefits that make them a useful storage medium. In general, their immense capacity, fast random access, and long storage life are major advantages. Optical disks also have the advantage of being removable; disks usually come encased in a protective cassette that can be easily carried.

Other advantages are their portability and durability. They are not subject to wear or head crashes, since there is no physical contact between an optical disk platter and the access mechanism. In fact, the disk is not subject to wear with use at all (the head of a Winchester type magnetic disk actually rides on the lubricated disk surface before it reaches flying speed). WORM and erasable disks are encased in a sturdy protective cassette when not in the drive and can be easily and safely transported.

The integrity of data stored on optical disks is also impressive. The recording surface itself is not exposed, but rather is protected by the plastic substrate that forms the disk platter. Small scratches and dust on the disk surface do not affect stored data as they are far enough away from the recording surface to be out of the optical system's focal plane. In general, optical disks, including erasable optical disks which employ a magneto-optic process for data registration, are also completely unaffected by magnetic fields. Data is thus not subject to destruction from external magnetic fields and does not require periodic rewriting, as is the case for magnetic media. The expected lifetime of an optical disk is not really known for certain (they have not been around long enough), but accelerated aging tests place it at least ten years and possibly as high as thirty, depending how it is handled and stored. In contrast, the lifetime of magnetic tape is between two years and ten years.

There are few disadvantages inherent to optical disk technology. In fact, some characteristics that might be considered undesirable are actually advantages in certain applications. The process by which data is recorded on a Write-Once Read Many (WORM) disk surface causes an irreversible physical change; there is no chance of accidental erasure, and overwrites are usually prevented by the drive or by driver software. CD-ROM (Compact Disk Read Only Memory) disks, which are designed to endure unrestrained consumer use, are physically pressed out of plastic and so cannot be written to, or erased. The unerasability of WORM type optical disks makes them the storage medium of choice for archival type applications. Some records, such as transaction logs, banking records, and school transcripts, are never legitimately altered, so some systems that maintain these types of records use WORM disks to ensure this. Erasable optical disks cannot prevent accidental or malicious data destruction, as can WORM and CD-ROM disks, but they are still more durable than magnetic media.
Technical description

Optical disks are typically available in two different access configurations: either mounted in a single disk drive or as a set of disks in a "jukebox." In the single drive arrangement, disks are mounted manually by inserting the disk (and its protective cassette for WORM and erasable disks) into a slot on the front of the drive. When mounted, the disk is clamped to a spindle that protrudes through a large hole in its center. To remove the disk, the process is reversed; the platter is actually released and the cassette withdrawn. In the jukebox arrangement, the disks are stored in groups of several tens or even hundreds, representing many gigabytes of storage. They are selected and mounted mechanically in a manner similar to that used in audio disk jukeboxes. The time needed to switch a disk is about 5 to 20 seconds.

Note that only one side of the disk surface can be accessed when the disk is mounted. To access the other side, the disk must be physically removed, turned over, and remounted. CD-ROM disks are single sided and cannot be accessed if improperly mounted. For both types of configurations, the drive, associated access mechanisms, and optics are the same. The access mechanism is usually a sled that slides along a guide path beneath the rotating disk platter. The sled contains the laser and associated optics. Electrical interfacing to the disk drives is usually done via standard interfaces such as the Small Computer System Interface (SCSI) or the Enhanced Small Device Interface (ESDI), or, more commonly for small personal computers, through a direct bus interface.

There are three common sizes and storage capacities of optical disks available. The 305 millimeter (12 inch) platter is the most widely used for WORM disks and has a capacity of roughly 1 gigabyte per disk side. WORM disks are also available in 203 millimeter (8 inch, 750 megabytes) and 130 millimeter (5.25 inch, 200 megabytes) sizes. Erasable disks are available in 130 millimeter (5.25 inch, 600 megabytes) and 86, 87, 88, or 89 millimeter (3.5 inch, 80 megabytes) sizes. The CD-ROM disk platters are standardized at 130 millimeters. The exact capacity of a disk will depend on the recording format employed (discussed later); some formats will increase the values stated above.

The way in which a disk platter is constructed varies with the type of optical disk. WORM disks usually use a construction known as the "air sandwich." This fabrication technique joins two transparent 1.5 millimeter thick platter halves together, leaving a small "clean room" air gap sandwiched between them where the recording surfaces reside. The two halves both support and protect the recording surfaces while allowing light from the laser to reach them. The final disk platter is about 3 to 4 millimeters thick. Erasable disks are fabricated in a similar manner.

CD-ROM disks are single sided and are essentially a smaller version of one of the platter halves used for WORM disks. Data is permanently registered on a CD-ROM disk when it is fabricated by a pressing process. After pressing, the disk is given a thin coating of aluminum to make it reflective and is then sealed.
Materials and registration techniques for WORM disks

A single-sided optical disk platter, which is one side of an "air sandwich," consists of a thin film of Tellurium alloy (10-50 nanometers) that forms the active recording surface, and a supporting plastic substrate, usually poly(vinyl chloride) or poly(methyl methacrylate). The disk is formed using an injection molding process. It is during the molding process that the disk is pregrooved and sometimes also preformatted with address and synchronization information. These steps are important as they define the positions of tracks and identify individual sectors on what would otherwise be a featureless surface. This is done under clean room conditions, as the high storage densities of optical disks make them particularly sensitive to contaminants like dust.

Data is recorded on the disk by a series of optically detectable changes to its surface. Small changes in reflectivity called "pits" are thermally induced in the active layer through the application of energy from a laser. The unaltered spaces between pits are called "lands." It takes 50-100 nanoseconds of 10 milliwatts of incident power focused on a 1 micron diameter spot to form a pit. Spacing between adjacent tracks of pits and lands is 1.6 microns.

There are several methods by which the applied energy can form pits in the active layer. Of these, the techniques called Ablation and Vesicular are favored. With the Ablation technique, surface tension causes the thin film to pull away from the spot heated by the laser, leaving a hole in the surface. Sometimes, a layer of aluminum is deposited beneath the tellurium film to act as a mirror to be exposed when the hole is formed. With the Vesicular technique, a bubble is formed in the recording surface when vapor resulting from the heating process is trapped under the film. The bubble forms a bump that tends to disperse the light hitting it (making the pit darker).

Materials and registration techniques for erasable disks

Erasable optical disks are physically much like WORM disks. There are two types of erasable optical disk: Phase-Change and Magneto-Optic. The main difference between the two is in the coating used on the recording surface.

Erasable optical disks that employ phase-change technology rely on coatings for the recording surface consisting of thin films of tellurium or selenium. These coatings have the ability to exist in two different optically detectable states, amorphous and crystalline, and will switch between the two when heated to two different temperatures. If a spot on the recording surface is heated to a low temperature with a laser with 8 milliwatts of power, the spot will crystallize. If it is heated to a higher temperature with a laser with 18 milliwatts of power, the spot will melt and, when it cools, will revitrify to the amorphous state.
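The heating behavior just described amounts to a tiny state machine: heating a spot with 18 milliwatts always leaves it amorphous, heating it with 8 milliwatts always leaves it crystalline, and lower powers leave it unchanged. A minimal sketch of this, in illustrative Python (the function and state names are ours, not from the chapter):

```python
# Toy state machine for a phase-change recording spot. The next state
# depends only on the laser power applied: 18 mW melts the spot, which
# revitrifies to the amorphous state; 8 mW anneals it to crystalline.

def next_state(state, milliwatts):
    if milliwatts == 18:
        return 'amorphous'       # melt, then cool back to amorphous
    if milliwatts == 8:
        return 'crystalline'     # low heat crystallizes the spot
    return state                 # low read power changes nothing

spot = 'crystalline'
spot = next_state(spot, 18)
assert spot == 'amorphous'
spot = next_state(spot, 8)
assert spot == 'crystalline'
assert next_state(spot, 1) == 'crystalline'   # reading is non-destructive
```

Because the resulting state depends only on the power applied, and not on the previous state, a sector can be rewritten in a single pass.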
To register data on the recording surface, the power of the laser scanning the disk is simply modulated between 8 and 18 milliwatts. If a crystallized spot is scanned with 18 milliwatts of power it will switch to the amorphous state; if it is scanned with 8 milliwatts of power, it will remain crystallized. Similarly, if an amorphous spot is scanned with 18 milliwatts of power, it will remain in the amorphous state, and if it is scanned with 8 milliwatts of power it will switch to the crystallized state. Reading data from the disk is simply a matter of scanning the surface with a low power laser (1 milliwatt) and detecting the changes in reflectivity that exist on the recording surface.

Erasable optical disks that employ magneto-optic technology store data on the disk magnetically, but read and write it optically. The coating used on the recording surface is a rare-earth transition-metal alloy such as terbium iron cobalt, terbium iron, or gadolinium terbium iron. It has the property of allowing the polarity of its magnetization to be changed when it is heated to a certain temperature (150 C). The main stumbling block in the development of erasable magneto-optic disk technology was the chemical instability of the coating caused by its repeated heating and cooling during the write-erase cycle. This instability limited the number of cycles that could occur and was caused by the high temperature (600 C) required to change the magnetization of a domain. The development of newer coatings that require lower temperatures (150 C) solved this problem. Current erasable magneto-optic disks now allow some ten million or more write-erase cycles.

Recording data is a two-stage process requiring two passes of the disk head over a sector. The first pass serves to erase the contents of the sector. This is done by first placing a magnetic field with north-pole down in the vicinity of the spot upon which the laser focuses. As the sector is scanned, this spot will quickly heat to 150 C and then immediately cool. As it does, domains of north-pole down are recorded throughout the sector. Once the sector has been erased by the first pass, data can be written by the second. When the sector is scanned a second time, the applied magnetic field is reversed. By modulating the power of the laser, selected portions of the sector can be heated to the required temperature and have their magnetizations reversed to north-pole up. The remaining unheated portions of the sector retain their north-pole down magnetization. A north-pole up domain represents a 1 bit; a north-pole down domain represents a 0 bit.

Reading relies on a physical effect known as the Kerr magneto-optic effect, which was discovered by Kerr (1877) and which causes light passing through a magnetic field to become elliptically polarized. This effect allows the magnetization of a spot or domain of the disk surface to be detected optically. To read a disk sector, it is scanned with the laser in a lower power mode (1 milliwatt) than used for writing (8 milliwatts). If the magnetization of a domain being scanned is north-pole-up, the polarization of the light reflected from the surface will be rotated clockwise; if north-pole-down, counter-clockwise. Recorded data is read from the disk in a single pass. The sequence of polarity changes is detected and interpreted to produce a bit stream.
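The two-pass magneto-optic write can be summarized as a toy simulation (hypothetical Python, not drive firmware; domain and function names are ours): the erase pass forces every domain north-pole down under one bias field, and the write pass, under the reversed field, heats only the positions whose bit should be a 1.

```python
# Toy model of a magneto-optic sector: each domain is 'U' (north-pole up,
# a 1 bit) or 'D' (north-pole down, a 0 bit). Sizes are illustrative.

def erase_pass(sector):
    # Bias field north-pole down; the laser heats every domain past 150 C,
    # leaving the whole sector magnetized north-pole down.
    return ['D'] * len(sector)

def write_pass(sector, bits):
    # Bias field reversed; the laser is modulated so that only the domains
    # carrying a 1 bit are heated and flip to north-pole up.
    return ['U' if b else 'D' for b in bits]

def read_pass(sector):
    # Kerr effect: reflected polarization rotates one way for 'U' domains
    # and the other way for 'D'; interpret the sequence as a bit stream.
    return [1 if d == 'U' else 0 for d in sector]

data = [1, 0, 1, 1, 0, 0, 1, 0]
sector = ['U', 'D', 'D', 'U', 'U', 'U', 'D', 'D']   # stale contents
sector = erase_pass(sector)                          # first pass
sector = write_pass(sector, data)                    # second pass
assert read_pass(sector) == data
```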
There are performance differences between the two types of erasable technologies. The Kerr effect only causes about a 1 percent change in the polarization of the reflected light. This requires more optics and associated hardware in the disk head of a magneto-optic disk drive to detect such a small change. The reflectivity difference between the two states of phase-change type erasable optical disks is relatively large and much easier to detect, so the disk head can be much simpler. The net result is that the seek performance of a magneto-optic disk drive will generally be poorer than that of a phase-change drive because of its more massive disk head.

Lasers and light paths

The light source in an optical disk drive is a Gallium Aluminum Arsenide (GaAlAs) laser diode with a wavelength of about 800 nanometers (.8 microns). The first experimental drives employed gas lasers (HeNe), which have a shorter wavelength and allowed higher storage densities and data transfer rates, but they were later abandoned in favor of the longer wavelength semiconductor laser diodes, which are cheaper and small enough to be mounted in the optical assembly. Laser diodes also have the advantage of being modulated electrically, eliminating the need for the expensive external acoustooptic or electrooptic modulator required by the gas lasers. The recording density is limited by the wavelength of the laser, because the wavelength determines the size of the smallest feature that can be resolved on the disk surface. The shorter the wavelength, the smaller the feature and, hence, the higher the possible recording density.

Optics

The optical assemblies found in all types of optical disk drives are similar in nature and resemble those of a medium power microscope. The task of the assembly is to focus a beam of coherent light from a semiconductor laser on to a 1 micron size spot and provide a return path for reflected light to reach a photodetector. This is a difficult feat to accomplish, as the requirements for economical mass production of disks and drives imply a certain degree of flexibility in their precision. As such, it cannot be guaranteed that the disk platter will be perfectly flat or round, or that the hole for the drive's spindle will be perfectly centered. These imperfections can cause the outer edge of the disk to move up and down as much as 1 millimeter, and the disk itself side-to-side as much as 60 micrometers (37 tracks), as it rotates several times a second. These motions cause the position of a track to vary with time in three dimensions and require the optical assemblies to be much more than a simple arrangement of lenses. To follow the moving track, the objective lens is encased in a voice-coil like arrangement that allows it to be moved up and down to adjust its focus, and from side to side to follow the wandering track. This radial flexibility gives most optical disk drives the ability to quickly access more than one track on the disk from a single position of the access mechanism, simply by adjusting the position of the objective lens. This viewing capability is usually limited to a window of some 10 to 20 tracks on either side of the current position of the access mechanism.
The light path in the optical assembly consists of a collimating lens, a polarizing beam splitter, a quarter-wave plate, and an objective lens. The collimating lens takes the highly divergent light from the diode and forms a straight directed beam. This beam passes unchanged through the beam splitter and on to the quarter-wave plate, which rotates its polarization by 90 degrees (a quarter of a wave). The altered beam then enters the objective lens, which focuses it on the disk surface to the required 1 micron size spot. On the return path, light reflected from the disk surface passes back through the objective lens and on again through the quarter-wave plate, where its polarization is given another 90 degree twist. Now 180 degrees out of phase, the returning beam is not passed by the beam splitter but is instead reflected perpendicularly toward the photodetector.

Recording formats

While optical disks as a family share similar optical assemblies, disk fabrication materials, and techniques, they can differ considerably in their recording formats. There are four different formats now in current use. The most common are the Constant Angular Velocity (CAV) and the Constant Linear Velocity (CLV) formats. The other two are modified versions of the above called, appropriately enough, Modified Constant Angular Velocity (MCAV) and Modified Constant Linear Velocity (MCLV, also called Quantized Linear Velocity--QLV).

In the CAV format, the sequences of pits and lands are usually arranged into equal capacity concentric tracks and the disk drive rotates the disk at a constant rate (angular velocity). This causes the length of both the pits and the lands to become elongated as their position moves away from the center of the disk, because the outer surface of the disk platter passes beneath the optical assembly at a faster linear rate than does the inner surface. This elongation causes the storage density on the surface of a CAV format WORM disk to be less at the outer edge of the disk than at the center.

In the CLV format, the sequence of pits and lands forms a single spiral track, and the rate at which the disk platter rotates is adjusted by the drive to match the position of the optical assembly. The disk rotates faster when accessing the inner surface and slower when accessing the outer surface. This adjustment ensures that the disk surface passes beneath the assembly and its optics at a constant rate (linear velocity). Adjusting the rotation rate prevents the pits and lands from becoming elongated and results in a constant storage density across the surface of the disk.

The two formats, CAV and CLV, have complementary advantages and disadvantages. CLV disks have a greater storage capacity than CAV disks, but they also tend to have slower seek times. This is a consequence of the extra time required to accelerate the disk to the rotation rate that matches the new position of the access mechanism. So while a CAV disk may store less data than a CLV disk, accesses to that data are slightly faster.
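The capacity difference between the two formats can be estimated with a little arithmetic. In the sketch below, the 1.6 micron track pitch comes from the text, but the usable radii and the 1 micron of track length per recorded bit are illustrative assumptions, not figures from the chapter:

```python
import math

# Rough capacity comparison of CAV vs. CLV recording on the same platter.
pitch = 1.6e-6                   # meters between adjacent tracks (from text)
bit_len = 1.0e-6                 # meters of track per bit (assumed)
r_inner, r_outer = 0.04, 0.14    # usable recording band in meters (assumed)

n_tracks = int((r_outer - r_inner) / pitch)

# CAV: every concentric track holds only what fits at the innermost radius,
# since pits elongate toward the edge at constant angular velocity.
cav_bits = n_tracks * (2 * math.pi * r_inner / bit_len)

# CLV: the spiral is recorded at constant linear density, so capacity is
# proportional to the total recorded area.
clv_bits = math.pi * (r_outer**2 - r_inner**2) / (pitch * bit_len)

print("CLV/CAV capacity ratio: %.2f" % (clv_bits / cav_bits))
```

With these assumptions the ratio reduces to (r_outer + r_inner) / (2 * r_inner), so the further the recording band extends beyond the inner radius, the larger the CLV advantage.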
The Modified CAV and Modified CLV formats each combine features of the CAV and CLV formats in an attempt to obtain greater storage capacities and better seek times. A disk employing the MCAV format rotates at a constant rate and also has a nearly constant storage density across its surface. Its concentric tracks are divided into equal track capacity bands. Each band has one more sector in each of its tracks than the next innermost band that it surrounds. A clocking scheme adjusts to the varying linear rates at which pits and lands pass beneath the optical assembly. The MCLV scheme is similar except that the disk platter rotates at a different rate for each band. All the formats are available for WORM disks. Erasable disks are available in the CAV and MCAV formats. CD-ROM disks are standardized and use only the CLV format.

6.3 FILE SYSTEMS

Virtually all of the research into file systems for optical disks has concentrated on WORM and CDROM optical disks. These are the oldest forms of optical disk technology and the types with the most differences from magnetic disks. Commercially available erasable optical disks are a relatively recent phenomenon and are close enough in capabilities to magnetic disks that conventional file systems can usually be adapted. Thus, we concentrate our discussion on file systems developed for WORM and CDROM optical disks. But, before we proceed, we first present a short discussion on some technical issues affecting the implementation of file structures on optical disks.

6.3.1 Technical Issues for File Systems

Optical disk technology is similar enough to magnetic disk technology that the same types of file structures that are used on magnetic disks can usually be used in some form on optical disks. There are, however, some differences that can make some choices better than others.

In general, conventional pointer linked file structures (e.g., B-trees) are a poor choice for WORM optical disks. The reason for this is that each modification of the file usually requires some of the pointers linking the file structure together to change; rebalancing a B-tree after an insert or delete is a good example. If an element or node is modified in the course of normal file maintenance operations, then it must be stored on the disk in a new position, and all pointers to the position of the old version of the node must be updated to point to the position of the new version. Since storage space cannot be reclaimed on a WORM optical disk, changing the value of a pointer requires the new value to be stored in a new disk sector. Changing those pointers in turn changes their positions on the disk, requiring any pointers to their old positions to be changed as well; thus, extra storage space is consumed. This type of problem is present when linked lists or trees are used (as is true for B-trees). This means that if an element of a list is modified, all elements between it and the head of the list must be duplicated, consuming space. The same is true for trees: if a node is modified, all nodes on the path up to, and including, the root, must be duplicated.
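The duplication cascade described above can be made concrete with a small sketch (hypothetical Python; the `burn` helper and node layout are illustrative, not from any particular system). On an append-only medium, where a pointer is simply the address of a written sector, rewriting one node forces new copies of every node on the path back to the root, while the old versions remain readable:

```python
# Path copying on a WORM-like store: sectors can only be appended, never
# overwritten, so every updated node gets a fresh sector address.

disk = []                        # the write-once medium: append-only list

def burn(node):
    disk.append(node)            # a sector, once written, never changes
    return len(disk) - 1         # its permanent address

def update(addr, path_keys, new_value):
    """Rewrite the node at addr, descending along path_keys."""
    node = dict(disk[addr])      # copy; the old sector stays readable
    if path_keys:
        child_key = path_keys[0]
        node[child_key] = update(node[child_key], path_keys[1:], new_value)
    else:
        node['value'] = new_value
    return burn(node)            # the rewritten node consumes a new sector

leaf = burn({'value': 'old'})
mid = burn({'L': leaf})
root = burn({'M': mid})

new_root = update(root, ['M', 'L'], 'new')
assert len(disk) == 6            # three new sectors: leaf, mid, and root
# The old version is still intact, reachable through the old root:
assert disk[disk[disk[root]['M']]['L']]['value'] == 'old'
assert disk[disk[disk[new_root]['M']]['L']]['value'] == 'new'
```

Note that the copies are linked by backward pointers only: each newly burned parent points at an already written child, which is exactly the pointer discipline the next paragraphs recommend.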
This is not the only problem with using pointer linked structures on WORM optical disks; the direction that the pointers point is also of considerable consequence. A forward pointer is one that points from an older disk sector (written before) to a younger disk sector (written after), and a backward pointer points from a younger sector to an older sector. It is a characteristic of WORM optical disks that it is not possible to detect a bad disk sector before the sector has been written. Thus, if a forward pointer is stored on the disk, there is a chance that the sector it points to may subsequently turn out to be unusable, making the forward pointer invalid. Sector substitution schemes that might deal with this problem can be envisioned, but a better solution is to simply avoid the problem where possible by using backward pointers. Backward pointers do not have this problem, as they always point to valid sectors.

Preallocation of disk space on a WORM disk can also lead to problems. Most disk drives are unable to differentiate between reading a blank sector and reading a defective sector (i.e., they both are unreadable). Thus, if space has been reserved on a disk, it will be impossible to detect the difference between the beginning of preallocated (and blank) space on the disk and a previously written sector that is now unreadable (possibly because of a media defect, dirt, or scratches). This inability will render unreliable any organization that depends on preallocated space on a WORM disk.

A further aspect of WORM optical disk technology to consider when designing file structures is the granularity of data registration. On a WORM disk the smallest unit of data storage is the disk sector, which has a typical capacity of 1 kilobyte. A sector may be written once, and only once, and can never be updated. It is not possible to write a portion of a disk sector at one time and then another portion of the same sector later. This restriction comes from the error correction code added to the data when the disk sector is written. If a sector were modified after its initial write, its contents would become inconsistent with the error correction code, and the drive would either correct the "error" without comment or report a bad sector. With the inability to update the contents of a disk sector, there is a danger of wasting the storage capacity of the disk through sector fragmentation.

Not all of the characteristics of optical disks lead to problems for the implementations of file structures. For example, the spiral track found on CLV format disks lends itself nicely to hash file organizations by allowing hash buckets of arbitrary size to be created, allowing bucket overflow to be eliminated. Also, on CD-ROM disks, since the data never changes, some optimizations are possible.

6.3.2 Write-Once B-Tree

The Write-Once B-Tree (WOBT) of Easton (1986) is a variation of a B-tree organization developed for WORM disks. The difference between the two structures is the manner in which the contents of the tree's nodes are maintained. In a conventional B-tree, when a node is modified, its previous value or state is discarded in favor of the new value.
The WOBT instead manages the contents of its nodes in a way that preserves the previous states of the tree. This is accomplished by not overwriting nodes when they change, but instead appending new time-stamped entries to the node. The most current state of the tree is represented by the latest version of each entry (including pointers) in a node, and older entries are retained, so that accesses with respect to a previous time are still possible. For some applications the ability to access any previous state of the tree is a distinct advantage; the WOBT allows this. Deletions are handled by inserting a deletion marking record that explicitly identifies the record to be deleted and its deletion time.

The diagram in Figure 6.1 illustrates a WOBT with three nodes containing the records C, D, F, G, and H. The diagram shows the root node labeled 1 and two children, nodes 2 and 3. The first data entry indicates that C is the highest record in node 2, and F the highest in node 3. The rest of the entries in the root point to the children of the root. The root has an extra entry used to link different versions of the root node together. Being the first root of the tree, node 1 has a NULL (0) pointer for this entry.

Figure 6.1: Write-once B-tree

When a record A is added to the tree, it is simply appended to node 2 in the space available for it, and an entry is propagated up to the parent of node 2, the root. The result is illustrated in Figure 6.2.

Figure 6.2: Write-once B-tree after insertion of "A"

The root now contains two entries which point to node 2. When we access the tree with respect to the current time, we use only the most recent values of the record/pointer pairs, in this case A/2 and F/3. If we access the tree with respect to a time before the insertion of A, we find the record/pointer pairs C/2 and F/3, which are found by a sequential search of the node. The record A in node 2 would not be found in that search because its later time-stamp would disqualify it.

When a node is split in a WOBT, only those entries that are valid at the current time are copied into the new version of the node; the old version is retained intact. When we insert a further record "J" into the tree, we must split both node 3 and the root, node 1. When we split node 3 we end up with two new nodes, 4 and 5. Node 3 is no longer in the current version of the tree, but can still be accessed as part of the old version of the tree (prior to the insertion of "J") through the old root. When node 1 is "split" in this case, we can make room in the new node by not including the data/pointer pair C/2, which is now obsolete. This results in one new node, node 6, which becomes the new root. The extra entry in the new root node now points back to the old root, node 1. This is illustrated in Figure 6.3.
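The append-only node organization just walked through can be sketched in a few lines (hypothetical Python; the integer time-stamps and tuple layout are our simplification of Easton's scheme). A node is a list of record/pointer/time-stamp triples, and a lookup is a sequential scan that keeps only the newest pair for each child that is not later than the requested time:

```python
# Sketch of time-sliced lookups in a Write-Once B-tree node.

def node_view(entries, as_of):
    current = {}
    for record, child, t in entries:   # sequential search of the node
        if t <= as_of:                 # later time-stamps are disqualified
            current[child] = record    # newer entries supersede older ones
    return sorted((r, c) for c, r in current.items())

# Root node of Figures 6.1-6.2: C/2 and F/3 written at time 1, then A/2
# appended when record "A" was inserted at time 2.
root = [('C', 2, 1), ('F', 3, 1), ('A', 2, 2)]

assert node_view(root, as_of=2) == [('A', 2), ('F', 3)]   # current view
assert node_view(root, as_of=1) == [('C', 2), ('F', 3)]   # historical view
```

The sequential scan is exactly the extra per-node cost, relative to a conventional B-tree, noted in the next paragraph.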
Figure 6.3: Write-once B-tree after insertion of "J"

Features to note about the WOBT are that search times will be slower than for a conventional B-tree, as extra time will be required for a sequential search of each node to determine the current values of the data/pointer pairs. The WOBT also has problems with sector fragmentation, since all modifications of the tree are stored as they are made directly on the WORM disk, and each update must be stored on the disk in at least one disk sector. It is also not particularly efficient with frequent updates to a single record. However, despite these drawbacks, the WOBT is a robust (using backward pointers only) and elegant solution to the problem of efficiently implementing a multiway tree on an indelible storage device.

6.3.3 Time-Split B-Tree

The Time-Split B-tree (TSBT) of Lomet and Salzberg (1989) is an enhancement of the Write-once B-tree that eliminates some of its problems while adding to its utility. The basic structure and operation of the TSBT are identical to that of the WOBT. The difference between the two is that the TSBT distributes the contents of the tree over both a magnetic disk and a WORM optical disk, and employs a slightly different approach to node splitting that reduces redundancy.

A TSBT employs a magnetic disk to store the current changeable contents of the tree and the write-once disk to store the unchangeable historical contents. Combining magnetic and optical storage technologies allows their individual strengths to complement each other. Magnetic disks offer faster access and allow modifications to stored data without consuming storage space. Write-once optical disks have a very low cost per stored bit and, as well, ensure that data cannot be accidentally or maliciously deleted. This is an idea also employed in the buffered hashing organization of Christodoulakis and Ford (1989a).

The migration of the historical contents of the TSBT to the write-once disk is a consequence of node splitting. The TSBT divides node splits into two types: Time Splits and Key Splits. A time split occurs when a node is full of many historical (i.e., not in the current version of the tree) entries. A time split makes a new version of the node that omits the historical versions of records; in the TSBT, the old version of the node is removed from the magnetic disk and stored on the WORM optical disk. A key split is the normal type of split associated with conventional B-trees and occurs when a node is full and most of the records are current (in a conventional B-tree, the records are always current). The result is two new nodes, each with roughly half of the records (historical and current) of the original node.

The advantages of the Time-Split B-tree over the Write-Once B-tree are many. The magnetic disk allows updates to accumulate before they reach the optical disk; lower sector fragmentation is also a result. When implementing the TSBT, the choice of which type of split to favor can be tuned. If the size of the current version of the B-tree (i.e., the part of the tree being stored on the magnetic disk) is a concern, or if the magnetic disk is full, then time splits should be favored, as they free occupied space on the magnetic disk. Otherwise, key space splits should be favored, as time splits tend to increase redundancy in the database. Note that the maximum size of the current version of the tree is limited not by the amount of remaining storage space on the optical disk, which will be irrelevant, but by the capacity of the magnetic disk, as it stores all of the current contents of the tree.

6.3.4 Compact Disk File System

The Compact Disk File System (CDFS) of Garfinkel (1986) is not a conventional file system, but rather a structure for the organization of groups of complete files, directories, and links. The goals of the CDFS are to be completely transportable across a variety of modern operating systems, to make efficient use of storage space, and to have a relatively high retrieval performance. The application for which it is primarily intended is to organize files that experience few modifications. The smallest unit of registration in the CDFS organization is the file. When a file is modified, the new copy retains the sequence number but receives a new version number.

The basic unit of organization in the CDFS is called a "transaction." At the end of a transaction, an updated directory list for the entire file system is stored along with an "End of Transaction" (EOT) record. The EOT record contains a link to the EOT record of the previous transaction, allowing access to historical versions of the organization (a dummy EOT record is stored at the start of an empty disk). The last transaction on the disk is the starting point for all accesses, and the directory list it contains represents the current version of the file hierarchy. This organization also allows transactions to make nonpermanent entries before they commit.
since buffering on the magnetic disk tends to reduce fragmentation by buffering many small changes into one larger change. All the files in a transaction group are placed on the disk immediately adjacent to the position of the previous transaction. these can be deleted if the transaction aborts. such as those belonging to a source code archive.Information Retrieval: CHAPTER 6: FILE ORGANIZATIONS FOR OPTICAL DISKS faster access to the current data than is possible if it were stored on an optical disk alone. Records can also be updated without consuming disk space. The CDFS contains three types of "files": regular files. If a file is updated by writing a new copy of it at some later time.htm (12 of 18)7/3/2004 4:19:45 PM . the CDFS does not provide a structure for the organization of records. Each file is given a unique sequence number for the file system and a version number.e. The splitting policy of a TSBT can be skewed to favor different objectives.4 Compact Disk File System The Compact Disk File System (CDFS) of Garfinkel (1986) is a system independent hierarchical file system for WORM optical disks. If the total amount of storage space consumed is a concern. 6." A transaction results from the process of writing a complete group of files on the optical disk. Each individual file is stored contiguously. care must be taken to recognize that the size of the tree is limited not by the capacity of the optical disk. Unlike the write-once and time-split B-trees.
Each stored file entry consists of two parts, a file header and a file body. The header, which is invisible to a user, stores a large amount of explicit information about the file. For example, the file header contains the name of the owner of the file; on some systems (e.g., UNIX) this information must be derived by consulting a system database. This is an attempt by the CDFS to span the entire space of file characteristics that any given operating system might record or require. This explicit information allows the contents of a single disk employing a CDFS to appear to be a native file system on more than one operating system (with appropriate drivers for each system).

A directory is a special type of file that contains entries identifying other files known as members. These entries include pointers to the disk positions of the members and other information such as file sizes and modification times. This extra information is redundant, since it is also stored in the file header, but it serves to improve the performance of directory list operations by eliminating the seeks required to access each member. A link entry is simply a pointer to a file or a directory and allows a file to be a member of more than one directory.

The directory list stored at the end of the files in the transaction is an optimization to reduce seeks. It is a list of the positions of all current directories and subdirectories in the hierarchical file system. Using the directory list improves performance by reducing the seeks needed to traverse the file directory tree.

The diagram in Figure 6.4 illustrates how an instance of a CDFS is stored in a series of disk sectors. The example is for two transactions for a CDFS consisting of three files. The second transaction is used to store a second expanded version of the second file in the file system. The arrows in the diagram represent the pointers which link the various constituents of the CDFS together: (backward) pointers exist between the EOT records, between each EOT record and its directory list, between the directory list and the directories (in this case just one, the root), and between the root and the three files.

Figure 6.4: State of compact disk file system after two transactions

The CDFS is an efficient means of organizing an archive of a hierarchical file system. The robustness of the CDFS, inherent in the degree of redundancy found in the organization, coupled with the access it allows to all previous versions of a file, makes it ideal for use in storing file archives. Its main drawback is that it does not allow efficient updates to files. Any change to a single file requires the entire file to be rewritten along with a new copy of the directory list. Extra information could be added to the file header to allow files to be stored noncontiguously; this would allow portions of the file to be changed while other parts remained intact.
Being relatively system independent, the CDFS is also an excellent organization for data interchange. It would be possible, for example, to copy a complete UNIX file system to an optical disk employing the CDFS and then transport it to an appropriate VMS installation and access the archived UNIX file system as if it were a native VMS file system.

6.3.5 The Optical File Cabinet

The Optical File Cabinet (OFC) of Gait (1988) is another file system for WORM optical disks. Its goals are quite different from those of the CDFS described previously. Its main objective is to use a WORM disk to simulate an erasable file system such as that found on a magnetic disk. It does this by creating a logical disk block space which can be accessed and modified at random on a block-by-block basis through a conventional file system interface, and appear to an operating system just as if it was any other magnetic disk file system.

The mapping between the logical and physical blocks is provided by a structure called the File System Tree (FST), which resides on the WORM optical disk (see Figure 6.5). In the OFC, the leaves of the tree are the physical disk blocks of the WORM optical disk. To translate between logical and physical blocks, the logical block number is used to find a path through the FST to a leaf.

Figure 6.5: File system tree (FST) for optical file cabinet

Both the interior nodes and the leaves of the tree are buffered in main memory. The buffers are periodically flushed (e.g., every 5 minutes) and written on the optical disk in a process called checkpointing. Each flush results in a new version of the FST residing on the WORM disk. The roots of each of the different versions of the FST are pointed to by members of a list also residing on the optical disk called the time-stamp list. The current version of the FST is found by traversing the time-stamp list to find its last element.

The time-stamp list is implemented by a "continuation list." A continuation list consists of a series of disk blocks linked together via forward pointers. The pointer in the last element of a continuation list contains a pointer to a preallocated but empty disk block. The end of the list is detected when an attempt to read the next disk block in the list fails; it is assumed that the reason for the failure is because the preallocated (and empty) sector was read.

The use of forward pointers by the Optical File Cabinet file system seems to be a major flaw. As discussed previously, forward pointers are unreliable on WORM optical disks. This is because it is difficult to distinguish between a preallocated empty disk sector and an occupied but damaged sector; both are read as bad sectors. It would also appear that if any of the disk sectors in which the time-stamp list is stored were ever damaged (e.g., by a permanent scratch or temporary dirt), the entire organization would be crippled. The system would interpret the bad sector read as the end of the time-stamp list and access historical contents of the file system rather than the current contents. At the next checkpoint time it would attempt to write to the damaged sector and find itself in a dead end. Even if none of the current sectors of the time-stamp list are damaged, the list could still encounter a preallocated sector that is damaged (e.g., media defect). An error such as this would require the entire time-stamp list to be copied, and a scheme for identifying and locating alternate time-stamp lists on the disk.

The utility of the Optical File Cabinet file system is also difficult to establish. As a replacement for magnetic disks it is an expensive substitute, as the write-once disks eventually fill up and must be replaced. The most appropriate application seems to be in fault-tolerant systems that require their entire file systems to be permanently checkpointed at frequent intervals. For systems with less demanding requirements, the Write-once B-tree or the Time-split B-tree implemented for specific files might be more appropriate. The same is true for the archiving of copies of specific files (e.g., source code), where an organization such as the Compact Disk File System would be a better choice.

6.3.6 Buffered Hashing (Bhash)

Buffered Hashing (Bhash) of Christodoulakis and Ford (1989a) is a hash file organization for WORM optical disks which employs a rewritable buffer to obtain performance improvements and reductions in storage space consumption. The rewritable buffer employed can be either main memory or a magnetic disk. As records are added to the file, they are first stored in the buffer and linked into a list of other records that belong to their hash bucket. When the buffer is full, the largest group of records belonging to a single hash bucket are removed (flushed) from the buffer and stored as a contiguous unit on the WORM optical disk. This group is linked to a list on the optical disk of other such groups that were flushed previously and that belong to the same bucket. The space in the buffer freed by the flush is then reused to store more records belonging to any hash bucket.

The buffer helps to reduce sector fragmentation, but its main purpose is to group many small changes to the file (record insertions) into one larger insertion. Since successive insertions are unlikely to be to the same hash bucket, it is likely that the contents of a hash bucket will be stored in different (unpredictable) spots spread around the disk. By having a larger insertion, the degree to which the contents of a bucket are spread around the disk as a function of the number of record insertions is reduced. This, in turn, will reduce the number of seeks required to retrieve the contents of a hash bucket. This is important because preallocating space for a hash bucket, as is done on magnetic disks, is not really a viable option for a WORM disk.

The length of the list (number of different groups) on the WORM disk will determine the number of seeks required to complete the retrieval of a bucket. If adding a new group to the list would make the length exceed some retrieval performance limit on the number of seeks required to access a bucket, the groups of the list are merged together to form a single larger contiguous group. This group is stored on the WORM disk as well, and is used in subsequent retrievals of the bucket. The new bucket group also becomes the first member of a new group list, to which subsequent bucket flushes will be appended. When the length of that list exceeds the limit, it too will be merged. Because pointers to all previous groups on the disk are available and can be stored on a magnetic disk (such as the one being used for the buffer), access to all previous versions of the database is possible. It also can function as a roll-back database, giving access to all previous versions of each record ever stored in the hash file. Deleted records are either removed from the buffer, if they have not yet been flushed at the time of the deletion, or are marked as deleted by storing a deletion record in the bucket.

The Bhash file organization is a good method for implementing a hash file organization on a WORM optical disk. It can be tuned by an implementor to make efficient use of storage space or to meet strict retrieval performance demands. The parameters of the organization can be adjusted to obtain either fast retrieval performance at the expense of high storage space consumption, or low disk space consumption at the expense of slower retrieval performance. Retrieval performance is affected primarily by the length limits placed on the group lists. If high performance is desired, the allowed length of the lists will be very short, resulting in many merges and increased storage consumption; this will consume more space, since it will increase the number of merges and in turn the number of redundantly stored records. If low storage consumption is required, the allowed length of the lists will be quite long, resulting in very infrequent merges but more seeks. The size of the buffer and the number of buckets also plays a role in determining the performance of the organization. If the buffer is large, it will obviously require less flushing, and hence fewer merges will occur and less space will be consumed. If the number of buckets is very large, then we can expect that the size of a group flushed from the buffer will be small, so flushes will occur more often.

6.3.7 Balanced Implicit Multiway Trees (BIM Trees)

The static nature of data stored on a CD-ROM optical disk allows conventional B-tree structures to be fine tuned to produce a completely balanced multiway tree that does not require storage space for pointers. Because all the data to be stored on a CD-ROM disk is available at the time the disk is created, it is possible to preconstruct a perfectly balanced multiway tree for the data. And given that the size of all nodes of the tree and their layout is known, no pointers need to be stored, since it is easy to compute the position on the disk for a given node in the tree. Such a tree is called a Balanced Implicit Multiway Tree, or a BIM tree.
When constructing a BIM tree, it is possible to choose node sizes and layouts that will improve the expected retrieval performance for accesses from the tree. If the node size is chosen such that a parent node and all of its children can fit within the viewing window of the objective lens of the disk drive, it will be possible to eliminate the seek required to traverse the link between the parent and one of its children. For example, only a single seek, to the second level, would be required to retrieve any record within a three-level BIM tree, with the root buffered in main memory and each of the second-level nodes stored with all of their children within the viewing window of the objective lens. This last feature is particularly attractive for optical disks, as they have relatively slow seek times. This is particularly true for CD-ROM disks, on which the data never changes.

6.3.8 Hashing for CD-ROM and CLV Format Optical Disks

The spiral track that is characteristic of Constant Linear Velocity (CLV) format optical disks such as CD-ROM is ideal for implementing hashing file organizations and can guarantee single seek access, as discussed in Christodoulakis and Ford (1989b). The biggest complication in a hashing organization, and the biggest source of performance degradation, is the resolution of hash bucket overflows. Buckets overflow because they are usually associated with a physical division on the storage device, either a track or a cylinder, that has a finite capacity. The spiral track, which can be read continuously for the entire capacity of CLV format disks, allows hash buckets to be as large or as small as is necessary to store their contents. With a spiral track there is no need to impose an arbitrary physical limit (up to the capacity of the disk) on a bucket. Because all the data to be stored on a CD-ROM disk is available at the time the disk is created, hash buckets can be laid out along the spiral track one after another. To determine the position of each bucket, a small bucket position translation table recording the beginning position of each bucket can be used. With the translation table in main memory, the contents of a bucket can be accessed with a single seek.

REFERENCES

CHRISTODOULAKIS, S., and D. FORD. 1989a. "Retrieval Performance Versus Disc Space Utilization on WORM Optical Discs." Paper presented at the annual meeting of the Special Interest Group for the Management of Data of the Association of Computing Machinery (ACM SIGMOD'89), Portland, Oregon, June, 306-14.

CHRISTODOULAKIS, S., and D. FORD. 1989b. "File Organizations and Access Methods for CLV Optical Disks." Paper presented at the annual meeting of the Special Interest Group for Information Retrieval of the Association of Computing Machinery (ACM SIGIR'89), Cambridge, Massachusetts, June.

EASTON, M. C. 1986. "Key-Sequence Data Sets on Indelible Storage." IBM Journal of Research and Development, 30(3), 230-41.

GAIT, J. 1988. "The Optical File Cabinet: A Random-Access File System for Write-Once Optical Disks." Computer, 21(6), 11-22.

GARFINKEL, S. L. 1986. "A File System For Write-Once Media." Technical Report, MIT Media Lab, September.

KERR, J. 1877. "On the Rotation of the Plane of Polarization by Reflection from the Pole of a Magnet." Philosophical Magazine, 3, 321-43.

LOMET, D., and B. SALZBERG. 1989. "Access Method for Multiversion Data." Paper presented at the annual meeting of the Special Interest Group for the Management of Data of the Association of Computing Machinery (ACM SIGMOD'89), Portland, Oregon, June, 315-24.
Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS
CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS
Christopher Fox

AT&T Bell Laboratories, Holmdel, NJ 07733

Abstract

Lexical analysis is a fundamental operation in both query processing and automatic indexing, and filtering stoplist words is an important step in the automatic indexing process. This chapter presents basic algorithms and data structures for lexical analysis, and shows how stoplist word removal can be efficiently incorporated into lexical analysis.
7.1 INTRODUCTION
Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. Tokens are groups of characters with collective significance. Lexical analysis is the first stage of automatic indexing, and of query processing. Automatic indexing is the process of algorithmically examining information items to generate lists of index terms. The lexical analysis phase produces candidate index terms that may be further processed, and eventually added to indexes (see Chapter 1 for an outline of this process). Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of a query produces tokens that are parsed and turned into an internal representation suitable for comparison with indexes. In automatic indexing, candidate index terms are often checked to see whether they are in a stoplist, or negative dictionary. Stoplist words are known to make poor index terms, and they are immediately removed from further consideration as index terms when they are identified. This chapter discusses the design and implementation of lexical analyzers and stoplists for information retrieval. These topics go well together because, as we will see, one of the most efficient ways to implement stoplists is to incorporate them into a lexical analyzer.
7.2 LEXICAL ANALYSIS
7.2.1 Lexical Analysis for Automatic Indexing
The first decision that must be made in designing a lexical analyzer for an automatic indexing system is: What counts as a word or token in the indexing scheme? At first, this may seem an easy question, and there are some easy answers to it--for example, terms consisting entirely of letters should be tokens. Problems soon arise, however. Consider the following:

Digits--Most numbers do not make good index terms, so often digits are not included as tokens. However, certain numbers in some kinds of databases may be important (for example, case numbers in a legal database). Also, digits are often included in words that should be index terms, especially in databases containing technical documents. For example, a database about vitamins would contain important tokens like "B6" and "B12." One partial (and easy) solution to the last problem is to allow tokens to include digits, but not to begin with a digit.

Hyphens--Another difficult decision is whether to break hyphenated words into their constituents, or to keep them as a single token. Breaking hyphenated terms apart helps with inconsistent usage (e.g., "state-of-the-art" and "state of the art" are
treated identically), but loses the specificity of a hyphenated phrase. Also, dashes are often used in place of em dashes, and to mark a single word broken into syllables at the end of a line. Treating dashes used in these ways as hyphens does not work. On the other hand, hyphens are often part of a name, such as "Jean-Claude," "F-16," or "MS-DOS."

Other Punctuation--Like the dash, other punctuation marks are often used as parts of terms. For example, periods are commonly used as parts of file names in computer systems (e.g., "COMMAND.COM" in DOS), or as parts of section numbers; slashes may appear as part of a name (e.g., "OS/2"). If numbers are regarded as legitimate index terms, then numbers containing commas and decimal points may need to be recognized. The underscore character is often used in terms in programming languages (e.g., "max_size" is an identifier in Ada, C, Prolog, and other languages).

Case--The case of letters is usually not significant in index terms, and typically lexical analyzers for information retrieval systems convert all characters to either upper or lower case. Again, however, case may be important in some situations. For example, case distinctions are important in some programming languages, so an information retrieval system for source code may need to preserve case distinctions in generating index terms.

There is no technical difficulty in solving any of these problems, but information system designers must think about them carefully when setting lexical analysis policy. Recognizing numbers as tokens adds many terms with poor discrimination value to an index, but may be a good policy if exhaustive searching is important. Breaking up hyphenated terms increases recall but decreases precision, and may be inappropriate in some fields (like an author field). Preserving case distinctions enhances precision but decreases recall.
Commercial information systems differ somewhat in their lexical analysis policies, but are alike in usually taking a conservative (recall enhancing) approach. For example, Chemical Abstracts Service, ORBIT Search Service, and Mead Data Central's LEXIS/NEXIS all recognize numbers and words containing digits as index terms, and all are case insensitive. None has special provisions for most punctuation marks in most indexed fields. However, Chemical Abstracts Service keeps hyphenated words as single tokens, while the ORBIT Search Service and LEXIS/NEXIS break them apart (if they occur in title or abstract fields). The example we use to illustrate our discussion is simple so it can be explained easily, and because the simplest solution often turns out to be best. Modifications to it based on the considerations discussed above are easy to make. In the example, any nonempty string of letters and digits, not beginning with a digit, is regarded as a token. All letters are converted to lower case. All punctuation, spacing, and control characters are treated as token delimiters.
7.2.2 Lexical Analysis for Query Processing
Designing a lexical analyzer for query processing is like designing one for automatic indexing. It also depends on the design of the lexical analyzer for automatic indexing: since query search terms must match index terms, the same tokens must be distinguished by the query lexical analyzer as by the indexing lexical analyzer. In addition, however, the query lexical analyzer must usually distinguish operators (like the Boolean operators, stemming or truncating operators, and weighting function operators), and grouping indicators (like parentheses and brackets). A lexical analyzer for queries should also process certain characters, like control characters and disallowed punctuation characters, differently from one for automatic indexing. Such characters are best treated as delimiters in automatic indexing, but in query processing, they indicate an error. Hence, a query lexical analyzer should flag illegal characters as unrecognized tokens. The example query lexical analyzer presented below recognizes left and right parentheses (as grouping indicators), ampersand, bar, and caret (as Boolean operators), and any alphanumeric string beginning with a letter (as search terms). Spacing characters are treated as delimiters, and other characters are returned as unrecognized tokens. All uppercase characters are converted to lowercase.
7.2.3 The Cost of Lexical Analysis
Lexical analysis is expensive because it requires examination of every input character, while later stages of automatic indexing and query processing do not. Although no studies of the cost of lexical analysis in information retrieval systems have been done, lexical analysis has been shown to account for as much as 50 percent of the computational expense of compilation (Waite 1986). Thus, it is important for lexical analyzers, particularly for automatic indexing, to be as efficient as possible.
7.2.4 Implementing a Lexical Analyzer
Lexical analysis for information retrieval systems is the same as lexical analysis for other text processing systems; in particular, it is the same as lexical analysis for program translators. This problem has been studied thoroughly, so we ought to adopt the solutions in the program translation literature (Aho, Sethi, and Ullman 1986). There are three ways to implement a lexical analyzer:

1. Use a lexical analyzer generator, like the UNIX tool lex (Lesk 1975), to generate a lexical analyzer automatically;

2. Write a lexical analyzer by hand ad hoc; or

3. Write a lexical analyzer by hand as a finite state machine.

The first approach, using a lexical analyzer generator, is best when the lexical analyzer is complicated; if the lexical analyzer is simple, it is usually easier to implement it by hand. In our discussion of stoplists below, we present a special purpose lexical analyzer generator for automatic indexing that produces efficient lexical analyzers that filter stoplist words. Consequently, we defer further discussion of this alternative. The second alternative is the worst. An ad hoc algorithm, written just for the problem at hand in whatever way the programmer can think to do it, is likely to contain subtle errors. Furthermore, finite state machine algorithms are extremely fast, so ad hoc algorithms are likely to be less efficient. The third approach is the one we present in this section. We assume some knowledge of finite state machines (also called finite automata), and their use in program translation systems. Readers unfamiliar with these topics can consult Hopcroft and Ullman (1979), and Aho, Sethi, and Ullman (1986). Our example is an implementation of a query lexical analyzer as described above. The easiest way to begin a finite state machine implementation is to draw a transition diagram for the target machine. A transition diagram for a machine recognizing tokens for our example query lexical analyzer is pictured in Figure 7.1.
In this diagram, characters fall into ten classes: space characters, letters, digits, the left and right parentheses, ampersand, bar, caret, the end of string character, and all other characters. The first step in implementing this finite state machine is to build a mechanism for classifying characters. The easiest and fastest way to do this is to preload an array with the character classes for the character set. Assuming the ASCII character set, such an array would contain 128 elements with the character classes for the corresponding ASCII characters. If such an array is called char_class, for example, then the character class for character 'c' is simply char_class [c]. The character classes themselves form a distinct data type best declared as an enumeration in C. Figure 7.2 contains C declarations for a character class type and array. (Note that the end of file character requires special treatment in C because it is not part of ASCII).
Figure 7.1: Transition diagram for a query lexical analyzer
file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo...Books_Algorithms_Collection2ed/books/book5/chap07.htm (3 of 36)7/3/2004 4:19:57 PM
Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS
The same technique is used for fast case conversion. In Figure 7.2, an array of 128 characters called convert_case is preloaded with the printing characters, with lowercase characters substituted for uppercase characters. Nonprinting character positions will not be used, and are set to 0.

/**************  Character Classification  ******************/

/* Tokenizing requires that ASCII be broken into character   */
/* classes distinguished for tokenizing. White space         */
/* characters separate tokens. Digits and letters make up    */
/* the body of search terms. Parentheses group sub-          */
/* expressions. The ampersand, bar, and caret are            */
/* operator symbols.                                         */

typedef enum {
    WHITE_CH,        /* whitespace characters */
    DIGIT_CH,        /* the digits */
    LETTER_CH,       /* upper and lower case */
    LFT_PAREN_CH,    /* the "(" character */
    RGT_PAREN_CH,    /* the ")" character */
    AMPERSAND_CH,    /* the "&" character */
    BAR_CH,          /* the "|" character */
    CARET_CH,        /* the "^" character */
    EOS_CH,          /* the end of string character */
    OTHER_CH         /* catch-all for everything else */
} CharClassType;

static CharClassType char_class[128] = {
/* ^@ - ^G */ EOS_CH,   OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH,
/* ^H - ^O */ OTHER_CH, WHITE_CH, WHITE_CH, WHITE_CH, WHITE_CH, WHITE_CH, OTHER_CH, OTHER_CH,
/* ^P - ^W */ OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH,
/* ^X - ^_ */ OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH,
/* sp - '  */ WHITE_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, AMPERSAND_CH, OTHER_CH,
/* (  - /  */ LFT_PAREN_CH, RGT_PAREN_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH,
/* 0  - 7  */ DIGIT_CH, DIGIT_CH, DIGIT_CH, DIGIT_CH, DIGIT_CH, DIGIT_CH, DIGIT_CH, DIGIT_CH,
/* 8  - ?  */ DIGIT_CH, DIGIT_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH, OTHER_CH,
/* @  - G  */ OTHER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* H  - O  */ LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* P  - W  */ LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* X  - _  */ LETTER_CH, LETTER_CH, LETTER_CH, OTHER_CH, OTHER_CH, OTHER_CH, CARET_CH, OTHER_CH,
/* `  - g  */ OTHER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* h  - o  */ LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* p  - w  */ LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH, LETTER_CH,
/* x  - ^? */ LETTER_CH, LETTER_CH, LETTER_CH, OTHER_CH, BAR_CH, OTHER_CH, OTHER_CH, OTHER_CH
};

/**************  Character Case Conversion  *****************/

/* Term text must be accumulated in a single case. This      */
/* array is used to convert letter case but otherwise        */
/* preserve characters.                                      */

static char convert_case[128] = {
/* ^@ - ^G */ 0, 0, 0, 0, 0, 0, 0, 0,
/* ^H - ^O */ 0, 0, 0, 0, 0, 0, 0, 0,
/* ^P - ^W */ 0, 0, 0, 0, 0, 0, 0, 0,
/* ^X - ^_ */ 0, 0, 0, 0, 0, 0, 0, 0,
/* sp - '  */ ' ', '!', '"', '#', '$', '%', '&', '\'',
/* (  - /  */ '(', ')', '*', '+', ',', '-', '.', '/',
/* 0  - 7  */ '0', '1', '2', '3', '4', '5', '6', '7',
/* 8  - ?  */ '8', '9', ':', ';', '<', '=', '>', '?',
/* @  - G  */ '@', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
/* H  - O  */ 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
/* P  - W  */ 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
/* X  - _  */ 'x', 'y', 'z', '[', '\\', ']', '^', '_',
/* `  - g  */ '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
/* h  - o  */ 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
/* p  - w  */ 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
/* x  - ^? */ 'x', 'y', 'z', '{', '|', '}', '~', 0
};

/********************  Tokenizing  **************************/

/* The lexer distinguishes terms, parentheses, the and, or,  */
/* and not operators, the unrecognized token, and the end    */
/* of the input.                                             */

typedef enum {
    TERM_TOKEN      = 1,   /* a search term */
    LFT_PAREN_TOKEN = 2,   /* left parenthesis */
    RGT_PAREN_TOKEN = 3,   /* right parenthesis */
    AND_TOKEN       = 4,   /* set intersection connective */
    OR_TOKEN        = 5,   /* set union connective */
    NOT_TOKEN       = 6,   /* set difference connective */
    END_TOKEN       = 7,   /* end of the query */
    NO_TOKEN        = 8    /* the token is not recognized */
} TokenType;
Figure 7.2: Declarations for a simple query lexical analyzer There also needs to be a type for tokens. An enumeration type is best for this as well. This type will have an element for each of the tokens: term, left parenthesis, right parenthesis, ampersand, bar, caret, end of string, and the unrecognized token. Processing is simplified by matching the values of the enumeration type to the final states of the finite state machine. The declaration of the token type also appears in Figure 7.2. The code for the finite state machine must keep track of the current state, and have a way of changing from state to state on input. A state change is called a transition. Transition information can be encoded in tables, or in flow of control. When there are many states and transitions, a tabular encoding is preferable; in our example, a flow of control encoding is probably clearest. Our example implementation reads characters from an input stream supplied as a parameter. The routine returns the next token from the input each time it is called. If the token is a term, the text of the term (in lowercase) is written to a term buffer supplied as a parameter. Our example code appears in Figure 7.3.
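The tabular alternative mentioned above can be sketched briefly. The following is a hypothetical fragment, not the book's code: it stores the two-row transition table from the comment in Figure 7.3 directly as an array and drives it over a string rather than a stream, so "pushback" is just declining to advance the pointer. Term text collection is omitted, and the character classifier is a simplified stand-in for the char_class table.

```c
/* Character classes in the column order of the transition table */
enum { WHITE, LETTER, LPAR, RPAR, AMP, BAR, CARET, EOS, DIGIT, OTHER, NCLASS };

static int classify(int c)
{
    if (c == '\0') return EOS;
    if (c == ' ' || c == '\t' || c == '\n') return WHITE;
    if (c >= 'a' && c <= 'z') return LETTER;   /* assumes lowercased input */
    if (c >= '0' && c <= '9') return DIGIT;
    switch (c) {
        case '(': return LPAR;  case ')': return RPAR;
        case '&': return AMP;   case '|': return BAR;
        case '^': return CARET;
    }
    return OTHER;
}

/* Rows are states 0 and 1; negative entries are final states,
   and -entry is the token type (1 = term, 2 = "(", ... 7 = end). */
static const int delta[2][NCLASS] = {
  /*        White Letter  (    )    &    |    ^   EOS Digit Other */
  /* 0 */ {   0,    1,   -2,  -3,  -4,  -5,  -6,  -7,  -8,  -8 },
  /* 1 */ {  -1,    1,   -1,  -1,  -1,  -1,  -1,  -1,   1,  -1 }
};

/* Return the next token type from *sp, advancing the pointer. */
static int next_token(const char **sp)
{
    int state = 0;
    for (;;) {
        int cls  = classify(**sp);
        int next = delta[state][cls];
        /* consume the character unless it is the end of input, or
           it terminates a term and must be "pushed back" */
        if (cls != EOS && !(state == 1 && next < 0))
            (*sp)++;
        if (next < 0) return -next;
        state = next;
    }
}
```

With only two states the flow-of-control encoding below is clearer, but the tabular form scales better: adding states or character classes changes only the table, not the code.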
/*FN************************************************************************
GetToken( stream, term )

Returns: TokenType -- the next token from the input stream

Purpose: Get the next token from an input stream

Plan:    Part 1: Run a state machine on the input
         Part 2: Coerce the final state to return the token type

Notes:   Run a finite state machine on an input stream, collecting
         the text of the token if it is a term. The transition table
         for this DFA is the following (negative states are final):

         State | White  Letter  (   )   &   |   ^   EOS  Digit  Other
         -----------------------------------------------------------
           0   |   0      1    -2  -3  -4  -5  -6   -7    -8     -8
           1   |  -1      1    -1  -1  -1  -1  -1   -1     1     -1

         See the token type above to see what is recognized in the
         various final states.
**/

static TokenType GetToken( stream, term )
  FILE *stream;   /* in: where to grab input characters */
  char *term;     /* out: the token text if the token is a term */
  {
    int next_ch;  /* from the input stream */
    int state;    /* of the tokenizer DFA */
    int i;        /* for scanning through the term buffer */

    /* Part 1: Run a state machine on the input */
    state = 0;
    i = 0;
    while ( 0 <= state )
      {
        if ( EOF == (next_ch = getc(stream)) ) next_ch = '\0';
        term[i++] = convert_case[next_ch];
        switch( state )
          {
            case 0 :
              switch( char_class[next_ch] )
                {
                  case WHITE_CH     : i = 0;      break;
                  case LETTER_CH    : state =  1; break;
                  case LFT_PAREN_CH : state = -2; break;
                  case RGT_PAREN_CH : state = -3; break;
                  case AMPERSAND_CH : state = -4; break;
                  case BAR_CH       : state = -5; break;
                  case CARET_CH     : state = -6; break;
                  case EOS_CH       : state = -7; break;
                  case DIGIT_CH     : state = -8; break;
                  case OTHER_CH     : state = -8; break;
                  default           : state = -8; break;
                }
              break;
            case 1 :
              if ( (DIGIT_CH != char_class[next_ch])
                && (LETTER_CH != char_class[next_ch]) )
                {
                  ungetc( next_ch, stream );
                  term[i-1] = '\0';
                  state = -1;
                }
              break;
            default : state = -8; break;
          }
      }

    /* Part 2: Coerce the final state to return the token type */
    return( (TokenType)(-state) );
  } /* GetToken */

Figure 7.3: Code for a simple query lexical analyzer

The algorithm begins in state 0. As each input character is consumed, a switch on the state determines the transition. Input is consumed until a final state (indicated by a negative state number) is reached. When recognizing a term, the algorithm keeps reading until some character other than a letter or a digit is found. Since this character may be part of another token, it must be pushed back on the input stream. The final state is translated to a token type value by changing its sign and coercing it to the correct type (this was the point of matching the token type values to the final machine states).

The code above, augmented with the appropriate include files, is a complete and efficient implementation of our simple lexical analyzer for queries. Figure 7.4 contains a small main program to demonstrate the use of this lexical analyzer. The program reads characters from a file named on the command line, and writes out a description of the token stream that it finds. In real use, the tokens returned by the lexical analyzer would be processed by a query parser, which would also probably call retrieval and display routines.

/*FN***********************************************************************
main( argc, argv )

Returns: int -- 0 on success, 1 on failure

Purpose: Program main function

Plan:    Part 1: Open a file named on the command line
         Part 2: List all the tokens found in the file
         Part 3: Close the file and return

Notes:   This program simply lists the tokens found in a single file
         named on the command line.
**/

int main( argc, argv )
  int argc;      /* in: how many arguments */
  char *argv[];  /* in: text of the arguments */
  {
    TokenType token;  /* next token in the input stream */
    char term[128];   /* the term recognized */
    FILE *stream;     /* where to read the data from */

    /* Part 1: Open a file named on the command line */
    if ( (2 != argc) || !(stream = fopen(argv[1],"r")) ) exit(1);

    /* Part 2: List all the tokens found in the file */
    do
      switch( token = GetToken(stream,term) )
        {
          case TERM_TOKEN      : (void)printf( "term: %s\n", term );    break;
          case LFT_PAREN_TOKEN : (void)printf( "left parenthesis\n" );  break;
          case RGT_PAREN_TOKEN : (void)printf( "right parenthesis\n" ); break;
          case AND_TOKEN       : (void)printf( "and operator\n" );      break;
          case OR_TOKEN        : (void)printf( "or operator\n" );       break;
          case NOT_TOKEN       : (void)printf( "not operator\n" );      break;
          case END_TOKEN       : (void)printf( "end of string\n" );     break;
          case NO_TOKEN        : (void)printf( "no token\n" );          break;
          default              : (void)printf( "bad data\n" );          break;
        }
    while ( END_TOKEN != token );

    /* Part 3: Close the file and return */
    fclose( stream );
    return(0);
  } /* main */

Figure 7.4: Test program for a query lexical analyzer

When tested, this code tokenized at about a third the speed that the computer could read characters--about as fast as can be expected. An even simpler lexical analyzer for automatic indexing can be constructed in the same way, and it will be just as fast.

7.3 STOPLISTS

It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of the most frequently occurring words in English (like "the," "of," "and," "to," etc.) are worthless as index terms. A search using one of these terms is likely to retrieve almost every item in a database regardless of its relevance, so their discrimination value is low (Salton and McGill 1983; van Rijsbergen 1975). Furthermore, these words make up a large fraction of the text of most documents: the ten most frequently occurring words in English typically account for 20 to 30 percent of the tokens in a document (Francis and Kucera 1982). Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness. A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary. One way to improve information retrieval system performance, then, is to eliminate stopwords during automatic indexing.

As with lexical analysis in general, however, it is not clear which words should be included in a stoplist. Traditionally, stoplists are supposed to have included the most frequently occurring words. However, some frequently occurring words are too important as index terms. For example, included among the 200 most frequently occurring words in general literature in English are "time," "war," "home," "life," "water," and "world." On the other hand, specialized databases will contain many words useless as index terms that are not frequent in general English. For example, a computer literature database probably need not use index terms like "computer," "program," "source," "machine," and "language."

As with lexical analysis, stoplist policy will depend on the database and features of the users and the indexing process. Commercial information systems tend to take a very conservative approach, with few stopwords. For example, the ORBIT Search Service has only eight stopwords: "and," "an," "by," "from," "of," "the," "to," and "with." Larger stoplists are usually advisable. An oft-cited example of a stoplist of 250 words appears in van Rijsbergen (1975). Figure 7.5 contains a stoplist of 425 words derived from the Brown corpus (Francis and Kucera 1982) of 1,014,000 words drawn from a broad range of literature in English, which is specially constructed to be used with the lexical analysis generator described below. Fox (1990) discusses the derivation of (a slightly shorter version of) this list.

a  about  above  across  after
[... 425 words in alphabetical order ...]
young  younger  youngest  your  yours  z

Figure 7.5: A stoplist for general text

7.3.1 Implementing Stoplists

There are two ways to filter stoplist words from an input token stream: (a) examine lexical analyzer output and remove any stopwords, or (b) remove stopwords as part of lexical analysis.

The first approach, filtering stopwords from lexical analyzer output, makes the stoplist problem into a standard list searching problem: every token must be looked up in the stoplist, and removed from further analysis if found. The usual solutions to this problem are adequate, including binary search trees, binary search of an array, and hashing (Tremblay and Sorenson 1984, Chapter 13). Undoubtedly the fastest solution is hashing.

When hashing is used to search a stoplist, the list must first be inserted into a hash table. Each token is then hashed into the table. If the resulting location is empty, the token is not a stopword, and is passed on; otherwise, comparisons must be made to determine whether the hashed value really matches the entries at that hash table location. If not, then the token is passed on; if so, the token is a stopword, and is eliminated from the token stream. This strategy is fast, but is slowed by the need to re-examine each character in a token to generate its hash value, and by the need to resolve collisions.

The hashing strategy can be improved by incorporating computation of hash values into the character-by-character processing of lexical analysis. The output of the lexical analysis phase is then a hash value as well as a token, at a small increase in the cost of lexical analysis. This change speeds lookup, but has no effect on the number of collisions, which is sure to be large unless the hash table is enormous. Some improvement can also be realized by generating a perfect hashing function for the stoplist (a perfect hashing function for a set of keys hashes the keys with no collisions--see Chapter 13). This minimizes the overhead of collision resolution.

Although hashing is an excellent approach, probably the best implementation of stoplists is the second strategy: remove stoplist words as part of the lexical analysis process. Since lexical analysis must be done anyway, and recognizing even a large stoplist can be done at almost no extra cost during lexical analysis, this approach is extremely efficient. Furthermore, lexical analyzers that filter stoplists can be generated automatically, which is easier and less error-prone than writing stopword filters by hand.
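The hash table search used in the first approach can be sketched as follows. This is an illustrative fragment with hypothetical helper names, not code from the chapter; it uses chaining, so that full string comparisons resolve any collisions, exactly as described above.

```c
#include <stdlib.h>
#include <string.h>

#define STOP_TABLE_SIZE 127   /* small prime; a real system would
                                 size the table to its stoplist */

/* One chained hash table slot */
typedef struct StopNode {
    const char *word;
    struct StopNode *next;
} StopNode;

static StopNode *stop_table[STOP_TABLE_SIZE];

static unsigned stop_hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % STOP_TABLE_SIZE;
}

/* Insert one stopword; called once per word when loading the list */
static void stoplist_insert(const char *word)
{
    unsigned slot = stop_hash(word);
    StopNode *n = (StopNode *)malloc(sizeof *n);
    n->word = word;
    n->next = stop_table[slot];
    stop_table[slot] = n;
}

/* Returns 1 iff token is in the stoplist; on a collision the chain
   is scanned and string comparisons decide the match. */
static int is_stopword(const char *token)
{
    StopNode *n;
    for (n = stop_table[stop_hash(token)]; n != NULL; n = n->next)
        if (0 == strcmp(token, n->word)) return 1;
    return 0;
}
```

Note that is_stopword must rescan every character of the token to compute its hash, which is precisely the overhead that folding hash computation into lexical analysis removes.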
The rest of this chapter presents a lexical analyzer generator for automatic indexing. The lexical analyzer generator accepts an arbitrary list of stopwords. It should be clear from the code presented here how to elaborate the generator, or the driver program, to fit other needs.

7.3.2 A Lexical Analyzer Generator

The heart of the lexical analyzer generator is its algorithm for producing a finite state machine. The algorithm presented here is based on methods of generating minimum state deterministic finite automata (DFAs) using derivatives of regular expressions (Aho and Ullman 1975), adapted for lists of strings. (A DFA is minimum state if it has as few states as possible.) This algorithm is similar to one described by Aho and Corasick (1975) for string searching.

During machine generation, the algorithm labels each state with the set of strings the machine would accept if that state were the initial state. For example, suppose a state is labeled with the set of strings {an, and, in, into, to}. This state must have transitions on a, i, and t. The transition on a must go to a state labeled with the set {n, nd}, the transition on i to a state labeled {n, nto}, and the transition on t to a state labeled {o}. A state is made a final state if and only if its label contains the empty string.

It is easy to examine these state labels to determine: (a) the transitions out of each state, (b) the target state for each transition, and (c) the states that are final states. A state label Lj labeling the target state for a transition on symbol a out of a state labeled Li is called a derivative of label Li with transition a. An algorithm for generating a minimum state DFA using this labeling mechanism is presented in Figure 7.6. An example of a fully constructed machine appears in Figure 7.7.

create an initial state q0 and label it with the input set L0;
place q0 in a state queue Q;
while Q is not empty do:
  {
  remove state qi from Q;
  generate the derivative state labels from the label Li for qi;
  for each derivative state label Lj with transition a:
    {
    if no state qj labelled Lj exists, create qj and put it in Q;
    create an arc labelled a from qi to qj;
    }
  }
make all states whose label contains the empty string final states;

Figure 7.6: Algorithm for generating a finite state machine
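The derivative step in the algorithm above can be illustrated with a small helper. This is a hypothetical sketch, not part of the generator's code: given a state label (a list of strings) and a symbol, it collects the suffixes of the label's strings that begin with that symbol. On the example label {an, and, in, into, to}, the derivative with transition a is {n, nd}.

```c
/* Compute the derivative of a label set with respect to symbol a:
   the suffixes of those label strings that begin with a. Writes the
   suffixes to out and returns how many there are. */
static int derivative(const char *label[], int n, char a,
                      const char *out[])
{
    int i, count = 0;
    for (i = 0; i < n; i++)
        if (label[i][0] == a)
            out[count++] = label[i] + 1;   /* drop the first symbol */
    return count;
}
```

Repeating this for each distinct first symbol of a label yields exactly the derivative labels (and hence the transitions) that the algorithm generates for one state.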
hash values /* start big.7: An example of a generated finite state machine #define DEAD_STATE #define TABLE_INCREMENT /************************* -1 256 Hashing /* used to block a DFA */ /* used to grow tables */ ****************************/ /* Sets of suffixes labeling states during the DFA construction */ /* are hashed to speed searching. so searching them is still reasonably fast. then storing the signatures and labels */ file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.
ooks_Algorithms_Collection2ed/books/book5/chap07. typedef struct { StrList label. then listed in arc label order. int arc_offset. The tree is destroyed once the DFA /* is fully constructed. a pointer into /* the arc table for these arcs. /* for this state . unsigned signature. /********************* DFA State Table ************************/ */ */ */ */ /* The state table is an array of structures holding a state /* label. a count of the arcs out of the state. struct TreeNode *left. Each state's transitions are offset from /* the start of the table. and a final state flag.. /********************** DFA Arc Table *************************/ */ */ */ */ /* The arc table lists all transitions for all states in a DFA /* in compacted form. /* Transitions are found by a linear search of the sub-section file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. *StateTable. int state. The /* label field is used only during machine construction. typedef struct TreeNode { StrList label. short is_final.Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS /* in a binary search tree.used during build /* for this state in the arc table /* for finding arcs in the arc table /* TRUE iff this is a final state */ */ */ */ } StateTableEntry. int num_arcs. struct TreeNode *right. *SearchTree..htm (20 of 36)7/3/2004 4:19:57 PM . /* state label used as search key /* hashed label to speed searching /* whose label is representd by node /* left binary search subtree /* right binary search subtree */ */ */ */ */ */ */ } SearchTreeNode.
int max_arcs. SearchTree tree. typedef struct { char label. } DFAStruct. /* character label on an out-arrow /* the target state for the out-arrow */ */ */ } ArcTableEntry.Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS /* of the table for a given state. and bookkeepping /* counters. /* then realloc'd if more space is required. The tables are arrays whose space is malloc'd. int target..htm (21 of 36)7/3/2004 4:19:57 PM . Once a machine is /* constructed.. int max_states. typedef struct { int num_states. int num_arcs. *DFA. ArcTable arc_table.ooks_Algorithms_Collection2ed/books/book5/chap07. the table space is realloc'd one last time to /* fit the needs of the machine exactly. *ArcTable. ********************** DFA Structure *********************** / /* A DFA is represented as a pointer to a structure holding the */ /* machine's state and transition tables. StateTable state_table. /*FN************************************************************************** DestroyTree( tree ) Returns: void /* in the DFA (and state table) /* now allocated in the state table /* in the arc table for this machine /* now allocated in the arc table /* the compacted DFA state table /* the compacted DFA transition table /* storing state labels used in build */ */ */ */ */ */ */ */ */ */ */ */ file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.
. if ( NULL != tree->right ) DestroyTree ( tree->right ). /* Part 3: Deallocate the root */ tree->left = tree->right = NULL. label.state with the given label Purpose: Search a machine and return the state with a given state label Plan: Part 1: Search the tree for the requested state Part 2: If not found.ooks_Algorithms_Collection2ed/books/book5/chap07. /* Part 2: Deallocate the subtrees */ if ( NULL != tree->left ) DestroyTree ( tree->left ). /* in: search tree destroyed */ None..htm (22 of 36)7/3/2004 4:19:57 PM . (void)free( (char *)tree ). } /* DestroyTree */ /*FN************************************************************************ GetState( machine. signature ) Returns: int -. { /* Part 1: Return right away of there is no tree */ if ( NULL == tree ) return.Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS Purpose: Destroy a binary search tree created during machine construction Plan: Part 1: Return right away of there is no tree Part 2: Deallocate the subtrees Part 3: Deallocate the root Notes: **/ static void DestroyTree( tree ) SearchTree tree. add the label to the tree file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.
signature) DFA machine. /* in: DFA whose state labels are searched. /* pointer to a search tree link field */ /* for a newly added search tree node */ /* Part 1: Search the tree for the requested state */ ptr = &(machine->tree). **/ static int GetState(machine.ooks_Algorithms_Collection2ed/books/book5/chap07. SearchTree new_node.htm (23 of 36)7/3/2004 4:19:57 PM .Information Retrieval: CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS Part 3: Return the state number Notes: This machine always returns a state with the given label because if the machine does not have a state with the given label. /* in: state label searched for */ unsigned signature... label. StrList label. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. new-node->signature = signature. /* Part 2: If not found. sizeof(SearchTreeNode) ). add the label to the tree */ if ( NULL == *ptr ) { /* create a new node and fill in its fields */ new_node = (SearchTree)GetMemory( NULL. /* in: signature of the label requested */ { SearchTree *ptr. while ( (NULL != *ptr) && ( (signature != (*ptr)->signature) || !StrListEqual(label. then one is created.(*ptr)->label)) ) ptr = (signature <= (*ptr)->signature) ? &(*ptr)->left : &(*ptr)->right.
      new_node->label = (StrList)label;
      new_node->left = new_node->right = NULL;
      new_node->state = machine->num_states;

      /* hook the new node into the binary search tree */
      *ptr = new_node;

      /* allocate more states if needed, set up the new state */
      if ( machine->num_states == machine->max_states )
      {
         machine->max_states += TABLE_INCREMENT;
         machine->state_table = (StateTable)GetMemory( machine->state_table,
                                   machine->max_states*sizeof(StateTableEntry) );
      }
      machine->state_table[machine->num_states].label = (StrList)label;
      machine->num_states++;
   }
   else
      StrListDestroy( label );

   /* Part 3: Return the state number */
   return( (*ptr)->state );

} /* GetState */

/*FN*********************************************************************************
AddArc( machine, state, arc_label, state_label, state_signature )

Returns: void

Purpose: Add an arc between two states in a DFA
Plan:    Part 1: Search for the target state among existing states
         Part 2: Make sure the arc table is big enough
         Part 3: Add the new arc

Notes:   None.
**/

static void
AddArc( machine, state, arc_label, state_label, state_signature )
   DFA machine;              /* in/out: machine with an arc added */
   int state;                /* in: with an out arc added */
   char arc_label;           /* in: label on the new arc */
   StrList state_label;      /* in: label on the target state */
   unsigned state_signature; /* in: label hash signature to speed searching */
{
   register int target;      /* destination state for the new arc */

   /* Part 1: Search for the target state among existing states */
   StrListSort( 0, state_label );
   target = GetState( machine, state_label, state_signature );

   /* Part 2: Make sure the arc table is big enough */
   if ( machine->num_arcs == machine->max_arcs )
   {
      machine->max_arcs += TABLE_INCREMENT;
      machine->arc_table = (ArcTable)GetMemory( machine->arc_table,
                              machine->max_arcs * sizeof(ArcTableEntry) );
   }

   /* Part 3: Add the new arc */
   machine->arc_table[machine->num_arcs].label  = arc_label;
   machine->arc_table[machine->num_arcs].target = target;
   machine->num_arcs++;
   machine->state_table[state].num_arcs++;

} /* AddArc */

/*FN**********************************************************************************
BuildDFA( words )

Returns: DFA -- newly created finite state machine

Purpose: Build a DFA to recognize a list of words

Plan:    Part 1: Allocate space and initialize variables
         Part 2: Make and label the DFA start state
         Part 3: Main loop -- build the state and arc tables
         Part 4: Deallocate the binary search tree and the state labels
         Part 5: Reallocate the tables to squish them down
         Part 6: Return the newly constructed DFA

Notes:   None.
**/

DFA
BuildDFA( words )
   StrList words;            /* in: that the machine is built to recognize */
{
   DFA machine;              /* local for easier access to machine */
   register int state;        /* current state's state number */
   char arc_label;            /* for the current arc when adding arcs */
   register int i;            /* element in a set of state labels */
   char ch;                   /* the first character in a new string */
   StrList current_label;     /* set of strings labeling a state */
   StrList target_label;      /* labeling the arc target state */
   unsigned target_signature; /* hashed label for binary search tree */
   char *string;              /* for looping through strings */

   /* Part 1: Allocate space and initialize variables */
   machine = (DFA)GetMemory( NULL, sizeof(DFAStruct) );
   machine->max_states = TABLE_INCREMENT;
   machine->state_table = (StateTable)GetMemory( NULL,
                             machine->max_states*sizeof(StateTableEntry) );
   machine->num_states = 0;
   machine->max_arcs = TABLE_INCREMENT;
   machine->arc_table = (ArcTable)GetMemory( NULL,
                             machine->max_arcs * sizeof(ArcTableEntry) );
   machine->num_arcs = 0;
   machine->tree = NULL;

   /* Part 2: Make and label the DFA start state */
   StrListUnique( 0, words );          /* sort and unique the list */
   machine->state_table[0].label = words;
   machine->num_states = 1;

   /* Part 3: Main loop -- build the state and arc tables */
   for ( state = 0; state < machine->num_states; state++ )
   {
      /* The current state has nothing but a label, so    */
      /* the first order of business is to set up some    */
      /* of its other major fields.  Also set the         */
      /* state's final flag if the empty string is found  */
      /* in the suffix list                               */
      current_label = machine->state_table[state].label;
      machine->state_table[state].arc_offset = machine->num_arcs;
      machine->state_table[state].num_arcs = 0;
      machine->state_table[state].is_final = FALSE;

      /* Add arcs to the arc table for the current state  */
      /* based on the state's derived set                 */
      target_label = StrListCreate();
      arc_label = EOS;
      target_signature = HASH_START;
      for ( i = 0; i < StrListSize(current_label); i++ )
      {
         /* get the next string in the label and lop it */
         string = StrListPeek( current_label, i );
         ch = *string++;

         /* the empty string means mark this state as final */
         if ( EOS == ch )
         {
            machine->state_table[state].is_final = TRUE;
            continue;
         }

         /* make sure we have a legitimate arc_label */
         if ( EOS == arc_label ) arc_label = ch;

         /* if the first character is new, then we must */
         /* add an arc for the previous first character */
         if ( ch != arc_label )
         {
            AddArc( machine, state, arc_label, target_label, target_signature );
            target_label = StrListCreate();
            target_signature = HASH_START;
            arc_label = ch;
         }

         /* add the current suffix to the target state label */
         StrListAppend( target_label, string );
         target_signature += (*string + 1) * HASH_INCREMENT;
         while ( *string ) target_signature += *string++;
      }

      /* On loop exit we have not added an arc for the   */
      /* last bunch of suffixes, so we must do so, as    */
      /* long as the last set of suffixes is not empty   */
      /* (which happens when the current state label     */
      /* is the singleton set of the empty string)       */
      if ( 0 < StrListSize(target_label) )
         AddArc( machine, state, arc_label, target_label, target_signature );
   }

   /* Part 4: Deallocate the binary search tree and the state labels */
   DestroyTree( machine->tree );
   machine->tree = NULL;
   for ( i = 0; i < machine->num_states; i++ )
   {
      StrListDestroy( machine->state_table[i].label );
      machine->state_table[i].label = NULL;
   }

   /* Part 5: Reallocate the tables to squish them down */
   machine->state_table = (StateTable)GetMemory( machine->state_table,
                             machine->num_states * sizeof(StateTableEntry) );
   machine->arc_table = (ArcTable)GetMemory( machine->arc_table,
                             machine->num_arcs * sizeof(ArcTableEntry) );

   /* Part 6: Return the newly constructed DFA */
   return( machine );

} /* BuildDFA */

Figure 7.8: Code for DFA generation

After a finite state machine is constructed, it is easy to generate a simple driver program that uses it to process input. Figure 7.9 contains an example of such a driver program. The driver takes a machine constructed using the code in Figure 7.8, and uses it to filter all stopwords, returning only legitimate index terms. This program assumes the existence of a character class array like the one in Figure 7.2, except that it has only three character classes: one for digits, one for letters, and one for everything else (called delimiters). It also assumes there is a character case conversion array like the one in Figure 7.7. It filters numbers, but accepts terms containing digits. It also converts the case of index terms.

/*FN*************************************************************************
GetTerm( stream, machine, size, output )

Returns: char * -- NULL if stream is exhausted, otherwise output buffer

Purpose: Get the next token from an input stream, filtering stop words

Plan:    Part 1: Return NULL immediately if there is no input
         Part 2: Initialize the local variables
         Part 3: Main Loop: Put an unfiltered word into the output buffer
         Part 4: Return the output buffer

Notes:   This routine runs the DFA provided as the machine parameter, and
         collects the text of any term in the output buffer. If a stop
         word is recognized in this process, it is skipped. Care is also
         taken to be sure not to overrun the output buffer.
**/

char *
GetTerm( stream, machine, size, output )
   FILE *stream;             /* in: source of input characters */
   DFA machine;              /* in: finite state machine driving process */
   int size;                 /* in: bytes in the output buffer */
   char *output;             /* in/out: where the next token is placed */
{
   char *outptr;             /* for scanning through the output buffer */
   int ch;                   /* current character during input scan */
   register int state;       /* current state during DFA execution */

   /* Part 1: Return NULL immediately if there is no input */
   if ( EOF == (ch = getc(stream)) ) return( NULL );

   /* Part 2: Initialize the local variables */
   outptr = output;

   /* Part 3: Main Loop: Put an unfiltered word into the output buffer */
   do
   {
      /* scan past any leading delimiters */
      while ( (EOF != ch) &&
              ((DELIM_CH == char_class[ch]) ||
               (DIGIT_CH == char_class[ch])) )
         ch = getc( stream );

      /* start the machine in its start state */
      state = 0;

      /* copy input to output until reaching a delimiter, and also */
      /* run the DFA on the input to watch for filtered words      */
      while ( (EOF != ch) && (DELIM_CH != char_class[ch]) )
      {
         if ( outptr == (output+size-1) )
         {
            outptr = output;
            state = 0;
         }
         *outptr++ = convert_case[ch];

         if ( DEAD_STATE != state )
         {
            register int i;            /* for scanning through arc labels */
            int arc_start;             /* where the arc label list starts */
            int arc_end;               /* where the arc label list ends */

            arc_start = machine->state_table[state].arc_offset;
            arc_end = arc_start + machine->state_table[state].num_arcs;
            for ( i = arc_start; i < arc_end; i++ )
               if ( convert_case[ch] == machine->arc_table[i].label )
               {
                  state = machine->arc_table[i].target;
                  break;
               }
            if ( i == arc_end ) state = DEAD_STATE;
         }

         ch = getc( stream );
      }
      /* start from scratch if a stop word is recognized */
      if ( (DEAD_STATE != state) && machine->state_table[state].is_final )
         outptr = output;

      /* terminate the output buffer */
      *outptr = EOS;

   } while ( (EOF != ch) && !*output );

   /* Part 4: Return the output buffer */
   return( output );

} /* GetTerm */

Figure 7.9: An example of DFA driver program

Once the finite state machine blocks in the dead state, the string is not recognized. The driver program takes advantage of this fact by not running the finite state machine once it enters the dead state.

A lexical analyzer generator program can use these components in several ways. A lexical analyzer can be generated at indexing time, as in the example driver program, or ahead of time and stored in a file. The input can be read from a stream, or from another input source. A lexical analyzer data structure can be defined, then different analyzers can be run on different sorts of data, and different stoplists and lexical analysis rules used in each one. All these alternatives are easy to implement once the basic finite state machine generator and driver generator programs are in place. As an illustration, Figure 7.10 contains the main function for a program that reads a stoplist from a file, builds a finite state machine using the function from Figure 7.8, then uses the driver function from Figure 7.9 to generate and print all the terms in an input file, filtering out the words in the stoplist.

/*FN**************************************************************************
main( argc, argv )

Returns: int -- 0 on success, 1 on failure

Purpose: Program main function

Plan:    Part 1: Read the stop list from the stop words file
         Part 2: Create a DFA from a stop list
         Part 3: Open the input file and list its terms
         Part 4: Close the input file and return

Notes:   This program reads a stop list from a file called "words.std,"
         and uses it in generating the terms in a file named on the
         command line.
**/

int
main( argc, argv )
   int argc;                 /* in: how many arguments */
   char *argv[];             /* in: text of the arguments */
{
   char term[128];           /* for the next term found */
   FILE *stream;             /* where to read characters from */
   StrList words;            /* the stop list filtered */
   DFA machine;              /* built from the stop list */

   /* Part 1: Read the stop list from the stop words file */
   words = StrListCreate();
   StrListAppendFile( words, "words.std" );

   /* Part 2: Create a DFA from a stop list */
   machine = BuildDFA( words );

   /* Part 3: Open the input file and list its terms */
   if ( !(stream = fopen(argv[1], "r")) ) exit(1);
   while ( NULL != GetTerm(stream, machine, 128, term) )
      (void)printf( "%s\n", term );

   /* Part 4: Close the input file and return */
   (void)fclose( stream );
   return(0);

} /* main */

Figure 7.10: Main function for a term generator program

Lexical analyzers built using the method outlined here can be constructed quickly, and are small and fast. For example, the finite state machine generated by the code in Figure 7.8 for the stoplist of 425 words in Figure 7.5 has only 318 states and 555 arcs. The program in Figure 7.10 built this finite state machine from scratch, then used it to lexically analyze the text from this chapter in under 1 second on a Sun SparcStation 1.

7.4 SUMMARY

Lexical analysis must be done in automatic indexing and in query processing. Important decisions about the lexical form of indexing and query terms, and of query operators and grouping indicators, must be made in the design phase based on characteristics of databases and uses of the target system. Once this is done, it is a simple matter to apply techniques from program translation to the problem of lexical analysis. Finite state machine based lexical analyzers can be built quickly and reliably by hand or with a lexical analyzer generator.

Problems in the selection of stoplist words for automatic indexing are similar to those encountered in designing lexical analyzers, and likewise depend on the characteristics of the database and the system. Removing stoplist words during automatic indexing can be treated like a search problem, or can be incorporated into the lexical analysis process. Although hashing can be used to solve the searching problem very efficiently, it is probably best to incorporate stoplist processing into lexical analysis. Since stoplists may be large, automatic generation of lexical analyzers is the preferred approach in this case.

REFERENCES

AHO, A., and M. CORASICK. 1975. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, 18(6), 333-40.

AHO, A., R. SETHI, and J. ULLMAN. 1986. Compilers: Principles, Techniques, and Tools. New York: Addison-Wesley.

FOX, C. 1990. "A Stop List for General Text." SIGIR Forum, 24(1-2), 19-35.

FRANCIS, W., and H. KUCERA. 1982. Frequency Analysis of English Usage. New York: Houghton Mifflin.

HOPCROFT, J., and J. ULLMAN. 1979. Introduction to Automata Theory, Languages, and Computation. New York: Addison-Wesley.

LESK, M. 1975. Lex--A Lexical Analyzer Generator. Murray Hill, N.J.: AT&T Bell Laboratories.

LUHN, H. P. 1957. "A Statistical Approach to Mechanized Encoding and Searching of Literary Information." IBM Journal of Research and Development, 1(4).

SALTON, G., and M. McGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

TREMBLAY, J., and P. SORENSON. 1984. An Introduction to Data Structures with Applications, Second Edition. New York: McGraw-Hill.

VAN RIJSBERGEN, C. J. 1975. Information Retrieval. London: Butterworths.

WAITE, W. 1986. "The Cost of Lexical Analysis." Software Practice and Experience, 16(5), 473-88.
CHAPTER 8: STEMMING ALGORITHMS

W. B. Frakes
Software Engineering Guild, Sterling, VA 22170

Abstract

This chapter describes stemming algorithms--programs that relate morphologically similar indexing and search terms. Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files. Several approaches to stemming are described--table lookup, affix removal, successor variety, and n-gram. Empirical studies of stemming are summarized. The Porter stemmer is described in detail, and a full implementation in C is presented.

8.1 INTRODUCTION

One technique for improving IR performance is to provide searchers with ways of finding morphological variants of search terms. If, for example, a searcher enters the term stemming as part of a query, it is likely that he or she will also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of fusing or combining, as the general term for the process of matching morphological term variants. Conflation can be either manual--using some kind of regular expressions--or automatic, via programs called stemmers. Stemming is also used in IR to reduce the size of index files. Since a single stem typically corresponds to several full terms, by storing stems instead of terms, compression factors of over 50 percent can be achieved.

As can be seen in Figure 1.2 in Chapter 1, terms can be stemmed at indexing time or at search time. The advantage of stemming at indexing time is efficiency and index file compression--since index terms are already stemmed, this operation requires no resources at search time, and the index file will be compressed as described above. The disadvantage of indexing time stemming is that information about the full terms will be lost, or additional storage will be required to store both the stemmed and unstemmed forms.

Figure 8.1 shows a taxonomy for stemming algorithms. There are four automatic approaches. Affix removal algorithms remove suffixes and/or prefixes from terms leaving a stem. These algorithms sometimes also transform the resultant stem. The name stemmer derives from this method, which is the most common. Successor variety stemmers use the frequencies of letter sequences in a body of text as the basis of stemming. The n-gram method conflates terms based on the number of digrams or n-grams they share. Terms and their corresponding stems can also be stored in a table. Stemming is then done via lookups in the table. These methods are described below.

Figure 8.1: Conflation methods
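The table approach in the taxonomy maps directly onto a sorted term/stem table searched with the standard library's binary search. The sketch below is mine, not the book's: the table contents, the `TableEntry` type, and the `lookup_stem` interface are all illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* One row of a term -> stem table (illustrative entries, not from the book) */
typedef struct { const char *term; const char *stem; } TableEntry;

/* the table must be kept sorted on term for binary search */
static TableEntry stem_table[] = {
   { "engineer",    "engineer" },
   { "engineered",  "engineer" },
   { "engineering", "engineer" },
};

static int compare_term( const void *key, const void *entry )
{
   return strcmp( (const char *)key, ((const TableEntry *)entry)->term );
}

/* Return the stem for term, or the term itself if it is not in the table */
const char *lookup_stem( const char *term )
{
   TableEntry *e = (TableEntry *)bsearch( term, stem_table,
                      sizeof(stem_table)/sizeof(stem_table[0]),
                      sizeof(TableEntry), compare_term );
   return e ? e->stem : term;
}
```

Falling back to the term itself for out-of-table words reflects the coverage problem the chapter raises: domain-dependent vocabulary will simply not be in the table.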
There are several criteria for judging stemmers: correctness, retrieval effectiveness, and compression performance. There are two ways stemming can be incorrect--overstemming and understemming. When a term is overstemmed, too much of it is removed. Overstemming can cause unrelated terms to be conflated. The effect on IR performance is retrieval of nonrelevant documents. Understemming is the removal of too little of a term. Understemming will prevent related terms from being conflated. The effect of understemming on IR performance is that relevant documents will not be retrieved. Stemmers can also be judged on their retrieval effectiveness--usually measured with recall and precision as defined in Chapter 1--and on their speed, size, and so on. Finally, they can be rated on their compression performance. Stemmers for IR are not usually judged on the basis of linguistic correctness, though the stems they produce are usually very similar to root morphemes.

8.1.1 Example of Stemmer Use in Searching

To illustrate how a stemmer is used in searching, consider the following example from the CATALOG system (Frakes 1984, 1986). In CATALOG, terms are stemmed at search time rather than at indexing time. CATALOG prompts for queries with the string "Look for:". At the prompt, the user types in one or more terms of interest. For example:

Look for: system users

will cause CATALOG to attempt to find documents about system users. CATALOG takes each term in the query, and tries to determine which other terms in the database might have the same stem. If any possibly related terms are found, CATALOG presents them to the user for selection. In the case of the query term "users," for example, CATALOG might respond as follows:

Search Term: users
   Term       Occurrences
1. user           15
2. users           1
3. used            3
4. using           2

Which terms (0 = none, CR = all):

The user selects the terms he or she wants by entering their numbers. This method of using a stemmer in a search session provides a naive system user with the advantages of term conflation while requiring little knowledge of the system or of searching techniques. It also allows experienced searchers to focus their attention on other search problems.
Since stemming may not always be appropriate, the stemmer can be turned off by the user.

8.2 TYPES OF STEMMING ALGORITHMS

There are several approaches to stemming. One way to do stemming is to store a table of all index terms and their stems. For example:

Term            Stem
---------------------------
engineering     engineer
engineered      engineer
engineer        engineer

Terms from queries and indexes could then be stemmed via table lookup. Using a B-tree or hash table, such lookups would be very fast.

There are problems with this approach. The first is that there is no such data for English. Even if there were, many terms found in databases would not be represented, since they are domain dependent--that is, not standard English. For these terms, some other stemming method would be required. Another problem is the storage overhead for such a table, though trading size for time is sometimes warranted. Storing precomputed data, as opposed to computing the data values on the fly, is useful when the computations are frequent and/or expensive. Bentley (1982), for example, reports cases such as chess computations where storing precomputed results gives significant performance improvements.

8.2.1 Successor Variety

Successor variety stemmers (Hafer and Weiss 1974) are based on work in structural linguistics which attempted to determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances. The stemming method based on this work uses letters in place of phonemes, and a body of text in place of phonemically transcribed utterances.

Hafer and Weiss formally defined the technique as follows. Let a be a word of length n, and let a_i be the length i prefix of a. Let D be the corpus of words, and let D_i be the subset of D containing those terms whose first i letters match a_i exactly. The successor variety of a_i, denoted S_i, is then defined as the number of distinct letters that occupy the (i + 1)st position of words in D_i. A test word of length n thus has n successor varieties S_1, S_2, . . ., S_n.
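The definition of S_i translates directly into code. The sketch below is mine, not the book's (the function name and interface are assumptions): it scans a word list and counts the distinct letters found in position i + 1 among words that share the length-i prefix.

```c
#include <string.h>

/* Successor variety of the length-i prefix of word, over a corpus of
   nwords strings: the number of distinct letters occupying position
   i + 1 in corpus words whose first i letters match the prefix.
   (Illustrative sketch; not code from the chapter.) */
int successor_variety( const char *word, int i,
                       const char *corpus[], int nwords )
{
   int seen[256] = { 0 };     /* which successor letters have occurred */
   int count = 0;
   int w;

   for ( w = 0; w < nwords; w++ )
      if ( (int)strlen( corpus[w] ) > i             /* a successor exists */
           && 0 == strncmp( corpus[w], word, i ) )  /* prefix matches */
      {
         unsigned char c = (unsigned char)corpus[w][i];
         if ( !seen[c] ) { seen[c] = 1; count++; }
      }

   return count;
}
```

With the small corpus able, axle, accident, ape, about used in the chapter's "apple" example, the prefix "a" has successor variety 4 (b, x, c, p) and the prefix "ap" has successor variety 1 (only "e" follows it).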
In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider a body of text consisting of the following words, for example:

able, axle, accident, ape, about

To determine the successor varieties for "apple," the following process would be used. The first letter of apple is "a." "a" is followed in the text body by four characters: "b," "x," "c," and "p." Thus, the successor variety of "a" is four. The next successor variety for apple would be one, since only "e" follows "ap" in the text body, and so on.

When this process is carried out using a large body of text (Hafer and Weiss report 2,000 terms to be a stable number), the successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached. At this point, the successor variety will sharply increase. This information is used to identify stems.

Once the successor varieties for a given word have been derived, this information must be used to segment the word. Hafer and Weiss discuss four ways of doing this.

1. Using the cutoff method, some cutoff value is selected for successor varieties and a boundary is identified whenever the cutoff value is reached. The problem with this method is how to select the cutoff value--if it is too small, incorrect cuts will be made; if too large, correct cuts will be missed.

2. With the peak and plateau method, a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it. This method removes the need for a cutoff value to be selected.

3. In the complete word method, a break is made after a segment if the segment is a complete word in the corpus.

4. The entropy method takes advantage of the distribution of successor variety letters. The method works as follows. Let |D_i| be the number of words in the text body beginning with the i length sequence of letters a_i, and let |D_ij| be the number of words in D_i with the successor j. The probability that a member of D_i has the successor j is given by |D_ij| / |D_i|. The entropy of D_i is

   H_i = - SUM over j of ( |D_ij| / |D_i| ) log2 ( |D_ij| / |D_i| )

Using this equation, a set of entropy measures can be determined for a word. A set of entropy measures for predecessors can also be defined similarly. A cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.

Hafer and Weiss experimentally evaluated the various segmentation methods using two criteria: (1) the number of correct segment cuts divided by the total number of cuts, and (2) the number of correct segment cuts divided by the total number of true boundaries. They found that none of the methods performed perfectly, but that techniques that combined certain of the methods did best.

To illustrate the use of successor variety stemming, consider the example below where the task is to determine the stem of the word READABLE.

Test Word: READABLE

Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
        READING, READS, RED, ROPE, RIPE

Prefix      Successor Variety    Letters
----------------------------------------
R                  3             E, I, O
RE                 2             A, D
REA                1             D
READ               3             A, I, S
READA              1             B
READAB             1             L
READABL            1             E
READABLE           1             (blank)

Using the complete word segmentation method, the test word "READABLE" will be segmented into "READ" and "ABLE," since READ appears as a word in the corpus. The peak and plateau method would give the same result. After a word has been segmented, the segment to be used as the stem must be selected. Hafer and Weiss used the following rule:

if (first segment occurs in <= 12 words in corpus)
    first segment is stem
else
    second segment is stem

The check on the number of occurrences is based on the observation that if a segment occurs in more than 12 words in the corpus, it is probably a prefix. The authors report that because of the infrequency of multiple prefixes in English, no segment beyond the second is ever selected as the stem. Using this rule in the example above, READ would be selected as the stem of READABLE.

In summary, the successor variety stemming process has three parts: (1) determine the successor varieties for a word, (2) use this information to segment the word using one of the methods above, and (3) select one of the segments as the stem.

The aim of Hafer and Weiss was to develop a stemmer that required little or no human processing. They point out that while affix removal stemmers work well, they require human preparation of suffix lists and removal rules. Their stemmer requires no such preparation. The retrieval effectiveness of the Hafer and Weiss stemmer is discussed below.

8.2.2 n-gram Stemmers

Adamson and Boreham (1974) reported a method of conflating terms called the shared digram method. A digram is a pair of consecutive letters. Though we call this a "stemming method," this is a bit confusing since no stem is produced. Since trigrams, or n-grams, could be used, we have called it the n-gram method. In this approach, association measures are calculated between pairs of terms based on shared unique digrams.

For example, the terms statistics and statistical can be broken into digrams as follows:

statistics  => st ta at ti is st ti ic cs
               unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
               unique digrams = al at ca ic is st ta ti

Thus, "statistics" has nine digrams, seven of which are unique, and "statistical" has ten digrams, eight of which are unique. The two words share six unique digrams: at, ic, is, st, ta, ti.

Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed. The similarity measure used was Dice's coefficient, which is defined as

    S = 2C / (A + B)

where A is the number of unique digrams in the first word, B the number of unique digrams in the second, and C the number of unique digrams shared by A and B. For the example above, Dice's coefficient would equal (2 x 6)/(7 + 8) = .80.

Such similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Since Dice's coefficient is symmetric (Sij = Sji), a lower triangular similarity matrix can be
used as in the example below:

         word1   word2   word3   . . .   wordn-1
word2     S21
word3     S31     S32
  .
  .
wordn     Sn1     Sn2     Sn3    . . .   Sn(n-1)

Once such a similarity matrix is available, terms are clustered using a single link clustering method as described in Chapter 16. The algorithm for calculating a digram similarity matrix follows.

/* calculate similarity matrix for words based on digrams */

#define MAXGRAMS 50           /* maximum n-grams for a word */
#define GRAMSIZE 2            /* size of the n-gram */

void
digram_smatrix( wordlist, word_list_length, smatrix )
   char *wordlist[];          /* list of sorted unique words */
   int word_list_length;      /* length of wordlist */
   double *smatrix[];
{
   int i, j;                  /* loop counters */
   int uniq_in_word1;         /* number of unique digrams in word 1 */
   int uniq_in_word2;         /* number of unique digrams in word 2 */
   int common_uniq;           /* number of unique digrams shared by words 1 and 2 */
   char uniq_digrams_1[MAXGRAMS][GRAMSIZE];  /* array of digrams */
   char uniq_digrams_2[MAXGRAMS][GRAMSIZE];  /* array of digrams */
   int unique_digrams();      /* function to calculate # of unique digrams in a word */
   int common_digrams();      /* function to calculate # of shared unique digrams */

   for ( i = 0; i < word_list_length; ++i )
      for ( j = i+1; j < word_list_length; ++j )
      {
         /* find unique digrams for first word in pair */
         uniq_in_word1 = unique_digrams( wordlist[i], uniq_digrams_1 );

         /* find unique digrams for second word in pair */
         uniq_in_word2 = unique_digrams( wordlist[j], uniq_digrams_2 );

         /* find number of common unique digrams */
         common_uniq = common_digrams( uniq_digrams_1, uniq_digrams_2 );

         /* calculate similarity value and store in similarity matrix */
         /* (2.0 rather than 2 forces floating point division)        */
         smatrix[i][j] = (2.0*common_uniq)/(uniq_in_word1+uniq_in_word2);

      } /* end for */

} /* end digram_smatrix */

When they used their method on a sample of words from Chemical Titles, Adamson and Boreham found that most pairwise similarity measures were 0. Thus, the similarity matrix will be sparse, and techniques for handling sparse matrices will be appropriate. Using a cutoff similarity value of .6, they found ten of the eleven clusters formed were correct. More significantly, in hardly any cases did the method form false associations. The authors also report using this method successfully to cluster document titles.

8.2.3 Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. These algorithms sometimes also transform the resultant stem. A simple example of an affix removal stemmer is one that removes the plurals from terms. A set of rules for such a stemmer is as follows (Harman 1991).

If a word ends in "ies," but not "eies" or "aies"
    then "ies" -> "y"
If a word ends in "es," but not "aes," "ees," or "oes"
    then "es" -> "e"
If a word ends in "s," but not "us" or "ss"
    then "s" -> NULL

In this algorithm only the first applicable rule is used.

Most stemmers currently in use are iterative longest match stemmers, a kind of affix removal stemmer first developed by Lovins (1968). An iterative longest match stemmer removes the longest possible string of characters from a word according to a set of rules. This process is repeated until no more characters can be removed. Even after all characters have been removed, stems may not be correctly conflated. The word "skies," for example, may have been reduced to the stem "ski," which will not match "sky." There are two techniques to handle this--recoding and partial matching.

Recoding is a context sensitive transformation of the form AxC -> AyC, where A and C specify the context of the transformation, x is the input string, and y is the transformed string. We might, for example, specify that if a stem ends in an "i" following a "k," then i -> y. In partial matching, only the n initial characters of stems are used in comparing them. Using this approach, we might say that two stems are equivalent if they agree in all but their last characters.

In addition to Lovins, iterative longest match stemmers have been reported by Salton (1968), Dawson (1974), Porter (1980), and Paice (1990). As discussed below, the Porter algorithm is more compact than Lovins, Salton, and Dawson, and seems, on the basis of experimentation, to give retrieval performance comparable to the larger algorithms. The Paice stemmer is also compact, but since experimental data was not available for the Paice algorithm, we chose the Porter algorithm as the example of this type of stemmer.

The Porter algorithm consists of a set of condition/action rules. The conditions fall into three classes: conditions on the stem, conditions on the suffix, and conditions on the rules. There are several types of stem conditions.

1. The measure, denoted m, of a stem is based on its alternate vowel-consonant sequences. Vowels are a, e, i, o, u, and y if preceded by a consonant. Consonants are all letters that are not vowels. Let C stand for a sequence of consonants, and V for a sequence of vowels. The measure m, then, is defined as

    [C](VC)^m[V]

The superscript m in the equation, which is the measure, indicates the number of VC sequences. Square brackets indicate an optional occurrence. Some examples of measures for terms follow.

   Measure    Examples
   ---------------------------------------
   m=0        TR, EE, TREE, Y, BY
   m=1        TROUBLE, OATS, TREES, IVY
   m=2        TROUBLES, PRIVATE, OATEN

2. *<X>--the stem ends with a given letter X
3. *v*--the stem contains a vowel
4. *d--the stem ends in a double consonant
5. *o--the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y

Suffix conditions take the form: (current_suffix == pattern). Rule conditions take the form: (rule was used). Actions are rewrite rules of the form:

    old_suffix -> new_suffix

The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply. The longest possible suffix is always removed because of the ordering of the rules within a step.
The algorithm is as follows.

porter_stemmer(word)
{
   step1a(word);
   step1b(stem);
   if (the second or third rule of step 1b was used)
      step1b1(stem);
   step1c(stem);
   step2(stem);
   step3(stem);
   step4(stem);
   step5a(stem);
   step5b(stem);
}

The rules for the steps of the stemmer are as follows.

Step 1a Rules

Conditions   Suffix   Replacement   Examples
---------------------------------------------------
NULL         sses     ss            caresses -> caress
NULL         ies      i             ponies -> poni
                                    ties -> tie
NULL         ss       ss            caress -> caress
NULL         s        NULL          cats -> cat

Step 1b Rules

Conditions   Suffix   Replacement   Examples
----------------------------------------------------
(m>0)        eed      ee            feed -> feed
                                    agreed -> agree
(*v*)        ed       NULL          plastered -> plaster
                                    bled -> bled
(*v*)        ing      NULL          motoring -> motor
                                    sing -> sing

Step 1b1 Rules

Conditions      Suffix   Replacement     Examples
------------------------------------------------------------------------
NULL            at       ate             conflat(ed) -> conflate
NULL            bl       ble             troubl(ing) -> trouble
NULL            iz       ize             siz(ed) -> size
(*d and not     NULL     single letter   hopp(ing) -> hop
(*<L> or *<S>                            tann(ed) -> tan
or *<Z>))                                fall(ing) -> fall
                                         hiss(ing) -> hiss
                                         fizz(ed) -> fizz
(m=1 and *o)    NULL     e               fail(ing) -> fail
                                         fil(ing) -> file

Step 1c Rules

Conditions   Suffix   Replacement   Examples
-----------------------------------------------------------
(*v*)        y         i             happy -> happi
                                     sky -> sky

Step 2 Rules

Conditions   Suffix    Replacement   Examples
--------------------------------------------------------------
(m>0)        ational   ate           relational -> relate
(m>0)        tional    tion          conditional -> condition
                                     rational -> rational
(m>0)        enci      ence          valenci -> valence
(m>0)        anci      ance          hesitanci -> hesitance
(m>0)        izer      ize           digitizer -> digitize
(m>0)        abli      able          conformabli -> conformable
(m>0)        alli      al            radicalli -> radical
(m>0)        entli     ent           differentli -> different
(m>0)        eli       e             vileli -> vile
(m>0)        ousli     ous           analogousli -> analogous
(m>0)        ization   ize           vietnamization -> vietnamize
(m>0)        ation     ate           predication -> predicate
(m>0)        ator      ate           operator -> operate
(m>0)        alism     al            feudalism -> feudal
(m>0)        iveness   ive           decisiveness -> decisive
(m>0)        fulness   ful           hopefulness -> hopeful
(m>0)        ousness   ous           callousness -> callous
(m>0)        aliti     al            formaliti -> formal
(m>0)        iviti     ive           sensitiviti -> sensitive
(m>0)        biliti    ble           sensibiliti -> sensible

Step 3 Rules

Conditions   Suffix   Replacement   Examples
--------------------------------------------------------
(m>0)        icate    ic            triplicate -> triplic
(m>0)        ative    NULL          formative -> form
(m>0)        alize    al            formalize -> formal
(m>0)        iciti    ic            electriciti -> electric
(m>0)        ical     ic            electrical -> electric
(m>0)        ful      NULL          hopeful -> hope
(m>0)        ness     NULL          goodness -> good

Step 4 Rules

Conditions   Suffix   Replacement   Examples
---------------------------------------------------------------------
(m>1)        al       NULL          revival -> reviv
(m>1)        ance     NULL          allowance -> allow
(m>1)        ence     NULL          inference -> infer
(m>1)        er       NULL          airliner -> airlin
(m>1)        ic       NULL          gyroscopic -> gyroscop
(m>1)            able     NULL          adjustable -> adjust
(m>1)            ible     NULL          defensible -> defens
(m>1)            ant      NULL          irritant -> irrit
(m>1)            ement    NULL          replacement -> replac
(m>1)            ment     NULL          adjustment -> adjust
(m>1)            ent      NULL          dependent -> depend
(m>1 and         ion      NULL          adoption -> adopt
(*<S> or *<T>))
(m>1)            ou       NULL          homologou -> homolog
(m>1)            ism      NULL          communism -> commun
(m>1)            ate      NULL          activate -> activ
(m>1)            iti      NULL          angulariti -> angular
(m>1)            ous      NULL          homologous -> homolog
(m>1)            ive      NULL          effective -> effect
(m>1)            ize      NULL          bowdlerize -> bowdler

Step 5a Rules

Conditions         Suffix   Replacement   Examples
---------------------------------------------------------
(m>1)              e        NULL          probate -> probat
                                          rate -> rate
(m=1 and not *o)   e        NULL          cease -> ceas

Step 5b Rules
Conditions              Suffix   Replacement     Examples
-------------------------------------------------------------------
(m>1 and *d and *<L>)   NULL     single letter   controll -> control
                                                 roll -> roll

A full implementation of this stemmer, in C, is in the appendix to this chapter.

8.3 EXPERIMENTAL EVALUATIONS OF STEMMING

There have been many experimental evaluations of stemmers. Salton (1968) examined the relative retrieval performance of fully stemmed terms against terms with only the suffix "s" removed. The stemmer used was an iterative longest match stemmer of the type described above, employing about 200 endings. Three document collections were used in these studies: the IRE-3 collection, consisting of 780 computer science abstracts and 34 search requests; the ADI collection, consisting of 82 papers and 35 search requests; and the Cranfield-1 collection, consisting of 200 aerodynamics abstracts and 42 search requests.

Differences between the two conflation methods, fully stemmed and "s" stemmed, were calculated on 14 dependent variables for each query: rank recall, log precision, normalized recall, normalized precision, and precision for ten recall levels. As Salton points out, these measures are probably intercorrelated, and since the inferential tests used require independence of the dependent variables, the results reported must be viewed with caution. This data was analyzed using both related group t tests and sign tests. Both of these statistical methods yielded significant results at the .05 probability level. Since none of the t values are reported, precluding their use in an estimate of effect size, and since sufficient data is provided for an estimate of effect size from the sign tests, the latter will be discussed here. (The effect size is the percentage of the variance of the independent variable accounted for by the dependent variables.) The effect size for a sign test is the difference between the obtained proportion (that is, the number of cases favoring a given method over the number of all cases where a difference was detected) and the expected proportion if there is no difference between means (that is, .50*).

For the IRE-3 collection, 476 differences were used for the sign test (the 14 dependent variable measures x 34 queries). Of these, 272 pairs favored full stemming, 132 pairs favored suffix "s" stemming, and in 72 cases neither method was favored. Thus, the effect size for the IRE-3 collection is 272/(272 + 132) - .50 = .17. In the test of the ADI collection, 254 cases were found to favor full stemming, 107 cases favored suffix "s" stemming, and 129 cases favored neither. The effect size for this collection is .20. For Cranfield-1, 134 cases favored full stemming, 371 favored the suffix "s" stemming, and 183 favored neither. The effect size for this collection is .235.

*The effect size for a sign test can only take on values in the range 0 to .50.

The most striking feature of these experiments is the discrepancy between the results for the IRE-3 and ADI collections, and the results for the Cranfield-1 collection. The results for the Cranfield-1 collection are opposite those for IRE-3 and ADI.
Salton offers the plausible explanation that the discrepancy is due to the highly technical and homogeneous nature of the Cranfield vocabulary. He states:

   To be able to differentiate between the various document abstracts, it is important to maintain finer distinctions for the Cranfield case than for ADI and IRE, and these finer distinctions are lost when several words are combined into a unique class through the suffix cutoff process.

One of Salton's assertions about these experiments seems ill founded in light of the effect sizes calculated above. He states that, "For none of the collections (ADI, IRE-3, Cranfield-1) is the improvement of one method over the other really dramatic, so that in practice, either procedure might reasonably be used." It may be that Salton's remarks are based on unreported t-test data which may have indicated small effect sizes. The sign test is insensitive to the magnitude of differences, and thus a large effect might be found using a sign test while a small effect was found using a t-test. It seems much more reasonable, however, given the differences observed between the methods, to conclude that conflation may have a significant effect on retrieval performance, but that the effect may depend on the vocabulary involved.

Hafer and Weiss (1974) tested their stemmer against other stemming methods using the ADI collection, and the Carolina Population Center (CPC) collection, consisting of 75 documents and five queries. The authors also determined, by hand, "each family derived from a common stem"; the method and criteria used are not reported. Comparisons were made on the basis of recall-precision plots. For the ADI collection, the Hafer and Weiss stemmer outperformed the SMART stemmer. The stemmer used was a simple suffix removal stemmer employing 20 endings. In a test using the CPC collection, the Hafer and Weiss stemmer performed equally to the manually derived stems. All methods of stemming outperformed full terms for both test collections. No statistical testing was reported.

Van Rijsbergen et al. (1980) tested their stemmer (Porter 1980) against the stemmer described by Dawson (1974) using the Cranfield-1 test collection. Both of these stemmers are of the longest match type, though the Dawson stemmer is more complex. They report that the performance of their stemmer was slightly better across ten paired recall-precision levels. No statistical results of any kind are reported.

Katzer et al. (1982) examined the performance of stemmed title-abstract terms against six other document representations: unstemmed title-abstract terms, unstemmed abstract terms, unstemmed title terms, descriptors, identifiers, and descriptors and identifiers combined. The database consisted of approximately 12,000 documents from Computer and Control Abstracts. Seven experienced searchers were used in the study. Relevance judgments on a scale of 1-4 (highly relevant to nonrelevant) were obtained from the users who requested the searches. The dependent variable measures used were recall and precision for highly relevant documents, and recall and precision for all documents.

On no dependent variable did stemmed terms perform less well than other representations, and in some cases they performed significantly better than descriptors. On recall for highly relevant documents, and on all document recall, stemming did significantly better than both descriptors and title terms. The effect sizes, calculated here on the basis of reported data, were .22 and .26 for highly relevant, and .20 for all relevant. The results of this study must be viewed with some caution, because for a given query, different searchers searched different
representations, with each searcher searching different representations for each query. This means that observed differences between representations may to some degree have been caused by searcher-representation interactions.

Lennon et al. (1981) examined the performance of various stemmers, both in terms of retrieval effectiveness and inverted file compression. The eight stemmers used in this study were: the Lovins stemmer, the Porter stemmer, the RADCOL stemmer, a suffix frequency stemmer based on the RADCOL project, a stemmer developed by INSPEC, a stemmer based on the frequency of word endings in a corpus, the Hafer and Weiss stemmer, and a trigram stemmer. In addition to the eight stemming methods, unstemmed terms were also used. Thus, the two questions that this study addressed concerned (1) the relative effectiveness of stemming versus nonstemming, and (2) the relative performance of the various stemmers examined.

For the retrieval comparisons, the Cranfield-1400 collection was used. This collection contains 1,396 documents and 225 queries. The titles of documents and the queries were matched, and the documents ranked in order of decreasing match value. A cutoff was then applied to specify a given number of the highest ranked documents. The evaluation measure was E, as defined in Chapter 1. The various methods were evaluated at b levels of .5 (indicating that a user was twice as interested in precision as recall) and 2 (indicating that a user was twice as interested in recall as precision). These levels of b were not chosen by users, but were set by the experimenters.

In terms of the first question, at the b = .5 level, the INSPEC, Porter, Lovins, and both frequency stemmers all did better than unstemmed terms, and at the b = 2 level, all but Hafer and Weiss did better than unstemmed terms. Thus, only the Hafer and Weiss stemmer did worse than unstemmed terms at the two levels of b. This information was derived from a table of means reported for the two b levels. As for the second question, Lennon et al. (1981) report that "a few of the differences between algorithms were statistically significant at the .05 probability level of significance in the precision oriented searches." However, they fail to report any test statistics or effect sizes, or even identify which pairs of methods were significantly different. They conclude that

   Despite the very different means by which the various algorithms have been developed, there is relatively little difference in terms of the dictionary compression and retrieval performance obtained from their use; in particular, fully automated methods perform as well as procedures which involve a large degree of manual involvement in their development.

This assertion seems to be contradicted by the significant differences found between algorithms. As stated above, the paucity of reported data in this study makes an independent evaluation of the results extremely difficult.

Frakes (1982) did two studies of stemming. The first tested the hypothesis that the closeness of stems to root morphemes will predict improved retrieval effectiveness. The experiments were done using a database of approximately 12,000 document representations from six months of psychological abstracts. Fifty-three search queries were solicited from users and searched by four professional searchers under four representations: title, abstract, descriptors, and identifiers. The retrieval system used was DIATOM (Waldstein 1981), a DIALOG simulator.
In addition to most of the features of DIALOG, DIATOM allows stemmed searching using Lovins' (1968) algorithm, and permits search histories to be trapped and saved into files. As is the case with almost all IR systems, DIATOM allows only right and internal truncation; thus, the experiments did not address prefix removal.

The hypothesis for the first experiment, based on linguistic theory, was that terms truncated on the right at the root morpheme boundary will perform better than terms truncated on the right at other points. To test this hypothesis, searches based on 34 user need statements were analyzed using multiple regression and the nonparametric Spearman's test for correlation. User need statements were solicited from members of the Syracuse University academic community and other institutions. Each user need statement was searched by four experienced searchers under four different representations (abstract, title, identifiers, and descriptors). Each searcher did a different search for each representation. This study made use of only title and abstract searches, since searches of these fields were most likely to need conflation.

The independent variables were positive and negative deviations of truncation points from root boundaries, where deviation was measured in characters. Root morphemes were determined by applying a set of generation rules to searcher supplied full-term equivalents of truncated terms, and checking the results against the linguistic intuitions of two judges. The dependent variables were E measures (see Chapter 1) at three trade-off values: b = .5 (precision twice as important as recall), b = 1 (precision equals recall in importance), and b = 2 (recall twice as important as precision). Relevant documents retrieved under all representations were used in the calculation of recall measures.

The data analysis showed that searchers truncated at or near the root morpheme boundaries. The mean number of characters by which terms truncated by searchers varied from root boundaries in a positive direction was .825, with a standard deviation of .429. Deviations in a negative direction were even smaller, with a mean of .035 and a standard deviation of .089. The tests of correlation revealed no significant relationship between the small deviations of stems from root boundaries and retrieval effectiveness.

The purpose of the second experiment was to determine if conflation can be automated with no loss of retrieval effectiveness. Smith (1979), in an extensive survey of artificial intelligence techniques for information retrieval, stated that "the application of truncation to content terms cannot be done automatically to duplicate the use of truncation by intermediaries because any single rule [used by the conflation algorithm] has numerous exceptions" (p. 223). However, no empirical test of the relative effectiveness of automatic and manual conflation had been made. The hypothesis adopted for the second experiment was that stemming will perform as well as or better than manual conflation.

To test this hypothesis, searches based on 25 user need statements were analyzed. Data was derived by re-executing each search with stemmed terms in place of the manually conflated terms in the original searches. In addition, some user statements were searched using a stemmed title-abstract representation. The primary independent variable was conflation method--with the levels automatic (stemming) and manual. The blocking variable was queries, and searchers was included as an additional independent variable to increase statistical power. The dependent variables were E measures at the three trade-off levels described above. No significant difference was found between manual and automatic conflation, indicating that conflation can be automated with no significant loss in retrieval performance.

Walker and Jones (1987) did a thorough review and study of stemming algorithms. They used Porter's
stemming algorithm in the study. The database used was an on-line book catalog (called RCL) in a library. One of their findings was that since weak stemming, defined as step 1 of the Porter algorithm, gave less compression, stemming weakness could be defined by the amount of compression. They found that stemming significantly increases recall, and that weak stemming does not significantly decrease precision, but strong stemming does. Another finding was that the number of searches per session was about the same for weak and strong stemming. They recommend that weak stemming be used automatically, with strong stemming reserved for those cases where weak stemming finds no documents. Attempts to improve stemming performance by reweighting stemmed terms, and by stemming only shorter queries, failed in this study.

Harman (1991) used three stemmers--Porter, Lovins, and S removal--on three databases--Cranfield 1400, Medlars, and CACM--and found that none of them significantly improved retrieval effectiveness in a ranking IR system called IRX. The dependent variable measure was again E, at b levels of .5, 1.0, and 2.0.

8.3.1 Stemming Studies: Conclusion

Table 8.1 summarizes the various studies of stemming for improving retrieval effectiveness. These studies must be viewed with caution. Since some of the studies used sample sizes as small as 5 (Hafer and Weiss), their validity is questionable. The failure of some of the authors to report test statistics, especially effect sizes, makes interpretation difficult. Given these cautions, we offer the following conclusions.

1. Stemming can affect retrieval performance, but the studies are equivocal. Where effects have been found, the majority have been positive, with the Hafer and Weiss stemmer in the study by Lennon et al., and the effect of the strong stemmer in the Walker and Jones study, being the exceptions. Otherwise there is no evidence that stemming will degrade retrieval effectiveness.

2. Stemming is as effective as manual conflation.

3. Salton's results indicate that the effect of stemming is dependent on the nature of the vocabulary used. A specific and homogeneous vocabulary may exhibit different conflation properties than will other types of vocabularies.

4. There appears to be little difference between the retrieval effectiveness of different full stemmers, with the exception of the Hafer and Weiss stemmer, which gave poorer performance in the Lennon et al. study.

Table 8.1: Stemming Retrieval Effectiveness Studies Summary

Study             Question            Test Collection   Dependent Vars   Results
-----------------------------------------------------------------------
Salton            Full stemmer        IRE-3             14 DV's          Full > s
                  vs. suffix s        ADI                                Full > s
                  stemmer             Cranfield-1                        s > Full
-----------------------------------------------------------------------
Hafer and Weiss   HW stemmer vs.      ADI               R, P             HW > SMART
                  SMART stemmer
                  HW stemmer vs.      CPC                                HW = Manual
                  manual stemming
                  stemmed vs.         ADI, CPC                           Stemmed > Unstemmed
                  unstemmed
-----------------------------------------------------------------------
Van Rijsbergen    Porter vs.          Cranfield-1       R, P             Porter = Dawson
et al.            Dawson stemmer
-----------------------------------------------------------------------
Katzer et al.     stemmed vs.         CCA 12,000        R, P (highest)   stemmed = unstemmed
                  other reps.                           R, P (all)       stemmed > title,
                                                                        descriptors
-----------------------------------------------------------------------
Lennon et al.     stemmed vs.         Cranfield-1400    E(.5, 2)         stemmed > unstemmed
                  unstemmed                                              > HW stemmer
                  stemmer                                                stemmers =
                  comparison
-----------------------------------------------------------------------
Frakes            closeness of stem   Psychabs          E(.5, 1, 2)      No improvement
                  to root morpheme
                  improves IR
                  performance
                  truncation vs.                                         truncation = stemming
                  stemming
-----------------------------------------------------------------------
Walker and Jones  stemmed vs.         RCL               R                stemmed > unstemmed
                  unstemmed
                  weak stemmed vs.    RCL               P                weak stemmed =
                  unstemmed                                              unstemmed
                  strong stemmed      RCL               P                strong stemmed <
                  vs. unstemmed                                          unstemmed
-----------------------------------------------------------------------
Harman            stemmed vs.         Cranfield-1400    E(.5, 1, 2)      stemmed = unstemmed
                  unstemmed           Medlars              "                   "
                                      CACM                 "                   "
-----------------------------------------------------------------------

8.4 STEMMING TO COMPRESS INVERTED FILES

Since a stem is usually shorter than the words to which it corresponds, storing stems instead of full words can decrease the size of index files. For example, the indexing file for the Cranfield collection was 32.1 percent smaller after it was stemmed using the INSPEC stemmer. Lennon et al. (1981) report the following compression percentages for various stemmers and databases.

Index Compression Percentages from Stemming

Stemmer      Cranfield   National Physical   INSPEC   Brown Corpus
                         Laboratory
---------------------------------------------------------------------
INSPEC          32.1          30.9            38.5        41.6
Lovins          38.8          40.9            39.7        39.9
RADCOL          34.8          40.5            40.1        40.1
Porter          49.2          45.8            47.5        26.1
Frequency       30.2          33.1            41.2        30.5

It is obvious from this table that the savings in storage using stemming can be substantial.
For larger databases, however, reductions are smaller. Harman (1991) reports that for a database of 1.6 megabytes of source text, the index was reduced 20 percent, from .47 megabytes to .38 megabytes, using the Lovins stemmer. For a database of 50 megabytes, the index was reduced from 6.7 to 5.8 megabytes--a savings of 13.5 percent. She points out, however, that for real world databases that contain numbers, proper names, misspellings, and the like, the compression factors are not nearly so large.

Compression rates increase for affix removal stemmers as the number of suffixes increases, as the following table, also from the Lennon et al. study, shows.

Compression vs. Suffix Set Size: NPL Database

Number of Suffixes     Compression
------------------------------------
      100                 28.2
      200                 33.7
      300                 35.9
      400                 37.3
      500                 38.9
      600                 40.1
      700                 41.0
      800                 41.6

8.5 SUMMARY

Stemmers are used to conflate terms to improve retrieval effectiveness and/or to reduce the size of indexing files. Studies of the effects of stemming on retrieval effectiveness are equivocal, but in general stemming has either no effect, or a positive effect, on retrieval performance where the measures used include both recall and precision. Stemming will, in general, increase recall at the cost of decreased precision. Stemming can have a marked effect on the size of indexing files, sometimes decreasing the size of the files as much as 50 percent.

Several approaches to stemming were described--table lookup, affix removal, successor variety, and n-gram. The Porter stemmer was described in detail, and a full implementation of it, in C, is presented in the appendix to this chapter. In the next chapter, the thesaurus approach to conflating semantically related words is discussed.

REFERENCES

ADAMSON, G., and J. BOREHAM. 1974. "The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles." Information Storage and Retrieval, 10, 253-60.

BENTLEY, J. 1982. Writing Efficient Programs. Englewood Cliffs, N.J.: Prentice Hall.

DAWSON, J. 1974. "Suffix Removal and Word Conflation." ALLC Bulletin, Michelmas, 33-46.

FRAKES, W. B. 1982. Term Conflation for Information Retrieval. Doctoral Dissertation, Syracuse University.
FRAKES, W. B. 1984. "Term Conflation for Information Retrieval," in Research and Development in Information Retrieval, ed. C. J. van Rijsbergen. New York: Cambridge University Press.

FRAKES, W. B. 1986. "LATTIS: A Corporate Library and Information System for the UNIX Environment." Proceedings of the National Online Meeting, Medford, N.J.: Learned Information Inc., 137-42.

HAFER, M., and S. WEISS. 1974. "Word Segmentation by Letter Successor Varieties." Information Storage and Retrieval, 10, 371-85.

HARMAN, D. 1991. "How Effective is Suffixing?" Journal of the American Society for Information Science, 42(1), 7-15.

KATZER, J., M. MCGILL, J. TESSIER, W. FRAKES, and P. DAS GUPTA. 1982. "A Study of the Overlaps among Document Representations." Information Technology: Research and Development, 1, 261-73.

LENNON, M., D. PIERCE, B. TARRY, and P. WILLETT. 1981. "An Evaluation of Some Conflation Algorithms for Information Retrieval." Journal of Information Science, 3, 177-83.

LOVINS, J. B. 1968. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.

PAICE, C. 1990. "Another Stemmer." ACM SIGIR Forum, 24(3), 56-61.

PORTER, M. F. 1980. "An Algorithm for Suffix Stripping." Program, 14(3), 130-37.

SALTON, G. 1968. Automatic Information Organization and Retrieval. New York: McGraw-Hill.

SMITH, L. C. 1979. Selected Artificial Intelligence Techniques in Information Retrieval. Doctoral Dissertation, Syracuse University.

VAN RIJSBERGEN, C. J., S. E. ROBERTSON, and M. F. PORTER. 1980. New Models in Probabilistic Information Retrieval. Cambridge: British Library Research and Development Report no. 5587.

WALDSTEIN, R. 1981. "Diatom: A Dialog Simulator." Online, 5, 68-72.

WALKER, S., and R. M. JONES. 1987. "Improving Subject Retrieval in Online Catalogues." British Library Research Paper no. 24.

APPENDIX
/*************************** stem.c ****************************

   Version:    1.1

   Purpose:    Implementation of the Porter stemming algorithm.

   Provenance: Written by C. Fox, 1990. This program is a rewrite
               of earlier versions by B. Frakes and S. Cox.

   Notes:
****************************************************************/

/************************ Standard Include Files *************************/

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/******************** Private Function Declarations **********************/

static int WordSize( /* word */ );
static int ContainsVowel( /* word */ );
static int EndsWithCVC( /* word */ );
static int AddAnE( /* word */ );
static int RemoveAnE( /* word */ );
static int ReplaceEnd( /* word, rule */ );

/***************** Private Defines and Data Structures *******************/

#define FALSE  0
#define TRUE   1
#define EOS    '\0'

#define IsVowel(c) ('a' == (c) || 'e' == (c) || 'i' == (c) || 'o' == (c) || 'u' == (c))

typedef struct {
   int id;                  /* returned if rule fired */
   char *old_end;           /* suffix replaced */
   char *new_end;           /* suffix replacement */
   int old_offset;          /* from end of word to start of suffix */
   int new_offset;          /* from beginning to end of new suffix */
   int min_root_size;       /* min root word size for replacement */
   int (*condition)();      /* the replacement test function */
} RuleList;

static char LAMBDA[1] = "";  /* the constant empty string */

static char *end;            /* pointer to the end of the word */

static RuleList step1a_rules[] =
   {
   101,  "sses",     "ss",    3,  1, -1,  NULL,
   102,  "ies",      "i",     2,  0, -1,  NULL,
   103,  "ss",       "ss",    1,  1, -1,  NULL,
   104,  "s",        LAMBDA,  0, -1, -1,  NULL,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step1b_rules[] =
   {
   105,  "eed",      "ee",    2,  1,  0,  NULL,
   106,  "ed",       LAMBDA,  1, -1, -1,  ContainsVowel,
   107,  "ing",      LAMBDA,  2, -1, -1,  ContainsVowel,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step1b1_rules[] =
   {
   108,  "at",       "ate",   1,  2, -1,  NULL,
   109,  "bl",       "ble",   1,  2, -1,  NULL,
   110,  "iz",       "ize",   1,  2, -1,  NULL,
   111,  "bb",       "b",     1,  0, -1,  NULL,
   112,  "dd",       "d",     1,  0, -1,  NULL,
   113,  "ff",       "f",     1,  0, -1,  NULL,
   114,  "gg",       "g",     1,  0, -1,  NULL,
   115,  "mm",       "m",     1,  0, -1,  NULL,
   116,  "nn",       "n",     1,  0, -1,  NULL,
   117,  "pp",       "p",     1,  0, -1,  NULL,
   118,  "rr",       "r",     1,  0, -1,  NULL,
   119,  "tt",       "t",     1,  0, -1,  NULL,
   120,  "ww",       "w",     1,  0, -1,  NULL,
   121,  "xx",       "x",     1,  0, -1,  NULL,
   122,  LAMBDA,     "e",    -1,  0, -1,  AddAnE,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step1c_rules[] =
   {
   123,  "y",        "i",     0,  0, -1,  ContainsVowel,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step2_rules[] =
   {
   203,  "ational",  "ate",   6,  2,  0,  NULL,
   204,  "tional",   "tion",  5,  3,  0,  NULL,
   205,  "enci",     "ence",  3,  3,  0,  NULL,
   206,  "anci",     "ance",  3,  3,  0,  NULL,
   207,  "izer",     "ize",   3,  2,  0,  NULL,
   208,  "abli",     "able",  3,  3,  0,  NULL,
   209,  "alli",     "al",    3,  1,  0,  NULL,
   210,  "entli",    "ent",   4,  2,  0,  NULL,
   211,  "eli",      "e",     2,  0,  0,  NULL,
   213,  "ousli",    "ous",   4,  2,  0,  NULL,
   214,  "ization",  "ize",   6,  2,  0,  NULL,
   215,  "ation",    "ate",   4,  2,  0,  NULL,
   216,  "ator",     "ate",   3,  2,  0,  NULL,
   217,  "alism",    "al",    4,  1,  0,  NULL,
   218,  "iveness",  "ive",   6,  2,  0,  NULL,
   219,  "fulnes",   "ful",   5,  2,  0,  NULL,
   220,  "ousness",  "ous",   6,  2,  0,  NULL,
   221,  "aliti",    "al",    4,  1,  0,  NULL,
   222,  "iviti",    "ive",   4,  2,  0,  NULL,
   223,  "biliti",   "ble",   5,  2,  0,  NULL,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step3_rules[] =
   {
   301,  "icate",    "ic",    4,  1,  0,  NULL,
   302,  "ative",    LAMBDA,  4, -1,  0,  NULL,
   303,  "alize",    "al",    4,  1,  0,  NULL,
   304,  "iciti",    "ic",    4,  1,  0,  NULL,
   305,  "ical",     "ic",    3,  1,  0,  NULL,
   308,  "ful",      LAMBDA,  2, -1,  0,  NULL,
   309,  "ness",     LAMBDA,  3, -1,  0,  NULL,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step4_rules[] =
   {
   401,  "al",       LAMBDA,  1, -1,  1,  NULL,
   402,  "ance",     LAMBDA,  3, -1,  1,  NULL,
   403,  "ence",     LAMBDA,  3, -1,  1,  NULL,
   405,  "er",       LAMBDA,  1, -1,  1,  NULL,
   406,  "ic",       LAMBDA,  1, -1,  1,  NULL,
   407,  "able",     LAMBDA,  3, -1,  1,  NULL,
   408,  "ible",     LAMBDA,  3, -1,  1,  NULL,
   409,  "ant",      LAMBDA,  2, -1,  1,  NULL,
   410,  "ement",    LAMBDA,  4, -1,  1,  NULL,
   411,  "ment",     LAMBDA,  3, -1,  1,  NULL,
   412,  "ent",      LAMBDA,  2, -1,  1,  NULL,
   423,  "sion",     "s",     3,  0,  1,  NULL,
   424,  "tion",     "t",     3,  0,  1,  NULL,
   415,  "ou",       LAMBDA,  1, -1,  1,  NULL,
   416,  "ism",      LAMBDA,  2, -1,  1,  NULL,
   417,  "ate",      LAMBDA,  2, -1,  1,  NULL,
   418,  "iti",      LAMBDA,  2, -1,  1,  NULL,
   419,  "ous",      LAMBDA,  2, -1,  1,  NULL,
   420,  "ive",      LAMBDA,  2, -1,  1,  NULL,
   421,  "ize",      LAMBDA,  2, -1,  1,  NULL,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step5a_rules[] =
   {
   501,  "e",        LAMBDA,  0, -1,  1,  NULL,
   502,  "e",        LAMBDA,  0, -1, -1,  RemoveAnE,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

static RuleList step5b_rules[] =
   {
   503,  "ll",       "l",     1,  0,  1,  NULL,
   000,  NULL,       NULL,    0,  0,  0,  NULL,
   };

/*******************************************************************/
/********************* Private Function Definitions ***************/

/*FN****************************************************************
   WordSize( word )

   Returns: int -- a weird count of word size in adjusted syllables

   Purpose: Count syllables in a special way: count the number of
            vowel-consonant pairs in a word, disregarding initial
            consonants and final vowels. The letter "y" counts as a
            consonant at the beginning of a word and when it has a
            vowel in front of it; otherwise (when it follows a
            consonant) it is treated as a vowel. For example, the
            WordSize of "cat" is 1, of "any" is 1, of "amount" is 2,
            of "anything" is 3.

   Plan:    Run a DFA to compute the word size.

   Notes:   The easiest and fastest way to compute this funny measure
            is with a finite state machine. The initial state 0 checks
            the first letter. If it is a vowel, then the machine
            changes to state 1, which is the "last letter was a vowel"
            state. If the first letter is a consonant or y, then it
            changes to state 2, the "last letter was a consonant"
            state. In state 1, a y is treated as a consonant (since it
            follows a vowel), but in state 2, y is treated as a vowel
            (since it follows a consonant). The result counter is
            incremented on the transition from state 1 to state 2,
            since this transition only occurs after a vowel-consonant
            pair, which is what we are counting.
**/

static int
WordSize( word )
   register char *word;     /* in: word having its WordSize taken */
{
   register int result;     /* WordSize of the word */
   register int state;      /* current state in machine */

   result = 0;
   state = 0;

   /* Run a DFA to compute the word size */
   while ( EOS != *word )
      {
      switch ( state )
         {
         case 0: state = (IsVowel(*word)) ? 1 : 2;
                 break;
         case 1: state = (IsVowel(*word)) ? 1 : 2;
                 if ( 2 == state ) result++;
                 break;
         case 2: state = (IsVowel(*word) || ('y' == *word)) ? 1 : 2;
                 break;
         }
      word++;
      }

   return( result );
} /* WordSize */

/*FN****************************************************************
   ContainsVowel( word )

   Returns: int -- TRUE (1) if the word parameter contains a vowel,
            FALSE (0) otherwise.

   Purpose: Some of the rewrite rules apply only to a root containing
            a vowel, where a vowel is one of "aeiou" or y with a
            consonant in front of it.

   Plan:    Obviously, under the definition of a vowel, a word
            contains a vowel iff either its first letter is one of
            "aeiou", or any of its other letters are "aeiouy". The
            plan is to test this condition.

   Notes:   None
**/

static int
ContainsVowel( word )
   register char *word;     /* in: buffer with word checked */
{
   if ( EOS == *word )
      return( FALSE );
   else
      return( IsVowel(*word) || (NULL != strpbrk(word+1,"aeiouy")) );
} /* ContainsVowel */

/*FN****************************************************************
   EndsWithCVC( word )

   Returns: int -- TRUE (1) if the current word ends with a
            consonant-vowel-consonant combination, and the second
            consonant is not w, x, or y, FALSE (0) otherwise.

   Purpose: Some of the rewrite rules apply only to a root with
            this characteristic.

   Plan:    Look at the last three characters.

   Notes:   None
**/

static int
EndsWithCVC( word )
   register char *word;     /* in: buffer with the word checked */
{
   int length;              /* for finding the last three characters */

   if ( (length = strlen(word)) < 3 )   /* three characters are examined, */
      return( FALSE );                  /* so shorter words cannot qualify */
   else
      {
      end = word + length - 1;
      return(    (NULL == strchr("aeiouwxy", *end--))     /* consonant */
              && (NULL != strchr("aeiouy",   *end--))     /* vowel */
              && (NULL == strchr("aeiou",    *end  )) );  /* consonant */
      }
} /* EndsWithCVC */

/*FN****************************************************************
   AddAnE( word )

   Returns: int -- TRUE (1) if the current word meets special
            conditions for adding an e.

   Purpose: Rule 122 applies only to a root with this characteristic.

   Plan:    Check for size of 1 and a consonant-vowel-consonant
            ending.

   Notes:   None
**/

static int
AddAnE( word )
   register char *word;
{
   return( (1 == WordSize(word)) && EndsWithCVC(word) );
} /* AddAnE */

/*FN****************************************************************
   RemoveAnE( word )

   Returns: int -- TRUE (1) if the current word meets special
            conditions for removing an e.

   Purpose: Rule 502 applies only to a root with this characteristic.

   Plan:    Check for size of 1 and no consonant-vowel-consonant
            ending.

   Notes:   None
**/

static int
RemoveAnE( word )
   register char *word;
{
   return( (1 == WordSize(word)) && !EndsWithCVC(word) );
} /* RemoveAnE */

/*FN****************************************************************
   ReplaceEnd( word, rule )

   Returns: int -- the id for the rule fired, 0 if none is fired

   Purpose: Apply a set of rules to replace the suffix of a word

   Plan:    Loop through the rule set until a match meeting all
            conditions is found. If a rule fires, return its id,
            otherwise return 0. Conditions on the length of the root
            are checked as part of this function's processing because
            this check is so often made.

   Notes:   This is the main routine driving the stemmer. It goes
            through a set of suffix replacement rules looking for a
            match on the current suffix. When it finds one, if the
            root of the word is long enough, and it meets whatever
            other conditions are required, then the suffix is
            replaced, and the function returns.
**/

static int
ReplaceEnd( word, rule )
   register char *word;     /* in/out: buffer with the stemmed word */
   RuleList *rule;          /* in: data structure with replacement rules */
{
   register char *ending;   /* set to start of possible stemmed suffix */
   char tmp_ch;             /* save replaced character when testing */

   while ( 0 != rule->id )
      {
      ending = end - rule->old_offset;
      if ( word <= ending )
         if ( 0 == strcmp(ending, rule->old_end) )
            {
            tmp_ch = *ending;
            *ending = EOS;
            if ( rule->min_root_size < WordSize(word) )
               if ( !rule->condition || (*rule->condition)(word) )
                  {
                  (void)strcat( word, rule->new_end );
                  end = ending + rule->new_offset;
                  break;
                  }
            *ending = tmp_ch;
            }
      rule++;
      }

   return( rule->id );
} /* ReplaceEnd */

/************************************************************************/
/********************* Public Function Declarations ********************/

/*FN****************************************************************
   Stem( word )

   Returns: int -- FALSE (0) if the word contains non-alphabetic
            characters and hence is not stemmed, TRUE (1) otherwise

   Purpose: Stem a word

   Plan:    Part 1: Check to ensure the word is all alphabetic
            Part 2: Run through the Porter algorithm
            Part 3: Return an indication of successful stemming

   Notes:   This function implements the Porter stemming algorithm,
            with a few additions here and there. See:

               Porter, M. F. "An Algorithm For Suffix Stripping."
               Program 14 (3), July 1980, pp. 130-137.

            Porter's algorithm is an ad hoc set of rewrite rules with
            various conditions on rule firing. The terminology of
            "step 1a" and so on, is taken directly from Porter's
            article, which unfortunately gives almost no justification
            for the various steps. Thus this function more or less
            faithfully reflects the opaque presentation in the
            article. Changes from the article amount to a few
            additions to the rewrite rules; these are marked in the
            RuleList data structures with comments.
**/

int
Stem( word )
   register char *word;     /* in/out: the word stemmed */
{
   int rule;                /* which rule is fired in replacing an end */

   /* Part 1: Check to ensure the word is all alphabetic */
   for ( end = word; *end != EOS; end++ )
      if ( !isalpha(*end) ) return( FALSE );
   end--;

   /* Part 2: Run through the Porter algorithm */
   ReplaceEnd( word, step1a_rules );
   rule = ReplaceEnd( word, step1b_rules );
   if ( (106 == rule) || (107 == rule) )
      ReplaceEnd( word, step1b1_rules );
   ReplaceEnd( word, step1c_rules );

   ReplaceEnd( word, step2_rules );

   ReplaceEnd( word, step3_rules );

   ReplaceEnd( word, step4_rules );

   ReplaceEnd( word, step5a_rules );
   ReplaceEnd( word, step5b_rules );

   /* Part 3: Return an indication of successful stemming */
   return( TRUE );
} /* Stem */
Information Retrieval: CHAPTER 9: THESAURUS CONSTRUCTION

CHAPTER 9: THESAURUS CONSTRUCTION

Padmini Srinivasan
University of Iowa

Abstract

Thesauri are valuable structures for Information Retrieval systems. A thesaurus provides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate terms. Additionally, the thesaurus can assist the searcher in reformulating search strategies if required. This chapter first examines the important features of thesauri; this should allow the reader to differentiate between thesauri. Next, a brief overview of the manual thesaurus construction process is given. Two major approaches for automatic thesaurus construction have been selected for detailed examination. The first is on thesaurus construction from collections of documents, and the second on thesaurus construction by merging existing thesauri. These two methods were selected since they rely on statistical techniques alone and are also significantly different from each other. Programs written in the C language accompany the discussion of these approaches.

9.1 INTRODUCTION

Roget's thesaurus is very different from the average thesaurus associated with Information Retrieval (IR) systems accessing machine readable document databases. The former is designed to assist the writer in creatively selecting vocabulary. In IR systems, the thesaurus serves to coordinate the basic processes of indexing and document retrieval. (The term document is used here generically and may refer to books, articles, magazines, letters, memoranda, and also software.) In indexing, a succinct representation of the document is derived, while retrieval refers to the search process by which relevant items are identified. The IR thesaurus typically contains a list of terms, where a term is either a single word or a phrase, along with the relationships between them. It provides a common, precise, and controlled vocabulary which assists in coordinating indexing and retrieval. Given this objective, it is clear that thesauri are designed for specific subject areas and are therefore domain dependent.

Figure 9.1 displays a short extract from an alphabetical listing of thesaurus terms and their relationships in the INSPEC thesaurus. This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, as well as computers and control. The thesaurus is logically organized as a set of hierarchies. Each hierarchy is built from a root term representing a high-level concept in the domain. In addition to this hierarchical arrangement, the printed INSPEC thesaurus also includes an alphabetical listing of thesaural terms.

In Figure 9.1, UF is utilized to indicate the chosen form from a set of alternatives: "computer-aided instruction" is used for the alternative "teaching machines." The converse of UF is USE, which takes the user from a form that is not part of the vocabulary to a valid alternative; the example shows that "caesium" is the preferred form over "cesium." The "see also" link leads to cross-referenced thesaural terms. BT provides a more general thesaural term; similarly, NT suggests a more specific thesaural term. In the above example, "computer-aided instruction" is a part of the hierarchy whose root node or top term (TT) is "computer applications." RT signifies a related term, and this relationship includes a variety of links such as part-whole and object-property. CC and FC are from the INSPEC classification scheme and indicate the subject area in which the term is used.

cesium
   USE caesium

computer-aided instruction
   see also education
   UF teaching machines
   BT educational computing
   TT computer applications
   RT education
      teaching
   CC C7810C
   FC 7810Cf

Figure 9.1: A short extract from the 1979 INSPEC thesaurus

file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD...ooks_Algorithms_Collection2ed/books/book5/chap09.htm (1 of 114)7/3/2004 4:20:34 PM

A carefully designed thesaurus can be of great value. In indexing, the indexer is typically instructed to select the most appropriate thesaural entries for representing the document. In searching, the user can employ the thesaurus to design the most appropriate search strategy. If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary. Similarly, if the search does not retrieve enough documents, the thesaurus can be used to expand the query by following the various links between terms. In this way the thesaurus can be valuable for reformulating search strategies. It is quite common to provide on-line thesauri, which simplifies the query reformulation process. In fact, such query modifications can also be more system initiated than user initiated. However, this obviously requires rather complex algorithms since the system has to know not only how to reformulate the query but also when to do so.

There exists a vast literature on the principles, methodologies, and problems involved in thesaurus construction. However, only a small portion is devoted to the "automatic" construction of thesauri. This mirrors the current state of the art, marked by an abundance of manually generated thesauri. In fact, there is even much skepticism over the possibility of fully automating this procedure. This is because manual thesauri are highly complex structures which exhibit a variety of relationships including hierarchical, nonhierarchical, equivalent, and associative ones. The automatic detection of such relationships continues to be challenging. However, the reader can be assured that some automatic methodologies have emerged over the last few decades. As in most other subfields of IR, these methods are strongly motivated by statistics. In contrast, manual thesaurus construction is a highly conceptual
and knowledge-intensive task and therefore also extremely labor intensive. Therefore, the search for alternative automatic methods is definitely of value.

The rest of this chapter is organized into six sections. The next, section 9.2, describes features of thesauri. Section 9.3 introduces the different approaches for automatic thesaurus construction following a brief description of manual thesaurus construction. Sections 9.4 and 9.5 focus on two major approaches for automatic thesaurus construction. The three C programs included with this chapter are described briefly in section 9.6. The last section presents the conclusions.

9.2 FEATURES OF THESAURI

Some important features of thesauri will be highlighted here. The objective is for the reader to be able to compare thesauri. For a more detailed discussion, please consult Aitchison and Gilchrist (1972). In general, the discussion applies to both manually and automatically generated thesauri; differences between the two are also identified where appropriate.

9.2.1 Coordination Level

Coordination refers to the construction of phrases from individual terms. Two distinct coordination options are recognized in thesauri: precoordination and postcoordination. A precoordinated thesaurus is one that can contain phrases. Consequently, phrases are available for indexing and retrieval. A postcoordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching. The advantage in precoordination is that the vocabulary is very precise, thus reducing ambiguity in indexing and in searching. Commonly accepted phrases become part of the vocabulary. The disadvantage is that the searcher has to be aware of the phrase construction rules employed. The advantage in postcoordination is that the user need not worry about the exact ordering of the words in a phrase. Phrase combinations can be created as and when appropriate during searching. The disadvantage is that search precision may fall, as illustrated by the following well known example, from Salton and McGill (1983): the distinction between phrases such as "Venetian blind" and "blind Venetian" may be lost. A more likely example is "library school" and "school library." The problem is that unless search strategies are designed carefully, irrelevant items may also be retrieved.

The choice between the two options is difficult. It should be recognized that the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size. It also implies an increase in the number of relationships to be encoded. Consequently, the thesaurus becomes more complex. Precoordination is more common in manually constructed thesauri. Automatic phrase construction is still quite difficult, and therefore automatic thesaurus construction usually implies postcoordination. However, section 9.4 includes a procedure for automatic phrase construction. Thesauri can adopt an intermediate level of coordination by allowing both phrases and single words. This is typical of manually constructed thesauri. However, even within this group there is significant variability in terms of coordination level. Some thesauri may emphasize two or three word phrases, while others may emphasize even larger sized phrases. Therefore, it is insufficient to state that two thesauri are similar simply because they follow precoordination; the level of coordination is important as well.

9.2.2 Term Relationships

Term relationships are the most important aspect of thesauri since the vocabulary connections they provide are
most valuable for retrieval. Many kinds of relationships are expressed in a manual thesaurus. These are semantic in nature and reflect the underlying conceptual interactions between terms. We do not provide an exhaustive discussion or listing of term relationships here; we only try to illustrate the variety of relationships that exist. Identifying these relationships requires knowledge of the domain for which the thesaurus is being designed.

Aitchison and Gilchrist (1972) specify three categories of term relationships: (1) equivalence relationships, (2) hierarchical relationships, and (3) nonhierarchical relationships. Equivalence relations include both synonymy and quasi-synonymy. Synonyms can arise because of trade names, popular and local usage, superseded terms, and the like. Quasi-synonyms are terms which for the purpose of retrieval can be regarded as synonymous, for example, "genetics" and "heredity," which have significant overlap in meaning. Also, the terms "harshness" and "tenderness," which represent different viewpoints of the same property continuum. A typical example of a hierarchical relation is genus--species, such as "dog" and "german shepherd." Nonhierarchical relationships also identify conceptually related terms. There are many examples including: thing--part, such as "bus" and "seat"; and thing--attribute, such as "rose" and "fragrance."

Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of: (1) parts--wholes, (2) collocation relations, (3) paradigmatic relations, (4) taxonomy and synonymy, and (5) antonymy relations. Parts and wholes include examples such as set--element and count--mass. Collocation relates words that frequently co-occur in the same phrase or sentence. Paradigmatic relations relate words that have the same semantic core, like "moon" and "lunar," and are somewhat similar to Aitchison and Gilchrist's quasi-synonymy relationship. Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms.

It should be noted that the relative value of these relationships for retrieval is not clear. However, at least some of these semantic relationships are commonly included in manual thesauri. Most if not all of these semantic relationships are difficult to identify by automatic methods, especially by algorithms that exploit only the statistical relationships between terms as exhibited in document collections. However, some work has been done in this direction, as in Fox (1981) and Fox et al. (1988). Therefore, it should be clear that these statistical associations are only as good as their ability to reflect the more interesting and important semantic associations between terms.

9.2.3 Number of Entries for Each Term

It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved due to the presence of homographs--words with multiple meanings. Also, the semantics of each instance of a homograph can only be contextually deciphered. Therefore, it is more realistic to have a unique representation or entry for each meaning of a homograph. This also allows each homograph entry to be associated with its own set of relations. In a manually constructed thesaurus such as INSPEC, this problem is resolved by the use of parenthetical qualifiers, as in the pair of homographs, bonds (chemical) and bonds (adhesive). Typically the user has to select between alternative meanings. The problem is that multiple term entries add a degree of complexity in using the thesaurus--especially if it is to be used automatically.

9.2.4 Specificity of Vocabulary

Specificity of the thesaurus vocabulary is a function of the precision associated with the component terms. A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval. The concomitant disadvantage is that the size of the vocabulary grows, since a large number of specific terms is needed.
Also, specific terms tend to change (i.e., evolve) more rapidly than general terms. Therefore, such vocabularies tend to require more regular maintenance. Further, high specificity implies a high coordination level, which in turn implies that the user has to be more concerned with the rules for phrase construction. This is certainly nontrivial and often viewed as a major hurdle during searching (Frost 1987).

9.2.5 Control on Term Frequency of Class Members

This has relevance mainly for statistical thesaurus construction methods, which work by partitioning the vocabulary into a set of classes where each class contains a collection of equivalent terms. Salton and McGill (1983, 77-78) have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies. Further, the total frequency in each class should also be roughly similar. (Appropriate frequency counts for this include: the number of term occurrences in the document collection; the number of documents in the collection in which the term appears at least once.) These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes. In other words, terms within the same class should be equally specific, and the specificity across classes should also be the same.

9.2.6 Normalization of Vocabulary

Normalization of vocabulary terms is given considerable emphasis in manual thesauri. There are extensive rules which guide the form of the thesaural entries. A simple rule is that terms should be in noun form. A second rule is that noun phrases should avoid prepositions unless they are commonly known. Also, a limited number of adjectives should be used. There are other rules to direct issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation. (These are discussed in more detail later.) The advantage in normalizing the vocabulary is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary. As a result, the user does not have to worry about variant forms of a term. The obvious disadvantage is that, in order to be used effectively, the user has to be well aware of the normalization rules used. Therefore, this feature can be regarded both as a weakness and a strength. In contrast, normalization rules in automatic thesaurus construction are simpler, seldom involving more than stoplist filters and stemming.

These different features of thesauri have been presented so that the reader is aware of some of the current differences between manual and automatic thesaurus construction methods. In other words, manual thesauri are designed using a significant set of constraints and rules regarding the structure of individual terms. All features are not equally important, and they should be weighed according to the application for which the thesaurus is being designed. Given the growing abundance of large-sized document databases, it is indeed important to be challenged by the gaps between manual and automatic thesauri. This section also gives an idea of where further research is required.

9.3 THESAURUS CONSTRUCTION

9.3.1 Manual Thesaurus Construction

The process of manually constructing a thesaurus is both an art and a science. We present here only a brief overview of this complex process.
It is a long process that involves a group of individuals and a variety of resources. First, one has to define the boundaries of the subject area. (In automatic construction, this step is simple, since the boundaries are taken to be those defined by the area covered by the document database.) Boundary definition includes identifying central subject areas and peripheral ones, since it is unlikely that all topics included are of equal importance. Once this is completed, the domain is generally partitioned into divisions or subareas. Once the domain, with its subareas, has been sufficiently defined, the desired characteristics of the thesaurus have to be identified. Now the collection of terms for each subarea may begin. A variety of sources may be used for this, including indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and relevant thesauri or vocabulary systems. Subject experts and potential users of the thesaurus should also be included in this step.

After the initial vocabulary has been identified, each term is analyzed for its related vocabulary, including synonyms, broader and narrower terms, and sometimes also definitions and scope notes. These terms and their relationships are then organized into structures such as hierarchies. The process of organizing the vocabulary may reveal gaps which can lead to the addition of terms, identify the need for new levels in the hierarchies, bring together synonyms that were not previously recognized, suggest new relationships between terms, and reduce the vocabulary size. Once the initial organization has been completed, the entire thesaurus will have to be reviewed (and refined) to check for consistency, such as in phrase form and word form. At this stage the hierarchically structured thesaurus has to be "inverted" to produce an alphabetical arrangement of entries--a more effective arrangement for use. Typically both the alphabetical and hierarchical arrangements are provided in a thesaurus. Finally, the manually generated thesaurus is ready to be tested by subject experts and edited to incorporate their suggestions.

Once the thesaurus has been designed and implemented for use within a retrieval system, the next problem is that it needs to be maintained in order to ensure its continued viability and effectiveness. That is, the thesaurus should reflect any changes in the terminology of the area. Typically a thesaurus evolves with time and slowly responds to changes in the terminology of the subject. Updates are typically slow and again involve several individuals who regularly review and suggest new and modified vocabulary terms as well as relationships. The problem is that since the older documents are still within the system, the updated thesaurus must also retain the older information. Special problems arise in incorporating terms from existing thesauri, which may for instance have different formats and construction rules. The above informal description is very sketchy.

9.3.2 Automatic Thesaurus Construction

In selecting automatic thesaurus construction approaches for discussion here, the criteria used are that they should be quite different from each other in addition to being interesting. Consequently, the two major approaches selected here have not necessarily received equal attention in the literature. The first approach, on designing thesauri from document collections, is a standard one. That is, thesauri are built using purely statistical techniques. (The alternative is to use linguistic methods.) The second, on merging existing thesauri, is better known using manual methods. Since manual thesauri are more complex structurally than automatic ones, there are more decisions to be made. Programs included with this chapter are based on these two major approaches. We also discuss a third automatic approach which is quite novel and interesting. In this approach, thesauri are built using information obtained from users, although it is based on tools from expert systems and does not use statistical methods. We first present a brief overview of each approach and then provide detailed descriptions.
9.3.3 From a Collection of Document Items

Here the idea is to use a collection of documents as the source for thesaurus construction. This assumes that a representative body of text is available. The idea is to apply statistical procedures to identify important terms as well as their significant relationships. It is reiterated here that the central thesis in applying statistical methods is to use computationally simpler methods to identify the more important semantic knowledge for thesauri. It is semantic knowledge that is used by both indexer and searcher. Work by Soergel (1974) is relevant to this point, since it includes an interesting discussion on the various semantic interpretations of significant statistical associations between words. Until more direct methods are known, statistical methods will continue to be used. The first two programs, select.c and hierarchy.c, included with this chapter are based on this approach of designing a thesaurus from a collection of documents.

9.3.4 By Merging Existing Thesauri

This second approach is appropriate when two or more thesauri for a given subject exist that need to be merged into a single unit. If a new database can indeed be served by merging two or more existing thesauri, then a merger perhaps is likely to be more efficient than producing the thesaurus from scratch. The challenge is that the merger should not violate the integrity of any component thesaurus. This approach has been discussed at some length in Forsyth and Rada (1986). Rada's focus in his experiments has been on developing suitable algorithms for merging related but separate thesauri such as MeSH and SNOMED and also in evaluating the end products. MeSH stands for Medical Subject Headings and is the thesaurus used in MEDLINE, a medical document retrieval system, constructed and maintained by the National Library of Medicine. It provides a sophisticated controlled vocabulary for indexing and accessing medical documents. SNOMED, which stands for Systematized Nomenclature of Medicine, is a detailed thesaurus developed by the College of American Pathologists for use in hospital records. Both MeSH and SNOMED follow a hierarchical structure. MeSH terms are used to describe documents, while SNOMED terms are for describing patients. Therefore, a patient can be completely described by choosing one or more terms from each of several categories in SNOMED. Rada has experimented with augmenting the MeSH thesaurus with selected terms from SNOMED (Forsyth and Rada 1986, 216). The third program (merge.c) included here implements two different merging algorithms adapted from Rada's work.

9.3.5 User Generated Thesaurus

In this third alternative, the idea is that users of IR systems are aware of and use many term relationships in their search strategies long before these find their way into thesauri. The objective is to capture this knowledge from the user's search. This is the basis of TEGEN--the thesaurus generating system designed by Guntzer et al. (1988). They propose TEGEN as a viable alternative technique for automatic thesaurus construction. TEGEN is designed using production rules which perform a detailed analysis of the user's search pattern. The procedure involves examining the types of Boolean operators used between search terms, the type of query modification performed by the user, and so on. User feedback is included to resolve any ambiguities and uncertainties. Feedback is also required to select the specific type of relationship between two terms once it has been decided that the terms are indeed related. Therefore, their approach requires considerable interaction with the user population. Given that this approach utilizes mainly expert system methodologies, no representative program is included here.
9.4 THESAURUS CONSTRUCTION FROM TEXTS

The overall process may be divided into three stages: (1) Construction of vocabulary: This involves normalization and selection of terms. It also includes phrase construction, depending on the coordination level desired. (2) Similarity computations between terms: This step identifies the significant statistical associations between terms. (3) Organization of vocabulary: Here the selected vocabulary is organized, generally into a hierarchy, on the basis of the associations computed in step 2. The first program, select.c, implements procedures for stages 1 and 2. Program hierarchy.c implements one method for organizing the vocabulary in the third stage.

9.4.1 Construction of Vocabulary

The objective here is to identify the most informative terms (words and phrases) for the thesaurus vocabulary from document collections. The first step is to identify an appropriate document collection. The only loosely stated criterion is that the collection should be sizable and representative of the subject area. The next step is to determine the required specificity for the thesaurus. If high specificity is needed, then the emphasis will be on identifying precise phrases; otherwise, more general terms can be sought. Terms can be selected from titles, abstracts, or even the full text of the documents if available. This initial set of vocabulary terms is now ready for normalization. The simplest and most common normalization procedure is to eliminate very trivial words such as prepositions and conjunctions. For this, an appropriate stoplist of trivial words needs to be constructed. The article by Fox (1989-1990) on the construction of stoplists may be useful here. The next standard normalization procedure is to stem the vocabulary. Stemming reduces each word into its root form. For example, the terms "information," "informing," and "informed" could all be stemmed into the same root "inform." The chapter in this book on stemming algorithms may be consulted for this. The resulting pool of stems is now ready to be analyzed statistically with two objectives: first, to select the most interesting stems as discussed in the following section, and second, to create interesting phrases for a higher coordination level as discussed in the section on phrase construction.

Stem evaluation and selection

There are a number of methods for statistically evaluating the worth of a term. The ones we discuss here are: (1) selection of terms based on frequency of occurrence, (2) selection of terms based on Discrimination Value, and (3) selection of terms based on the Poisson model. Program select.c includes routines for all three methods, which are described briefly below.

Selection by Frequency of Occurrence: This is one of the oldest methods and is based on Luhn's work, which has been extensively discussed in the literature; see, for example, Salton and McGill (1983). The basic idea is that each term may be placed in one of three different frequency categories with respect to a collection of documents: high, medium, and low frequency. Terms in the mid-frequency range are the best for indexing and searching. Terms in the low-frequency range have minimal impact on retrieval, while high-frequency terms are too general and negatively impact search precision. High-frequency terms are generally coordinated into phrases to make them more specific (see the later section on phrase construction for this). Salton recommends creating term classes for the low-frequency terms; however, it is not evident how to do this automatically. Threshold frequencies are generally not fixed and therefore user specified. Program select.c includes a routine for this selection method. Threshold frequencies are specified by the user via the global variables LOW_THRESHOLD and HIGH_THRESHOLD.
Selection by Discrimination Value (DV): DV measures the degree to which a term is able to discriminate or distinguish between the documents of the collection, as described by Salton and Yang (1973). The more discriminating a term, the higher its value as an index term. The overall procedure is to compute the average interdocument similarity in the collection, using some appropriate similarity function. Next, the term k being evaluated is removed from the indexing vocabulary and the same average similarity is recomputed. The discrimination value (DV) for the term is then computed as:

DV(k) = (Average similarity without k) - (Average similarity with k)

Good discriminators are those that decrease the average similarity by their presence, that is, those for which DV is positive. Poor discriminators have negative DV, while neutral discriminators have no effect on average similarity. Terms that are positive discriminators can be included in the vocabulary and the rest rejected.

Program select.c includes a routine called dv-all which computes DV for all terms with respect to a collection. The algorithm used is a straightforward one using the method of centroids. In this method the average interdocument similarity is computed by first calculating the average document vector or centroid vector for the collection. This is performed by a routine called centroid. Next, the similarity between every document and this centroid is calculated. This generates the total similarity in the collection, which can then be used to calculate the average similarity. In the program, average similarity with all terms intact is referred to as the baseline. For large collections of documents, a more efficient algorithm such as that based on the cover coefficient concept may be tried, as suggested by Can and Ozkarahan (1987). However, this is not done here.

Selection by the Poisson Method: This is based on the work by Bookstein and Swanson (1974), Harter (1975), and Srinivasan (1990) on the family of Poisson models. The Poisson distribution is a discrete random distribution that can be used to model a variety of random phenomena, including the number of typographical errors in a page of writing and the number of red cars on a highway per hour. In all the research that has been performed on the family of Poisson models, the one significant result is that trivial words have a single Poisson distribution, while the distribution of nontrivial words deviates significantly from that of a Poisson distribution. This result is used here to select nontrivial words as thesaurus terms: terms whose distributions deviate significantly are selected to be in the vocabulary. The get-Poisson-dist routine in the select.c program prints out the distributions in a collection for all terms in the inverted file. Two distributions are produced for each term: the actual distribution and the one expected under the assumption of a single Poisson model. These distributions will have to be compared using the chi-square test to identify any significant differences; the reader is referred to Harter (1975) for information on these chi-square comparisons.

Phrase construction

This step may be used to build phrases if desired. As mentioned before, this decision is influenced by the coordination level selected. Also, phrase construction can be performed to decrease the frequency of high-frequency terms and thereby increase their value for retrieval. Two methods are described below. Given insufficient details for the second method, it is not implemented here.
Salton and McGill Procedure: This is the standard one proposed in Salton and McGill (1983, 133-34) and adapted by Rada in Forsyth and Rada (1986, 85). Their procedure is a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases. Basically, a couple of general criteria are used. First, the component words of a phrase should occur frequently in a common context, such as the same sentence. More stringent contextual criteria are possible, such as requiring that the words also appear within some specified distance. The second general requirement is that the component words should represent broad concepts, and their frequency of occurrence should be sufficiently high. These criteria motivate their algorithm, which is described below.

1. Compute pairwise co-occurrence for high-frequency words. (Any suitable contextual constraint such as the ones above may be applied in selecting pairs of terms.)

2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.

3. For pairs that qualify, compute the cohesion value. Two formulas for computing cohesion are given below. Both ti and tj represent terms, and size-factor is related to the size of the thesaurus vocabulary.

COHESION(ti, tj) = size-factor * (co-occurrence-frequency / (total-frequency(ti) * total-frequency(tj)))   (Salton and McGill 1983, 200)

COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))   (Rada 1986, 85)

4. If cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase.

The routine cohesion in select.c is an implementation of the Rada formula.

Choueka Procedure: The second phrase construction method is based on the work by Choueka (1988). He proposes a rather interesting and novel approach for identifying collocational expressions, by which he refers to phrases whose meaning cannot be derived in a simple way from that of the component words, for example, "artificial intelligence." The algorithm proposed is statistical and combinatorial and requires a large collection (at least a million items) of documents to be effective. The author has been quite successful in identifying meaningful phrases and is apparently extending the algorithm. Unfortunately, some critical details are missing from their paper. Therefore, we do not include a program for this complete algorithm. However, since it is an interesting approach, we include a sketchy algorithm, which is described below.

1. Select the range of length allowed for each collocational expression. Example: two to six words.

2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency (again, a preset value).

3. Delete sequences that begin or end with a trivial word. The trivial words include prepositions, conjunctions, pronouns, articles, and so on.

4. Delete expressions that contain high-frequency nontrivial words.

5. Given an expression such as a b c d, evaluate any potential subexpressions such as a b c and b c d for relevance. Discard any that are not sufficiently relevant. (It is not clear from the paper how relevance is decided, but perhaps it is also based on frequency.)

6. Try to merge smaller expressions into larger and more meaningful ones. For example, a b c and b c d may merge to form a b c d. (Again, the exact criteria for allowing a merger are not given.)

The main difference between this procedure and the previous one is that this one considers phrases that have more than two words. Choueka's procedure also allows phrases to be substituted by longer and more general phrases, as in step 6. It is of course possible to extend the previous procedure to include phrases with more than two words. However, the criteria used for allowing this should be carefully formulated. Also, computing cohesion is likely to be more challenging than simply applying the formula recursively over larger sized phrases.

9.4.2 Similarity Computation

Once the appropriate thesaurus vocabulary has been identified, and phrases have been designed if necessary, the next step is to determine the statistical similarity between pairs of terms. There are a number of similarity measures available in the literature. An extensive study has been done on the comparison of different similarity measures (McGill et al. 1979). As suggested in Salton and McGill (1983) and Soergel (1974), program select.c includes two similarity routines:

1. Cosine: which computes the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second.

2. Dice: which computes the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.

Either measure can be used to assess the similarity or association between terms in a document collection. The algorithms implemented in select.c can be made more accurate by including only those instances in each numerator wherein the two terms co-occur in the same sentence and within some specified distance (that is, in a unit smaller than the entire document). This makes the criteria for similarity more stringent.
9.4.3 Vocabulary Organization

Once the statistical term similarities have been computed, the last step is to impose some structure on the vocabulary, which usually means a hierarchical arrangement of term classes. For this, any appropriate clustering program can be used. A standard clustering algorithm generally accepts all pairwise similarity values corresponding to a collection of objects and uses these similarity values to partition the objects into clusters or classes such that objects within a cluster are more similar than objects in different clusters. Some clustering algorithms can also generate hierarchies. It should be noted that there are major differences between available clustering algorithms, and the selection should be made after carefully studying their characteristics. Given the chapter in this book on clustering, such programs are not included here. Instead, we include the implementation of an alternative simple procedure to organize the vocabulary, which is described in Forsyth and Rada (1986, 200-01). Besides illustrating the procedure, hierarchy.c also includes appropriate data structures for storing thesauri organized as hierarchies.

The algorithm implemented in hierarchy.c is quite different from the standard clustering algorithms and is based on the following assumptions: (1) high-frequency words have broad meaning, while low-frequency words have narrow meaning; and (2) if the density functions of two terms, p and q (of varying frequencies), have the same shape, then the two words have similar meaning. As a consequence of these two assumptions, if p is the term with the higher frequency, then q becomes a child of p. These two assumptions motivate the entire procedure, as outlined below.

1. Identify a set of frequency ranges.

2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1. There will be one term class for each frequency range.

3. The highest frequency class is assigned level 0, the next, level 1, and so on.

4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.

5. After all terms in level i have been linked to level i-1 terms, check level i-1 terms and identify those that have no children. Propagate such terms to level i by creating an identical "dummy" term as its child.

6. Perform steps 4 and 5 for each level starting with level 1.

9.5 MERGING EXISTING THESAURI

The third program, merge.c, is designed to merge different hierarchies. This is useful when different thesauri (perhaps with different perspectives) are available for the same subject, or when a new subject is being synthesized from existing ones. The algorithm for the program has been adapted from Chapter 14 of Forsyth and Rada (1986), in which experiments in augmenting MeSH and SNOMED have been described. Two different merging algorithms have been implemented. The first, called simple-merge, links hierarchies wherever they have terms in common. The second, called complex-merge, adopts a more interesting criterion: it links terms from different hierarchies if they are similar enough. Here, similarity is computed as a function of the number of parent and child terms in common.
Sufficiency is decided based on a preset user specified threshold.

9.6 BRIEF DESCRIPTION OF C PROGRAMS INCLUDED

Three programs are included at the end of this chapter. The first can be used to select terms and to construct phrases. The second generates (or reads) and stores hierarchies. The third program merges different hierarchies.

9.6.1 Program select.c

This program contains a variety of routines for the various selection criteria used in designing the thesaurus vocabulary (see Appendix 9.A). It requires two input files: a direct file and an inverted file. The direct file is a listing of document numbers corresponding to the database of interest. Each document number is associated with a set of term and term-weight pairs. The term-weight represents the strength of association between the document and the term. This term-weight may be assigned manually, or for example be some function of the term's frequency of occurrence in the document. The direct file is arranged such that the rows pertaining to the same document are grouped together. A document may be associated with all its component terms or perhaps only a select few. The inverted file is a listing of terms. Here each term is linked to its associated document numbers and the term-weights. The inverted index is arranged such that rows corresponding to a term are grouped together. The interpretation of term-weights should be the same in both input files. In fact, the two files should contain identical information, but arranged differently. In both files, document numbers are represented by integers, the terms are character strings, and the weights are decimal numbers. One or more spaces may be used to distinguish between the three. Figure 9.2 below shows a brief extract of both files.

Direct file extract          Inverted file extract
1 mellitus 1.0               diabetes 1 2.0
1 diabetes 2.0               logic 2 1.0
2 logic 1.0                  math 2 2.0
2 math 2.0                   math 3 1.0
3 math 1.0                   mellitus 1 1.0

Figure 9.2: Short extracts from both input files

Besides these two files, an output file must also be specified. The user will have to specify four global variables. The first two are MAXWORD, specifying the maximum number of characters in a term, and MAXWDF, specifying the expected maximum frequency of occurrence for a term within a document. The other two parameters are LOW_THRESHOLD and HIGH_THRESHOLD, which are used when partitioning the terms by frequency of occurrence in the collection into HIGH, MEDIUM, and LOW frequency classes.
9.6.2 Program Hierarchy.c

This program can perform two major and separate functions (see Appendix 9.B). First, if given the hierarchical relationships between a set of terms, it records these relationships in its internal inverted file structure. This can then be used for other purposes. For this, the input required is the inverted file, which has the same structure as in Figure 9.2. The second input file is a link file, which is a sequence of rows representing link information. A row consists of a parent term followed by any number of spaces and then a child term. This file is used if the link information is simply provided to the program. Second, the program is also capable of generating the hierarchical structure automatically using Rada's algorithm. For this the user will have to set two parameters: MAXWORD, which specifies the maximum size for a term, and NUMBER-OF-LEVELS, which constrains the size of the generated hierarchy.

9.6.3 Program merge.c

This program contains routines to perform the two types of merging functions described (see Appendix 9.C). Two parameters will have to be set: MAXWORD, described before, and THRESHOLD, which specifies the minimum similarity for use in the complex merge routine. Four input files are required here: an inverted file and a link file for each thesaurus hierarchy. Their formats are as described before.

9.7 CONCLUSION

This chapter began with an introduction to thesauri and a general description of thesaural features. The focus has been on the central issue, which is the construction of thesauri. Two major automatic thesaurus construction methods have been detailed. A few related issues pertinent to thesauri have not been considered here: the evaluation of thesauri, the maintenance of thesauri, and how to automate the usage of thesauri. However, these secondary issues will certainly be important in any realistic situation.
REFERENCES

AITCHISON, J., and A. GILCHRIST. 1972. Thesaurus Construction -- A Practical Manual. London: ASLIB.

BOOKSTEIN, A., and D. SWANSON. 1974. "Probabilistic Models for Automatic Indexing." J. American Society for Information Science, 25(5), 312-18.

CAN, F., and E. OZKARAHAN. 1985. "Concepts of the Cover-Coefficient-Based Clustering Methodology." Paper presented at the Eighth International Conference on Research and Development in Information Retrieval, Association for Computing Machinery. 204-11.

CHOUEKA, Y. 1988. "Looking for Needles in a Haystack OR Locating Interesting Collocational Expressions in Large Textual Databases." Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass. 609-23.

FORSYTH, R., and R. RADA. 1986. Machine Learning -- Applications in Expert Systems and Information Retrieval. West Sussex, England: Ellis Horwood Series in Artificial Intelligence.

FOX, C. 1990. "A Stop List for General Text." SIGIR Forum, 24(1-2) (Fall 1989/Winter 1990), 19-35.

FOX, E. A. 1980. "Lexical Relations: Enhancing Effectiveness of Information Retrieval Systems." SIGIR Newsletter, 15(3).

FOX, E. A., J. T. NUTTER, T. AHLSWEDE, M. EVENS, and J. MARKOWITZ. 1988. "Building a Large Thesaurus for Information Retrieval." Paper presented at the Second Conference on Applied Natural Language Processing, Association for Computational Linguistics. 101-08.

FROST, C. 1987. "Subject Searching in an Online Catalog." Information Technology and Libraries, 6, 60-63.

GUNTZER, U., G. JUTTNER, G. SEEGMULLER, and F. SARRE. 1988. "Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions." Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass. 588-96.

HARTER, S. 1975. "A Probabilistic Approach to Automatic Keyword Indexing. Parts I and II." J. American Society for Information Science, 26, 197-206 and 280-89.

MCGILL, M., et al. 1979. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. Project report. Syracuse, New York: Syracuse University School of Information Studies.

SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SALTON, G., and C. YANG. 1973. "On the Specification of Term Values in Automatic Indexing." Journal of Documentation, 29(4), 351-72.

SOERGEL, D. 1974. "Automatic and Semi-Automatic Methods as an Aid in the Construction of Indexing Languages and Thesauri." Intern. Classif., 1(1), 34-39.

SRINIVASAN, P. 1990. "A Comparison of Two-Poisson, Inverse Document Frequency and Discrimination Value Models of Document Representation." Information Processing and Management, 26(2), 269-78.

WANG, Y-C., J. VANDENDORPE, and M. EVENS. 1985. "Relationship Thesauri in Information Retrieval." J. American Society of Information Science, 15-27.

APPENDIX

/* PROGRAM NAME: hierarky.c
   PURPOSE: This program will generate a hierarchy in two ways.
   1) It can simply read the parent-child links from an input file and
      store the links in the inverted file structure.
   OR
   2) It can use the Rada algorithm, which splits up words into
      different frequency groups and then builds links between them.

   INPUT FILES REQUIRED (depend on the option selected):
   Option 1: requires inverted file and link file.
   Option 2: requires inverted file.

   1) inverted file: sequences of
          term  document number  weight
      (multiple entries for any term should be grouped together)
   2) links file: sequences of
          parent term  child term

   NOTES: Filters such as stop lists and stemmers should be used
   before running this program.

   PARAMETERS TO BE SET BY USER:
   1) MAXWORD: identifies the maximum size of a term
   2) NUMBER_OF_LEVELS: specifies the desired number of levels in the
      thesaurus hierarchy to be generated; used by the Rada algorithm
   COMMAND LINE: hierarky
   (INPUT & OUTPUT FILES ARE SPECIFIED INTERACTIVELY)
***********************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define MAXWORD 20            /* maximum size of a term */
#define NUMBER_OF_LEVELS 10   /* # of levels desired in the thesaurus */

struct doclist {                   /* sequences of document # and weight pairs */
    int doc;                       /* document number */
    float weight;                  /* term weight in document */
    struct doclist *nextdoc;       /* ptr. to next doclist record */
} doclistfile;

struct parentlist {                /* parent terms */
    char term[MAXWORD];            /* parent term */
    struct invert *parent;         /* ptr. to parent term in inverted file */
    struct parentlist *nextparent; /* ptr. to next parentlist record */
} parentfile;

struct childlist {                 /* child terms */
    char term[MAXWORD];            /* child term */
    struct invert *child;          /* ptr. to child term in inverted file */
    struct childlist *nextchild;   /* ptr. to next childlist record */
} childfile;

struct invert {                    /* inverted file */
    char term[MAXWORD];            /* term */
    int level;                     /* thesaurus level based on term frequency */
    struct doclist *doc;           /* sequences of document # and weight */
    struct parentlist *parents;    /* ptr. to parent terms */
    struct childlist *children;    /* ptr. to child terms */
    struct invert *nextterm;       /* ptr. to next invert record */
} invfile;

static struct invert *startinv;    /* ptr. to first record in inverted file */
static struct invert *lastinv;     /* ptr. to last record in inverted file */
static struct doclist *lastdoc;    /* ptr. to last document in doclist */
static char currentterm[MAXWORD];  /* tracks current term in inverted file */
static int Number_of_docs;         /* total # of documents, which is computed */

/* these 4 functions will obtain memory for records; the type of record
   is indicated by the name of the function */
static struct invert *get_mem_invert ( );
static struct doclist *get_mem_doclist ( );
static struct parentlist *get_mem_parentlist ( );
static struct childlist *get_mem_childlist ( );

static FILE *input;                /* inverted file */
static FILE *input1;               /* link file */
static FILE *output;               /* holds any output */

static float cohesion ( );         /* compute cohesion between two terms */
static float total_wdf ( );        /* compute total frequency of term in dbse. */
static float get_freq_range ( );
static void read_invfile ( );      /* read in the inverted file */
static void read_links ( );        /* read in the links file */
static void add_link ( );          /* called within read_links ( ) */
static void add_invert ( );        /* called within read_invfile ( ) */
static void pr_invert ( );         /* print the inverted file */
static void write_levels ( );      /* initialize the levels information */
static void generate_Rada_hierarchy ( );  /* generate the Rada hierarchy */
static void get_term_data ( );     /* get basic information about terms */
static struct invert *find_term ( );  /* searches for term in inverted file */

int main (argc)
int argc;
{
    char ch;
    char fname[128];

    currentterm[0] = '\0';
    startinv = NULL;
    lastinv = NULL;
    lastdoc = NULL;
    Number_of_docs = 0;

    if (argc > 1) {
        (void) printf ("There is an error in the command line\n");
        (void) printf ("Correct usage is:\n");
        (void) printf ("hierarky\n");
        exit (1);
    }

    (void) printf ("\nMake a selection\n");
    (void) printf ("To simply read links from a link file enter 1\n");
    (void) printf ("To use Rada's algorithm to generate links enter 2\n");
    (void) printf ("To quit enter 3\n");
    (void) printf ("Enter selection: ");
    ch = getchar ( );

    switch (ch) {
    case '1':
        (void) printf ("\nEnter name of inverted file: ");
        (void) scanf ("%s", fname);
        if ( (input = fopen (fname, "r")) == NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        (void) printf ("Enter name of link file: ");
        (void) scanf ("%s", fname);
        if ( (input1 = fopen (fname, "r")) == NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        (void) printf ("Enter name of output file: ");
        (void) scanf ("%s", fname);
        if ( (output = fopen (fname, "w")) == NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        read_invfile ( );
        (void) fprintf (output, "\nINVERTED FILE\n\n");
        pr_invert ( );
        read_links ( );
        (void) fprintf (output, "\nINVERTED FILE WITH LINK INFORMATION\n\n");
        pr_invert ( );
        (void) fclose (input);
        (void) fclose (input1);
        (void) fclose (output);
        break;

    case '2':
        (void) printf ("\nEnter name of inverted file: ");
        (void) scanf ("%s", fname);
        if ( (input = fopen (fname, "r")) == NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        (void) printf ("Enter name of output file: ");
        (void) scanf ("%s", fname);
        if ( (output = fopen (fname, "w")) == NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        read_invfile ( );
        (void) fprintf (output, "\nINVERTED FILE\n\n");
        pr_invert ( );
        generate_Rada_hierarchy ( );
        (void) fprintf (output, "\nINVERTED FILE AFTER GENERATING RADA HIERARCHY\n\n");
        pr_invert ( );
        (void) fclose (input);
        (void) fclose (output);
        break;

    case '3':
        exit (0);
    }
    return (0);
}

/***********************************************************************
    read_invfile ( )
    Returns:  void
    Purpose:  Read in the inverted file entries from the disk file
**/
static void read_invfile ( )
{
    int docid;               /* holds current document number */
    char temp[MAXWORD];      /* holds current term */
    float weight;            /* holds current term weight */
    struct doclist *p;       /* structure to store doc#-weight pair */

    (void) fscanf (input, "%s%d%f", temp, &docid, &weight);  /* read next line */
    while (strlen (temp) > 0)       /* while it has found a legitimate term */
    {
        if (!strncmp (currentterm, temp, MAXWORD))
        {
            /* if this term has previously been entered in inverted file then */
            /* only need to attach a doclist record to the same entry */
            p = get_mem_doclist ( );     /* get memory for doclist record */
            p->doc = docid;              /* assign document number */
            p->weight = weight;          /* assign term weight */
            p->nextdoc = NULL;
            if (lastdoc) lastdoc->nextdoc = p;  /* connect p to the doclist chain */
            lastdoc = p;                        /* set this global variable */
        }
        /* else term is a brand new term & need to make a new inverted file entry */
        else add_invert (docid, temp, weight);
        temp[0] = '\0';
        (void) fscanf (input, "%s%d%f", temp, &docid, &weight);  /* read next line */
    }
}

/***********************************************************************
    add_invert (docid, temp, weight)
    Returns:  void
    Purpose:  Start a new entry in the inverted file for this term if it
              does not already exist.  It is called in the read_invfile
              function when a new term is read from the input file.
**/
static void add_invert (docid, temp, weight)
int docid;               /* in: document number */
char temp[MAXWORD];      /* in: new index term */
float weight;            /* in: index term weight */
{
    struct invert *p;    /* p will get attached to inverted file */

    p = get_mem_invert ( );                    /* get memory for p */
    (void) strncpy (p->term, temp, MAXWORD);   /* copy over the term */
    p->parents = NULL;        /* to begin, this term has no parent terms */
    p->children = NULL;       /* also no child terms */
    p->doc = get_mem_doclist ( );   /* start a doclist structure */
    p->doc->doc = docid;            /* assign document number */
    p->doc->weight = weight;        /* assign term weight */
    p->doc->nextdoc = NULL;
    p->nextterm = NULL;
    lastdoc = p->doc;               /* update ptr. to last document */
    /* if this is the first entry in inverted file, then update global variable */
    if (startinv == NULL) startinv = p;
    if (lastinv) lastinv->nextterm = p;  /* update ptr. to last inverted file record */
    lastinv = p;                         /* p gets attached to the inverted file */
    (void) strncpy (currentterm, temp, MAXWORD);  /* update global variable to */
}                                                 /* the new term just added */

/**********************************************************************
    read_links ( )
    Returns:  void
    Purpose:  Read parent-child link information from a file and record
              the links in the inverted file record structure
**/
static void read_links ( )
{
    char parent[MAXWORD];    /* tracks parent term */
    char child[MAXWORD];     /* tracks child term */

    parent[0] = '\0';
    child[0] = '\0';
    (void) fscanf (input1, "%s%s", parent, child);      /* read input line */
    while (strlen (parent) > 0)  /* while a legitimate parent has been found */
    {
        add_link (parent, child);  /* this function will add the appropriate links */
        /* now throw out the old parent & child */
        (void) fscanf (input1, "%s%s", parent, child);  /* read next input line */
    }
}

/***********************************************************************
    add_link (parent, child)
    Returns:  void.  Used within read_links.
    Purpose:  Basically, for each parent-child link specified, it adds
              the appropriate link information into the inverted file.
    Notes:    If a term in the link file is not in the inverted file
              then the program will give a suitable message and exit.
**/
static void add_link (parent, child)
char parent[MAXWORD];    /* in: holds the parent term */
char child[MAXWORD];     /* in: holds the child term */
{
    struct invert *p;    /* holds add. of parent term in inv. file */
    struct invert *q;    /* holds add. of child term in inv. file */
    struct parentlist *new_parent;  /* structure used to store parent info. */
    struct childlist *new_child;    /* structure used to store child info. */

    p = find_term (parent);          /* find address of parent term */
    if (!p) {
        printf ("\nPlease check the input files.\n");
        printf ("\nParent term %s is not in the inverted file\n", parent);
        exit (0);
    }
    q = find_term (child);           /* find address of child term */
    if (!q) {
        printf ("\nPlease check the input files. Output may be incorrect.\n");
        printf ("\nChild term %s is not in the inverted file\n", child);
        exit (0);
    }

    /* first add parent links for given child */
    new_parent = get_mem_parentlist ( );  /* get memory for parentlist record */
    (void) strncpy (new_parent->term, parent, MAXWORD);  /* copy over parent term */
    new_parent->parent = p;  /* store address of parent term in inverted file */
    if (q->parents == NULL) {
        /* i.e., no parents listed for given child yet */
        q->parents = new_parent;         /* first parent link made */
        new_parent->nextparent = NULL;
    }
    else {   /* at least 1 parent already listed for given child */
        new_parent->nextparent = q->parents;  /* attach new_parent to front of list */
        q->parents = new_parent;
    }

    /* next add child links for given parent */
    new_child = get_mem_childlist ( );    /* get memory for childlist record */
    (void) strncpy (new_child->term, child, MAXWORD);  /* copy over child term */
    new_child->child = q;    /* store address of child term in inverted file */
    if (p->children == NULL) {
        /* i.e., no children listed for given parent yet */
        p->children = new_child;         /* first child link made */
        new_child->nextchild = NULL;
    }
    else {   /* at least 1 child already listed for given parent */
        new_child->nextchild = p->children;   /* attach new_child to front of list */
        p->children = new_child;
    }
}

/***********************************************************************
    pr_invert ( )
    Returns:  void
    Purpose:  Print the inverted file.  It prints each term, its
              associated document numbers, term-weights and parent and
              child terms.
**/
static void pr_invert ( )
{
    struct invert *inv_addr;         /* tracks address of current inv. file record */
    struct doclist *doc_addr;        /* tracks address of current doclist record */
    struct parentlist *parent_addr;  /* tracks address of current parentlist record */
    struct childlist *child_addr;    /* tracks address of current childlist record */

    inv_addr = startinv;             /* begin at top of inverted file */
    while (inv_addr) {               /* while a legitimate term . . . */
        (void) fprintf (output, "TERM: %s\nPARENT TERMS: ", inv_addr->term);
        parent_addr = inv_addr->parents;   /* find addr. of first parent */
        while (parent_addr) {              /* printing all parents */
            (void) fprintf (output, "%s ", parent_addr->term);
            parent_addr = parent_addr->nextparent;  /* loop through remaining parents */
        }
        (void) fprintf (output, "\nCHILD TERMS: ");
        child_addr = inv_addr->children;   /* find addr. of first child */
        while (child_addr) {               /* printing all children */
            (void) fprintf (output, "%s ", child_addr->term);
            child_addr = child_addr->nextchild;  /* loop through remaining children */
        }
        (void) fprintf (output, "\n\n DOCUMENT NUMBER               TERM WEIGHT\n");
        doc_addr = inv_addr->doc;   /* find addr. of first associated doc. */
        while (doc_addr) {          /* print all docs. and term weights */
            (void) fprintf (output, " %-30d ", doc_addr->doc);
            (void) fprintf (output, "%-10.5f\n", doc_addr->weight);
            doc_addr = doc_addr->nextdoc;  /* loop through remaining documents */
        }
        (void) fprintf (output, "\n");
        inv_addr = inv_addr->nextterm;     /* go to next inverted file entry */
    }
}

/***********************************************************************
    total_wdf (term)
    Returns:  float
    Purpose:  Compute total within document frequency for specified
              term in the database
**/
static float total_wdf (term)
char term[MAXWORD];     /* in: term for which total_wdf is required */
{
    struct invert *term_addr;  /* add. of above term in inverted file */
    struct doclist *doc_ad;    /* tracks add. of associated doclist record */
    float totalwdf;            /* tracks total wdf */

    totalwdf = 0.0;
    term_addr = find_term (term);  /* obtain address of the term in inv. file */
    if (term_addr) {               /* if term was found */
        doc_ad = term_addr->doc;   /* find address of associated doclist record */
        while (doc_ad) {
            totalwdf = totalwdf + doc_ad->weight;  /* loop through doclist records */
            doc_ad = doc_ad->nextdoc;              /* to compute the total weight */
        }
    }
    else (void) fprintf (output,
        "Term %s is not in the inverted file. Could lead to problems\n", term);
    return (totalwdf);
}

/***********************************************************************
    get_freq_range (minimum, maximum)
    Returns:  float
    Purpose:  Compute the difference between the maximum total term
              frequency and the minimum total term frequency observed
              in the inverted file.
**/
static float get_freq_range (minimum, maximum)
float *minimum;     /* out: returns minimum totalwdf */
float *maximum;     /* out: returns maximum totalwdf */
{
    struct invert *inv_addr;   /* tracks current inverted file record */
    float freq;
    float min, max;

    inv_addr = startinv;       /* begin at top of inverted file */
    /* initialize min and max to equal frequency of 1st term in file */
    if (inv_addr) {
        freq = total_wdf (inv_addr->term);
        min = freq;
        max = freq;
        inv_addr = inv_addr->nextterm;  /* go to next term in inv. file */
    }
    while (inv_addr)
    {   /* while a legitimate term, compare with max and min */
        freq = total_wdf (inv_addr->term);
        if (freq < min) min = freq;
        if (freq > max) max = freq;
        inv_addr = inv_addr->nextterm;  /* go to next term in inv. file */
    }
    *minimum = min;
    *maximum = max;
    return (max - min);        /* returning the difference */
}

/***********************************************************************
    write_levels ( )
    Returns:  void
    Purpose:  Write the level numbers for each term into the inverted
              file depending on the total wdf of the term in the
              database and the user selected parameter
              NUMBER_OF_LEVELS.  The level numbers are marked 0, 1, 2,
              etc.  Level 0
              refers to the highest frequency class, level 1 the next
              frequency class, etc.
**/
static void write_levels ( )
{
    int i;                    /* counter through the different levels */
    int number;               /* holds NUMBER_OF_LEVELS */
    float freq;               /* holds frequency of term in database */
    float range;              /* holds diff. between highest & lowest freqs. */
    float low;                /* lowest term frequency in database */
    float high;               /* highest term frequency in database */
    float current_low;        /* tracks lower frequency of current level */
    float current_high;       /* tracks higher frequency of current level */
    struct invert *inv_addr;  /* tracks current inverted file record */

    /* range holds the difference between highest & lowest totalwdf in dbse. */
    range = get_freq_range (&low, &high);
    number = NUMBER_OF_LEVELS;   /* user specified global parameter */

    inv_addr = startinv;   /* start with the first term in inverted file */
    while (inv_addr) {     /* while a legitimate term was found */
        freq = total_wdf (inv_addr->term);
        current_low = low;
        for (i = (number-1); i >= 0; i--) {  /* loop through the frequency levels */
            if (i == 0) current_high = high;
            else current_high = current_low + (range/number);
            /* if the term's frequency is within this narrow range, then level = i */
            if ((freq >= current_low) && (freq <= current_high)) {
                inv_addr->level = i;
                break;
            }
            current_low = current_high;
        }   /* ending for loop */
        inv_addr = inv_addr->nextterm;  /* loop through other inv. file terms */
    }
}

/***********************************************************************
    generate_Rada_hierarchy ( )
    Returns:  void
    Purpose:  Create the levelptrs data structure and generate the
              hierarchy according to Rada's algorithm
**/
static void generate_Rada_hierarchy ( )
{
    struct termlist {
        struct invert *term;        /* pointer to term in inverted file */
        int mark;                   /* equals 1 if term is propagated else 0 */
        struct termlist *nextterm;  /* pointer to next termlist record */
    } termlistfile;
    /* levelptrs is an array of pointers; each slot points to the start
       of the chain of termlist records for that level */
    struct termlist *levelptrs[NUMBER_OF_LEVELS];
    struct invert *inv_addr;   /* tracks current term in inverted file */
    struct termlist *p, *q, *r;
    int i;
    float coh, max_cohesion;

    write_levels ( );   /* this routine computes and writes the level
                           number for each term in the inverted file */

    /* initializing the array */
    for (i = 0; i < NUMBER_OF_LEVELS; i++) levelptrs[i] = NULL;

    /* now create the termlist chain for each level */
    inv_addr = startinv;      /* start with first term in inverted file */
    while (inv_addr) {        /* while there is a term there */
        /* get memory for termlist */
        p = (struct termlist *) malloc (sizeof (termlistfile));
        if (!p) {
            (void) fprintf (output, "\nout of memory\n");
            exit (1);
        }
        p->term = inv_addr;   /* assign the address of term in inverted file */
        p->mark = 0;          /* initially term not linked */
        /* Note: this term has been assigned to a frequency level already.
           Now, if this is the first term read for this level then set the
           appropriate levelptrs entry to point to this term */
        if (levelptrs[inv_addr->level] == NULL) {
            levelptrs[inv_addr->level] = p;
            p->nextterm = NULL;
        }
        else {
            /* now this is not the first term encountered for this level, */
            /* so simply attach it to the front of the chain */
            p->nextterm = levelptrs[inv_addr->level];
            levelptrs[inv_addr->level] = p;
        }
        inv_addr = inv_addr->nextterm;  /* process next inverted file term */
    }   /* end while */

    /* start with each level and compute max-cohesion with previous level */
    for (i = 1; i < NUMBER_OF_LEVELS; i++) {
        p = levelptrs[i];
        while (p) {
            max_cohesion = 0.0;
            q = levelptrs[i-1];  /* q set to the previous level's first term */
            while (q) {  /* as long as there are terms in this previous level */
                coh = cohesion (p->term->term, q->term->term);
                if (coh > max_cohesion) max_cohesion = coh;
                q = q->nextterm;
            }
            /* max_cohesion for terms in p has been computed */
            /* adding parent-child links and marking parents as propagated */
            q = levelptrs[i-1];
            while (q && max_cohesion > 0.0) {
                coh = cohesion (p->term->term, q->term->term);
                if (coh == max_cohesion) {  /* this ensures multiple links possible */
                    /* routine adds the actual link */
                    add_link (q->term->term, p->term->term);
                    q->mark = 1;  /* to show that parent term has been linked */
                }
                q = q->nextterm;
            }   /* end while (q) */
            p = p->nextterm;  /* go to next term in level i */
        }   /* end while (p) */

        /* checking all terms in level[i-1] to make sure they have propagated */
        q = levelptrs[i-1];
        while (q) {
            if (q->mark == 0) {  /* i.e., term has no child in next level */
                q->mark = 1;
                r = (struct termlist *) malloc (sizeof (termlistfile));
                if (!r) {
                    (void) fprintf (output, "\nout of memory\n");
                    exit (2);
                }
                r->term = q->term;  /* making a copy of term as its dummy child */
                r->mark = 0;
                /* inserting r at beginning of level i chain */
                r->nextterm = levelptrs[i];
                levelptrs[i] = r;
            }
            q = q->nextterm;
        }
    }   /* for */
}

/***********************************************************************
    cohesion (term1, term2)
    Returns:  float
    Purpose:  Compute the cohesion between two terms
/***********************************************************************
cohesion(term1, term2)

Returns:   float
Purpose:   Compute the cohesion between two terms
**/
static float cohesion(term1, term2)
char term1[MAXWORD],           /* in: the two terms which are being  */
     term2[MAXWORD];           /* in: compared to determine cohesion */
{
   float l1,                   /* holds # of documents associated with term 1 */
         l2,                   /* holds # of documents associated with term 2 */
         common;               /* holds # of documents in common              */

   get_term_data(term1, term2, &l1, &l2, &common);
   return(common/(sqrt(l1 * l2)));
}

/***********************************************************************
get_term_data(term1, term2, l1, l2, common)

Returns:   void
Purpose:   Given two terms, determine the number of documents
           associated with each and the number of documents they
           have in common.
**/
static void get_term_data(term1, term2, l1, l2, common)
char term1[MAXWORD],           /* in: term 1 */
     term2[MAXWORD];           /* in: term 2 */
float *l1,                     /* out: # of documents associated with term 1 */
      *l2,                     /* out: # of documents associated with term 2 */
      *common;                 /* out: # of documents associated with both   */
{
   struct invert *p, *q;       /* holds addresses of both terms in inv. file */
   struct doclist *doc_ad1, *doc_ad2;  /* track addresses of doclist records */
   int count1,                 /* # of documents associated with term 1 */
       count2,                 /* # of documents associated with term 2 */
       com;                    /* # of documents in common              */

   p = find_term(term1);       /* find addresses of terms in inverted file */
   q = find_term(term2);
   doc_ad1 = p->doc;           /* start with doclist record for term 1 */
   count1 = 0;                 /* initialize */
   count2 = 0;
   com = 0;
   /* first get the length for term 1 and the number of common documents */
   while (doc_ad1) {           /* loop through all docs. of term 1 */
      count1 = count1 + 1;
      doc_ad2 = q->doc;
      while (doc_ad2) {
         if (doc_ad1->doc == doc_ad2->doc) {  /* if they are the same doc. # */
            com = com + 1;
            break;
         }
         doc_ad2 = doc_ad2->nextdoc;
      }
      doc_ad1 = doc_ad1->nextdoc;
   }
   /* now get the length for term 2 */
   doc_ad2 = q->doc;
   while (doc_ad2) {
      count2 = count2 + 1;
      doc_ad2 = doc_ad2->nextdoc;
   }
   *l1 = count1;
   *l2 = count2;
   *common = com;
}

/***********************************************************************
*find_term(term)

Returns:   address of a struct invert record
Purpose:   Search for a specified term in the inverted file and
           return the address of the corresponding inverted file
           record.
**/
static struct invert *find_term(term)
char term[MAXWORD];            /* in: term to be located in inverted file */
{
   struct invert *inv_addr;    /* tracks addr. of current rec. in inv. file */

   inv_addr = startinv;        /* begin at top of inv. file */
   while (inv_addr) {
      if (!strcmp(term, inv_addr->term))
         return(inv_addr);
      inv_addr = inv_addr->nextterm;
   }
   (void) fprintf(output, "Term %s not found\n", term);
   return(NULL);
}

/***********************************************************************
*get_mem_invert( )

Returns:   address of a struct invert record
Purpose:   dynamically obtain enough memory to store 1 invert record
**/
static struct invert *get_mem_invert( )
{
   struct invert *record;

   record = (struct invert *)malloc(sizeof(invfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      return(NULL);
   }
   return(record);
}

/***********************************************************************
*get_mem_doclist( )

Returns:   address of a struct doclist record
Purpose:   dynamically obtain enough memory to store 1 doclist record
**/
static struct doclist *get_mem_doclist( )
{
   struct doclist *record;

   record = (struct doclist *)malloc(sizeof(doclistfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      return(NULL);
   }
   return(record);
}

/***********************************************************************
*get_mem_parentlist( )

Returns:   address of a struct parentlist record
Purpose:   dynamically obtain enough memory to store 1 parentlist record
**/
static struct parentlist *get_mem_parentlist( )
{
   struct parentlist *record;

   record = (struct parentlist *)malloc(sizeof(parentfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      return(NULL);
   }
   return(record);
}

/***********************************************************************
*get_mem_childlist( )

Returns:   address of a struct childlist record
Purpose:   dynamically obtain enough memory to store 1 childlist record
**/
static struct childlist *get_mem_childlist( )
{
   struct childlist *record;

   record = (struct childlist *)malloc(sizeof(childfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      return(NULL);
   }
   return(record);
}

/* PROGRAM NAME:  select.c

   PURPOSE:  1) Compute Discrimination Value of terms.
             2) Compute Poisson Distributions for terms.
             3) Partition terms by their total within document
                frequencies.
             4) Compute cohesion between pairs of terms.
             5) Compute Dice's coefficient of similarity between
                two terms.

   INPUT FILES REQUIRED:
             1) a direct file: sequences of
                   document#  term  weight
                (multiple entries for any document should be
                grouped together)
             2) an inverted file: sequences of
                   term  document#  weight
                (multiple entries for any term should be grouped
                together)

   NOTES:    Filters such as stop lists and stemmers should be
             used before running this program.

   PARAMETERS TO BE SET BY USER:
             1) MAXWORD - maximum size of a term
             2) MAXWDF - maximum value expected for the within
                document frequency for a term in the collection
             3) LOW_THRESHOLD - threshold for LOW and MID
                frequency ranges
             4) HIGH_THRESHOLD - threshold for MID and HIGH
                frequency ranges

   COMMAND LINE:  select direct_file inverted_file output_file
***********************************************************************/

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>            /* for malloc( ) and exit( ) */

#define MAXWORD 20             /* maximum size of a term */
#define MAXWDF 30              /* maximum WDF for a word in a database */
#define LOW_THRESHOLD 2.0
#define HIGH_THRESHOLD 4.0

struct termlist {              /* sequences of term and weight pairs */
   char term[MAXWORD];         /* term */
   float weight;               /* term weight in document */
   struct termlist *nextterm;  /* ptr. to next termlist record */
} termlistfile;

struct doclist {               /* sequences of document # and weight pairs */
   int doc;                    /* document number */
   float weight;               /* term weight in document */
   struct doclist *nextdoc;    /* ptr. to next doclist record */
} doclistfile;

struct direct {                /* direct file: document to list of terms */
   int docnum;                 /* document # */
   struct termlist *terms;     /* sequences of term and weight pairs */
   struct direct *next;        /* ptr. to next direct record */
} directfile;

struct invert {                /* inverted file: term to list of documents */
   char term[MAXWORD];         /* term */
   struct doclist *doc;        /* sequences of document # and weight pairs */
   struct invert *next;        /* ptr. to next invert record */
} invfile;

static struct direct *startdir;        /* ptr. to first record in direct file */
static struct direct *lastdir;         /* ptr. to last record in direct file */
static struct invert *startinv;        /* ptr. to first record in inverted file */
static struct invert *lastinv;         /* ptr. to last record in inverted file */
static struct doclist *lastdoc;        /* ptr. to last document in doclist */
static struct termlist *lastterm;      /* ptr. to last term in termlist */
static struct direct *start_centroid;  /* ptr. to centroid record */
static FILE *input;                    /* direct file */
static FILE *input1;                   /* inverted file */
static FILE *output;                   /* file to hold all output */
static char currentterm[MAXWORD];      /* tracks current term in inverted file */
static int currentdoc;                 /* tracks current document in direct file */
static int Number_of_docs;             /* total # of documents, which is computed */

static float av_doc_similarity( );     /* compute average doc. similarity in dbse. */
static float cohesion( );              /* compute cohesion between two terms */
static float factorial( );             /* compute factorial of a number */
static float cosine( );                /* compute cosine between two terms */
static float dice( );                  /* compute dice between two terms */
static float total_wdf( );             /* compute total frequency of term in dbse. */
static void initialize( );             /* initialize files and global variables */
static void read_directfile( );        /* read in the direct file */
static void add_direct( );             /* called within read_directfile( ) */
static void pr_direct( );              /* print the direct file */
static void read_invfile( );           /* read in the inverted file */
static void add_invert( );             /* called within read_invfile( ) */
static void pr_invert( );              /* print the inverted file */
static void centroid( );               /* compute the document centroid for dbse. */
static void pr_centroid( );            /* print the document centroid */
static void get_Poisson_dist( );       /* compute Poisson distributions for terms */
static void Partition_terms( );        /* partition terms by frequency */
static void dv_all( );                 /* compute discrimination value of terms */
static void get_doc_data( );           /* get basic info. about documents */
static void get_term_data( );          /* get basic info. about terms */
static struct invert *find_term( );    /* searches for term in inverted file */
                                       /* and returns its address */
static struct direct *get_mem_direct( );      /* these 4 get_mem functions are */
static struct termlist *get_mem_termlist( );  /* used to obtain memory for a   */
static struct doclist *get_mem_doclist( );    /* record. The record type is    */
static struct invert *get_mem_invert( );      /* obvious from the name         */

int main(argc, argv)
int argc;
char *argv[ ];
{
   char ch;
   char word1[MAXWORD], word2[MAXWORD];

   if (argc != 4) {
      (void) printf("There is an error in the command line\n");
      (void) printf("Correct usage is\n");
      (void) printf("select direct_file inverted_file output_file\n");
      exit(1);
   }
   initialize(argv);
   (void) fprintf(output, "\nREADING IN DIRECT FILE\n");
   read_directfile( );
   (void) fprintf(output, "\nPRINTING DIRECT FILE\n\n");
   pr_direct( );
   (void) fprintf(output, "\nNUMBER OF DOCUMENTS IS: %d\n\n", Number_of_docs);
   (void) fprintf(output, "\nREADING IN INVERTED FILE\n");
   read_invfile( );
   (void) fprintf(output, "\nPRINTING INVERTED FILE\n");
   pr_invert( );
   (void) printf("\nPlease make a selection\n\n");
   (void) printf("To compute DV for all terms enter 1\n");
   (void) printf("To compute Poisson distributions enter 2\n");
   (void) printf("To partition terms by frequency enter 3\n");
   (void) printf("To compute cohesion between two terms (for phrase construction) enter 4\n");
   (void) printf("To compute Dice's coefficient between two terms enter 5\n");
   (void) printf("To quit enter 6\n\n");
   (void) printf("Enter your choice: ");
   ch = getchar( );
   switch(ch) {
   case '1':
      centroid( );
      (void) fprintf(output, "\nCENTROID\n\n");
      pr_centroid( );
      (void) fprintf(output, "\nDISCRIMINATION VALUES FOR ALL TERMS\n\n");
      dv_all( );
      break;
   case '2':
      (void) fprintf(output, "\nACTUAL AND POISSON DISTRIBUTIONS OF WITHIN DOCUMENT FREQUENCIES FOR ALL TERMS\n\n");
      (void) fprintf(output, "WDF = Within Document Frequency & #docs = Number of documents\n\n");
      get_Poisson_dist( );
      break;
   case '3':
      (void) printf("Make sure that the threshold parameters are set correctly in the program\n");
      (void) fprintf(output, "\nPARTITIONING THE TERMS INTO LOW, MEDIUM, HIGH FREQUENCY CLASSES\n\n");
      Partition_terms( );
      break;
   case '4':
      (void) printf("enter first word: ");
      (void) scanf("%s", word1);
      if (find_term(word1) == NULL) {
         printf("sorry, %s is not in the inverted file\n", word1);
         break;
      }
      (void) printf("enter second word: ");
      (void) scanf("%s", word2);
      if (find_term(word2) == NULL) {
         printf("sorry, %s is not in the inverted file\n", word2);
         break;
      }
      (void) fprintf(output, "Cohesion between %s and %s is %f\n",
                     word1, word2, cohesion(word1, word2));
      break;
   case '5':
      (void) printf("enter first word: ");
      (void) scanf("%s", word1);
      if (find_term(word1) == NULL) {
         printf("sorry, %s is not in the inverted file\n", word1);
         break;
      }
      (void) printf("enter second word: ");
      (void) scanf("%s", word2);
      if (find_term(word2) == NULL) {
         printf("sorry, %s is not in the inverted file\n", word2);
         break;
      }
      (void) fprintf(output, "Dice's coefficient between %s and %s is %f\n",
                     word1, word2, dice(word1, word2));
      break;
   case '6':
      exit(0);
      break;
   default:
      (void) printf("no selection made\n");
      break;
   }
   (void) fclose(input);
   (void) fclose(input1);
   (void) fclose(output);
   return(0);
}

/***********************************************************************
initialize(argv)

Returns:   void
Purpose:   Open all required files and initialize global variables
**/
static void initialize(argv)
char *argv[ ];   /* in: holds the three parameters input at the command line */
{
   if ((input = fopen(argv[1], "r")) == NULL) {       /* input direct file */
      (void) printf("couldn't open file %s\n", argv[1]);
      exit(1);
   }
   if ((input1 = fopen(argv[2], "r")) == NULL) {      /* input inverted file */
      (void) printf("couldn't open file %s\n", argv[2]);
      exit(1);
   }
   if ((output = fopen(argv[3], "w")) == NULL) {      /* output file */
      (void) printf("couldn't open file %s for output\n", argv[3]);
      exit(1);
   }
   /* set initial values of global variables */
   startinv = NULL;
   lastinv = NULL;
   startdir = NULL;
   lastdir = NULL;
   lastdoc = NULL;
   lastterm = NULL;
   start_centroid = NULL;
   currentterm[0] = '\0';
   currentdoc = 0;
   Number_of_docs = 0;
}
/***********************************************************************
read_directfile( )

Returns:   void
Purpose:   Read in the direct file entries from the 1st input file
**/
static void read_directfile( )
{
   int docid;                  /* holds the current document number */
   char temp[MAXWORD];         /* holds the current term */
   float weight;               /* holds the current term weight */
   struct termlist *p;         /* structure to store the term-weight pair */

   (void) fscanf(input, "%d%s%f", &docid, temp, &weight);
   while (docid > 0) {         /* while it has found a legitimate document number */
      if (docid == currentdoc) {
         /* if this document number has previously been entered in direct file */
         /* then only need to attach a termlist record to the same entry */
         p = get_mem_termlist( );                  /* get memory for a termlist record */
         (void) strncpy(p->term, temp, MAXWORD);   /* copy the new word over */
         p->weight = weight;                       /* assign the new weight over */
         p->nextterm = NULL;
         if (lastterm)
            lastterm->nextterm = p;   /* connect p to the termlist */
                                      /* chain for this document   */
         lastterm = p;                /* set this global variable  */
      }
      else {                          /* else docid represents a new document */
         Number_of_docs = Number_of_docs + 1;   /* increment global variable */
         add_direct(docid, temp, weight);       /* starts a brand new entry in */
                                                /* the direct file */
      }
      docid = 0;
      (void) fscanf(input, "%d%s%f", &docid, temp, &weight);   /* read the next line */
   }
}

/***********************************************************************
add_direct(docid, temp, weight)

Returns:   void
Purpose:   Start a new entry in the direct file. It is called in
           the read_directfile function when a new document number
           is read from the input file.
**/
static void add_direct(docid, temp, weight)
int docid;                     /* in: new document number */
char temp[MAXWORD];            /* in: index term */
float weight;                  /* in: index term weight */
{
   struct direct *p;           /* structure p will be attached to direct file */

   p = get_mem_direct( );      /* get memory for p */
   p->docnum = docid;          /* assign the document number */
   p->terms = get_mem_termlist( );                 /* get memory for termlist structure */
   (void) strncpy(p->terms->term, temp, MAXWORD);  /* assign index term to it */
   p->terms->weight = weight;                      /* assign term weight to it */
   p->terms->nextterm = NULL;                      /* current end of termlist */
   p->next = NULL;                                 /* current end of direct file */
   lastterm = p->terms;                            /* update pointer to last term */
   /* if this is the very first document then the global variable pointing
      to the start of the direct file should be updated */
   if (startdir == NULL) startdir = p;
   if (lastdir) lastdir->next = p;   /* update pointer to last direct file rec. */
   lastdir = p;
   currentdoc = docid;         /* update the global variable currentdoc to the */
                               /* document number just added */
}
/***********************************************************************
pr_direct( )

Returns:   void
Purpose:   Print the direct file. It prints sequences of
           document#  term  weight.
**/
static void pr_direct( )
{
   struct direct *dir_addr;    /* tracks address of current direct file record */
   struct termlist *term_addr; /* tracks address of current termlist record */

   dir_addr = startdir;        /* start with beginning of direct file */
   while (dir_addr) {          /* check for legitimate direct file record */
      (void) fprintf(output, "DOCUMENT NUMBER: %d \n", dir_addr->docnum);
      (void) fprintf(output, " TERM                           TERM WEIGHT\n");
      term_addr = dir_addr->terms;       /* get addr. of first term */
      while (term_addr) {                /* loop through all the terms */
         (void) fprintf(output, " %-30s ", term_addr->term);
         (void) fprintf(output, "%-10.3f\n", term_addr->weight);
         term_addr = term_addr->nextterm;   /* go to next term for the doc. */
      }
      (void) fprintf(output, "\n");
      dir_addr = dir_addr->next;         /* go to next direct file record */
   }
}

/***********************************************************************
read_invfile( )

Returns:   void
Purpose:   Read in the inverted file entries from 2nd input file
**/
static void read_invfile( )
{
   int docid;                  /* holds current document number */
   char temp[MAXWORD];         /* holds current term */
   float weight;               /* holds current term weight */
   struct doclist *p;          /* structure to store doc#-weight pair */

   (void) fscanf(input1, "%s%d%f", temp, &docid, &weight);   /* read next line */
   while (strlen(temp) > 0) {  /* while it has found a legitimate term */
      if (!strncmp(currentterm, temp, MAXWORD)) {
         /* if this term has previously been entered in inverted file */
         /* then only need to attach a doclist record to same term entry */
         p = get_mem_doclist( );   /* get memory for doclist record */
         p->doc = docid;           /* assign document number */
         p->weight = weight;       /* assign weight */
         p->nextdoc = NULL;
         if (lastdoc)
            lastdoc->nextdoc = p;  /* connect p to the doclist */
                                   /* chain for this term      */
         lastdoc = p;              /* set this global variable */
      }
      else                         /* else term is a brand new term & need to */
         add_invert(docid, temp, weight);   /* make a new inverted file entry */
      temp[0] = '\0';
      (void) fscanf(input1, "%s%d%f", temp, &docid, &weight);  /* read next line */
   }
}

/***********************************************************************
add_invert(docid, temp, weight)

Returns:   void
Purpose:   Start a new entry in the inverted file. It is called in
           the read_invfile function when a new term is read from
           the input file.
**/
static void add_invert(docid, temp, weight)
int docid;                     /* in: document number */
char temp[MAXWORD];            /* in: new index term */
float weight;                  /* in: index term weight */
{
   struct invert *p;           /* structure p will be attached to inverted file */

   p = get_mem_invert( );      /* get memory for p */
   (void) strncpy(p->term, temp, MAXWORD);   /* copy over the term */
   p->doc = get_mem_doclist( );              /* start a doclist structure */
   p->doc->doc = docid;                      /* assign document number */
   p->doc->weight = weight;                  /* assign term weight */
   p->doc->nextdoc = NULL;
   p->next = NULL;
   /* if this is the first entry in inverted file, then update global var. */
   if (startinv == NULL) startinv = p;
   if (lastinv) lastinv->next = p;   /* update ptr. to last inverted file record */
   lastinv = p;
   lastdoc = p->doc;                 /* update ptr. to last document */
   (void) strncpy(currentterm, temp, MAXWORD);   /* update global var. currentterm */
                                                 /* to the new term just entered  */
}

/***********************************************************************
pr_invert( )

Returns:   void
Purpose:   Print the inverted file. It prints sequences of
           term  document#  weight.
**/
static void pr_invert( )
{
   struct invert *inv_addr;    /* tracks address of current inverted file record */
   struct doclist *doc_addr;   /* tracks address of current doclist record */

   inv_addr = startinv;        /* start with beginning of inverted file */
   while (inv_addr) {          /* check for legitimate inverted file record */
      (void) fprintf(output, "TERM: %s\n", inv_addr->term);
      (void) fprintf(output, "   DOCUMENT NUMBER              TERM WEIGHT\n");
      doc_addr = inv_addr->doc;   /* get addr. of first document */
      while (doc_addr) {   /* loop through all the associated doc.#s and weights */
         (void) fprintf(output, " %-30d ", doc_addr->doc);
         (void) fprintf(output, "%-10.5f\n", doc_addr->weight);
         doc_addr = doc_addr->nextdoc;   /* get addr. of next document */
      }
      (void) fprintf(output, "\n");
      inv_addr = inv_addr->next;   /* go to next inverted file record */
   }
}

/***********************************************************************
centroid( )

Returns:   void
Purpose:   Compute and return the centroid for the documents of
           the database. The centroid is computed by determining,
           for each index term in the inverted file, its average
           weight in the database. These average weights are then
           stored in the direct file entry for the centroid.
Notes:     Centroid is stored as a direct file record. The
           document number given to it is Number_of_docs + 1.
           (Note that these average weights are not used anywhere,
           but could be used in computing DV for terms.)
**/
static void centroid( )
{
   struct invert *inv_addr;    /* tracks address of current inverted file record */
   struct doclist *doc_addr;   /* tracks address of current doclist record */
   float total_weight;         /* tracks total weight for each term */
   float av_term_weight;       /* holds average term weight for each term */
   struct termlist *q;         /* structure used to create centroid entry in */
                               /* the direct file */
   struct termlist *lastterm;  /* tracks the last term in the centroid */

   start_centroid = get_mem_direct( );   /* centroid stored as direct file record */
   start_centroid->docnum = Number_of_docs + 1;   /* assign its pseudo doc.# */
   start_centroid->next = NULL;
   lastterm = NULL;            /* end of direct file chain */
   inv_addr = startinv;        /* begin at top of inverted file */
   while (inv_addr) {          /* while there is a legitimate inv. file record */
      doc_addr = inv_addr->doc;   /* get address of first document */
      total_weight = 0.0;         /* start with a 0 total weight for this term */
      while (doc_addr) {          /* if this is a legitimate doc. addr., loop */
                                  /* through all docs. for the term */
         total_weight = total_weight + doc_addr->weight;  /* update total weight */
         doc_addr = doc_addr->nextdoc;
      }
      /* calculating average term wt. */
      av_term_weight = total_weight/Number_of_docs;
      q = get_mem_termlist( );
      (void) strncpy(q->term, inv_addr->term, MAXWORD);
      q->weight = av_term_weight;
      q->nextterm = NULL;
      if (lastterm == NULL)   /* if this is the first term entry for the centroid */
         start_centroid->terms = q;
      else           /* else connect this term to the centroid's termlist chain */
         lastterm->nextterm = q;
      lastterm = q;
      inv_addr = inv_addr->next;   /* go on to the next inverted file entry */
   }
}

/***********************************************************************
pr_centroid( )

Returns:   void
Purpose:   Print the centroid from the direct file
**/
static void pr_centroid( )
{
   struct termlist *term_addr;   /* tracks address of current termlist record */

   /* note the centroid is given a document number = Number_of_docs + 1, */
   /* therefore it may be treated as a special kind of document vector   */
   if (start_centroid) {         /* if there is a centroid */
      (void) fprintf(output, "-----------------------------------------\n");
      (void) fprintf(output, "TERM                           WEIGHT \n");
      (void) fprintf(output, "-----------------------------------------\n");
      term_addr = start_centroid->terms;   /* get first term address */
      while (term_addr) {                  /* printing out term and weight pairs */
         (void) fprintf(output, "%-30s ", term_addr->term);
         (void) fprintf(output, "%-10.5f\n", term_addr->weight);
         term_addr = term_addr->nextterm;   /* loop through all terms */
      }
      (void) fprintf(output, "\n");
   }
}

/***********************************************************************
get_Poisson_dist( )

Returns:   void
Purpose:   Get the Poisson distribution data for any term
Notes:     This function has two parts:
           PART I:  Determine the actual within doc. freq.
                    distribution
           PART II: Determine the distribution anticipated under
                    the single Poisson model.
           It is assumed that the within document frequency of a
           term is stored as the term weight in the inverted file.
**/
static void get_Poisson_dist( )
{
   struct invert *inv_addr;    /* tracks address of current inverted file record */
   struct doclist *doc_ad;     /* tracks address of current doclist record */
   float dist[MAXWDF][2];      /* store for each term */
                               /* column 1 = within document frequency (wdf) */
                               /* column 2 = document frequency */
   int i;                      /* counter to add information to the dist array */
   int j;                      /* counter to loop through dist array */
   int found;                  /* flag used to match wdf in dist array */
   int numdocs;                /* counter to track # of docs. with the same wdf */
   int docs_with_term;         /* tracks the number of documents having the term */
   float lambda, first, second, exponent, result;
                               /* these five local variables are */
                               /* used to determine expected distribution */

   inv_addr = startinv;        /* start at the beginning of the inverted file */
   /* PART I: For each term determine the number of documents in the
      collection that have a particular wdf */
   while (inv_addr) {          /* check for legitimate inv. file record */
      docs_with_term = 0;
      doc_ad = inv_addr->doc;  /* get the first doc. address */
      i = 0;                   /* used to check if this is the very first entry in dist */
      while (doc_ad) {
         if (i == 0) {         /* if first entry in dist */
            dist[i][0] = doc_ad->weight;   /* assign wdf and doc. frequency = 1 */
            dist[i][1] = 1;                /* to first row in dist */
            i++;
            docs_with_term++;
         }
         else {   /* dist already has other entries, hence look for any */
                  /* previous entries for the same wdf value */
            found = 0;
            for (j = 0; j < i; j++) {      /* loop through dist */
               if (dist[j][0] == doc_ad->weight) {   /* if found the same wdf */
                  dist[j][1] = dist[j][1] + 1;   /* add 1 to the doc. frequency */
                  docs_with_term++;
                  found = 1;
               }
            }   /* ending for */
            if (found == 0) {              /* if not found the same wdf in dist */
               dist[i][0] = doc_ad->weight;   /* start new row in dist */
               dist[i][1] = 1;
               i++;
               docs_with_term++;
            }
         }   /* ending else */
         doc_ad = doc_ad->nextdoc;   /* loop through other documents for same term */
      }   /* ending while */
      /* now print out actual distribution information for this term */
      (void) fprintf(output, "\nTerm = %s\n", inv_addr->term);
      (void) fprintf(output, "\nActual Distribution:\n");
      (void) fprintf(output, "   WDF              #docs\n");
      (void) fprintf(output, "   0                %d\n",
                     Number_of_docs - docs_with_term);
      for (j = 0; j < i; j++)
         (void) fprintf(output, "   %-16.0f %-6.0f\n", dist[j][0], dist[j][1]);
      /* PART II: computing lambda - the only single Poisson parameter */
      /* call the function total_wdf to compute the total frequency of the term */
      lambda = (total_wdf(inv_addr->term))/Number_of_docs;
      first = exp(-lambda);
      (void) fprintf(output, "\nExpected Distribution:\n");
      (void) fprintf(output, "   WDF              #docs\n");
      j = -1;
      numdocs = -1;
      /* computing document frequency for each within document frequency value */
      while (numdocs != 0) {
         j = j + 1;
         exponent = j;   /* type conversion necessary for pow function */
         if (j == 0)
            result = first * Number_of_docs;
         else {
            second = pow(lambda, exponent);
            result = (((first * second)/factorial(j)) * Number_of_docs);
         }
         if ((result - floor(result)) < 0.5)
            numdocs = floor(result);
         else
            numdocs = ceil(result);
         (void) fprintf(output, "   %-16d %-6d\n", j, numdocs);
      }   /* end while */
      inv_addr = inv_addr->next;   /* continue with the next inverted file term */
   }   /* end while */
}

/***********************************************************************
factorial(n)

Returns:   float
Purpose:   Return the factorial of a number. Used in get_Poisson_dist
**/
static float factorial(n)
int n;   /* in: compute factorial for this parameter */
{
*/ { struct invert *inv_addr. Notes: It is assumed that the appropriate term frequency is stored as the term weight in the inverted file. /* tracks current inverted file record */ /* tracks current doclist record /* tracks the total frequency */ */ /* in: term for which total frequency is to be found file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.0). This routine can also be used to filter out the low and high frequency terms. return(answer)... /* holds the result */ if (n==1) return(1. float total.htm (72 of 114)7/3/2004 4:20:35 PM . The resulting mid frequency terms can be used as input to the program which generates hierarchies.ooks_Algorithms_Collection2ed/books/book5/chap09. } /*********************************************************************** total_wdf(term) Returns: Purpose: float Compute total frequency in the database for a specified term using the inverted file. **/ static float total_wdf(term) char term[MAXWORD]. struct doclist *doc_addr.Information Retrieval: CHAPTER 9: THESAURUS CONSTRUCTION float answer. answer = factorial(n-1)*n.
/* get first associated document address */ while (doc_addr) { /* update the total frequency */ total = total + doc_addr-weight.. /* initial value */ /* function find_term will find out where the term is in the inverted file */ inv_addr = find_term(term). if (inv_addr) { /* if this term was found in the inverted file */ doc_addr = inv_addr->doc.ooks_Algorithms_Collection2ed/books/book5/chap09. doc_addr = doc_addr->nextdoc. This function utilizes two parameters defined at the top of the program: LOW_THRESHOLD and HIGH_THRESHOLD. */ } } return(total).htm (73 of 114)7/3/2004 4:20:35 PM .0. /* loop through other associated docs. MEDIUM or LOW frequency depending upon its total frequency in the collection.. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.Information Retrieval: CHAPTER 9: THESAURUS CONSTRUCTION total = 0. which should be set by the user. } /*********************************************************************** Partition_terms( ) Returns: Purpose: void Assign each term in the inverted file to one class: HIGH.
**/

static void Partition_terms( )
{
   struct invert *inv_addr;   /* tracks address of current inverted file record */
   float total;               /* holds total frequency of each term */

   inv_addr = startinv;       /* start at the beginning of the inverted file */
   (void) fprintf(output,"\nTerm - Total Frequency - Frequency Class\n\n");
   while (inv_addr) {         /* if a legitimate address */
      /* compute total frequency for term in collection */
      total = total_wdf(inv_addr->term);
      (void) fprintf(output,"\n%s - %f -", inv_addr->term, total);
      if (total < LOW_THRESHOLD) (void) fprintf(output," LOW\n");
      else if (total > HIGH_THRESHOLD) (void) fprintf(output," HIGH\n");
      else (void) fprintf(output," MEDIUM\n");
      inv_addr = inv_addr->next;   /* continue with next inverted file entry */
   }
}

/***********************************************************************

   cohesion(term1, term2)

   Returns:  float
   Purpose:  Compute the cohesion between two terms
**/
static float cohesion(term1, term2)
char term1[MAXWORD], term2[MAXWORD];   /* in: the two terms which are being */
                                       /* compared to determine cohesion */
{
   float l1;       /* holds # of documents associated with term1 */
   float l2;       /* holds # of documents associated with term2 */
   float common;   /* holds # of documents in common */

   get_term_data(term1, term2, &l1, &l2, &common);
   return(common/(sqrt(l1 * l2)));
}

/***********************************************************************

   dv_all( )

   Returns:  void
   Purpose:  Compute Discrimination Value (DV) for all terms in the
             database
   Notes:    Similarity between two documents as calculated here is a
             function of the number of terms in common and the number
             of terms in each. Term weights are not involved.
**/

static void dv_all( )
{
   struct invert *inv_addr;   /* tracks address of current inv. file record */
   float DV;         /* holds computed DV */
   float baseline;   /* holds baseline similarity */

   /* first compute baseline similarity */
   baseline = av_doc_similarity("-");   /* the dummy term '-' is used for this */
   (void) fprintf(output, "--------------------------------------------\n");
   (void) fprintf(output, "TERM                           DV\n");
   (void) fprintf(output, "--------------------------------------------\n");
   inv_addr = startinv;
   while (inv_addr) {   /* begin at top of inverted file */
      /* if legitimate inverted file record */
      DV = av_doc_similarity(inv_addr->term) - baseline;
      (void) fprintf(output, " %-30s %-10.5f \n", inv_addr->term, DV);
      inv_addr = inv_addr->next;   /* go to next inverted file record */
   }
}

/***********************************************************************

   av_doc_similarity(term)

   Returns:  float
   Purpose:  Compute average similarity between each document and the
             centroid of the database. The word specified in term is
             ignored during computations.
**/
static float av_doc_similarity(term)
char term[MAXWORD];            /* in: term is ignored during computations */
{
   struct direct *dir_addr;    /* tracks current direct file record */
   float dl1;                  /* holds # of terms in document */
   float dl2;                  /* holds # of terms in centroid */
   float common;               /* holds # of terms in common between them */
   float total_sim;            /* holds total similarity */

   total_sim = 0.0;
   dir_addr = startdir;        /* begin with first direct file record */
   while (dir_addr) {
      /* get_doc_data returns # of terms in each document and # of terms in common */
      get_doc_data(dir_addr, start_centroid, &dl1, &dl2, &common, term);
      total_sim = total_sim + cosine(dl1, dl2, common);
      dir_addr = dir_addr->next;   /* go to next direct file record */
   }
   return(total_sim/Number_of_docs);
}

/***********************************************************************

   dice(term1, term2)

   Returns:  float
   Purpose:  Returns Dice's coefficient of similarity between
             any two terms
**/

static float dice(term1, term2)
char term1[MAXWORD], term2[MAXWORD];   /* in: the two terms that are being compared */
{
   float l1, l2, common;

   get_term_data(term1, term2, &l1, &l2, &common);
   if (l1 == 0 || l2 == 0) return(0.0);
   return(common/(l1 + l2));
}

/***********************************************************************

   cosine(l1, l2, common)

   Returns:  float
   Purpose:  Returns cosine similarity between two documents
**/

static float cosine(l1, l2, common)
float l1;       /* in: # of terms associated with document 1 */
float l2;       /* in: # of terms associated with document 2 */
float common;   /* in: # of terms in common between them */
{
   float temp;
   if (l1 == 0 || l2 == 0) return(0.0);
   temp = sqrt(l1 * l2);
   return(common/temp);
}

/***********************************************************************

   get_doc_data(p, q, l1, l2, common, index)

   Returns:  void
   Purpose:  Given two document numbers, it determines the number of
             index terms in each and in common. It will exclude the
             index term (specified as the last parameter) from
             consideration. Used in av_doc_similarity for DV
             calculations.
**/

static void get_doc_data(p, q, l1, l2, common, index)
struct direct *p, *q;    /* in: addresses of two document numbers */
float *l1;               /* out: number of terms in each document */
float *l2;
float *common;           /* out: number of terms in common */
char index[MAXWORD];     /* in: term to be excluded from computations */
{
   struct termlist *term_addr1;   /* holds address of first doc's termlist */
   struct termlist *term_addr2;   /* holds address of second doc's termlist */
   int count1;                    /* number of terms in first doc. */
   int count2;                    /* number of terms in second doc. */
   int com;                       /* number of terms in common */

   count1 = 0;
   count2 = 0;
   com = 0;
   term_addr1 = p->terms;
   /* first find out number of terms in document 1 & # of common terms */
   while (term_addr1) {
      if (strncmp(term_addr1->term, index, MAXWORD)) {
         /* if they do not match the term to exclude */
         count1 = count1 + 1;
         term_addr2 = q->terms;
         while (term_addr2) {
            if (!strncmp(term_addr1->term, term_addr2->term, MAXWORD)) {
               /* if the two terms are the same */
               com = com + 1;
               break;
            }
            term_addr2 = term_addr2->nextterm;
         }
      }
      term_addr1 = term_addr1->nextterm;
   }
   /* now find out number of terms in document 2 */
   term_addr2 = q->terms;
   while (term_addr2) {
      if (strncmp(term_addr2->term, index, MAXWORD)) count2 = count2 + 1;
      /* if it is not the term to exclude */
      term_addr2 = term_addr2->nextterm;
   }
   *l1 = count1;
   *l2 = count2;
   *common = com;
}

/***********************************************************************

   get_term_data(term1, term2, l1, l2, common)

   Returns:  void
   Purpose:  Get info regarding number of documents in common between
             any two terms.
**/

static void get_term_data(term1, term2, l1, l2, common)
char term1[MAXWORD];   /* in: term 1 to be compared with */
char term2[MAXWORD];   /* in: term 2 */
float *l1;             /* out: # of documents associated with 1st term */
float *l2;             /* out: # of documents associated with 2nd term */
float *common;         /* out: # of documents in common between them */
{
   struct invert *p;          /* holds address of first term in inverted file */
   struct invert *q;          /* holds address of second term in inv. file */
   struct doclist *doc_ad1;   /* tracks doclist for first term */
   struct doclist *doc_ad2;   /* tracks doclist for second term */
   int count1;   /* holds # of documents associated with 1st term */
   int count2;   /* holds # of documents associated with 2nd term */
   int com;      /* holds # of documents common between them */

   /* find addresses of both terms in the inverted file */
   p = find_term(term1);
   q = find_term(term2);
   doc_ad1 = p->doc;   /* obtain 1st term's doclist address */
   count1 = 0;
   count2 = 0;
   com = 0;
   /* first get # of documents indexed by term 1 & # of docs. in common */
   while (doc_ad1) {
      count1 = count1 + 1;
      doc_ad2 = q->doc;
      while (doc_ad2) {
         if (doc_ad1->doc == doc_ad2->doc) {
            /* if the document numbers are the same */
            com = com + 1;
            break;
         }
         doc_ad2 = doc_ad2->nextdoc;
      }
      doc_ad1 = doc_ad1->nextdoc;
   }
   /* now get # of documents indexed by term 2 */
   doc_ad2 = q->doc;
   while (doc_ad2) {
      count2 = count2 + 1;
      doc_ad2 = doc_ad2->nextdoc;
   }
   *l1 = count1;
   *l2 = count2;
   *common = com;
}

/***********************************************************************

   *find_term(term)

   Returns:  address of a struct invert record
   Purpose:  search for a specified term in the inverted file & return
             address of the record
**/

struct invert *find_term(term)
char term[MAXWORD];   /* in: term to be located in inverted file */
{
   struct invert *inv_addr;   /* tracks addr. of current rec. in inv. file */

   inv_addr = startinv;       /* begin at top of inv. file */
   while (inv_addr) {
      if (!strcmp(term, inv_addr->term)) {
         return(inv_addr);
      }
      inv_addr = inv_addr->next;
   }
   (void) fprintf(output, "Findterm routine: Term %s not found\n", term);
   return(NULL);
}

/***********************************************************************

   *get_mem_direct( )

   Returns:  address of a struct direct record
   Purpose:  dynamically obtain enough memory to store 1 direct record
**/

static struct direct *get_mem_direct( )
{
   struct direct *record;

   record = (struct direct *)malloc(sizeof(directfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      exit(0);
   }
   return(record);
}

/***********************************************************************

   *get_mem_termlist( )

   Returns:  address of a struct termlist record
   Purpose:  dynamically obtain enough memory to store one termlist
             record
**/

static struct termlist *get_mem_termlist( )
{
   struct termlist *record;

   record = (struct termlist *)malloc(sizeof(termlistfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      exit(0);
   }
   return(record);
}

/***********************************************************************

   *get_mem_invert( )

   Returns:  address of a struct invert record
   Purpose:  dynamically obtain enough memory to store one inverted
             file record
**/

static struct invert *get_mem_invert( )
{
   struct invert *record;

   record = (struct invert *)malloc(sizeof(invfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      exit(0);
   }
   return(record);
}

/***********************************************************************

   *get_mem_doclist( )

   Returns:  address of a struct doclist record
   Purpose:  dynamically obtain enough memory to store one doclist
             record
**/

static struct doclist *get_mem_doclist( )
{
   struct doclist *record;

   record = (struct doclist *)malloc(sizeof(doclistfile));
   if (!record) {
      (void) fprintf(output, "\nout of memory\n");
      exit(0);
   }
   return(record);
}

/* PROGRAM NAME: merge.c
   PURPOSE: This program is used to merge two separate hierarchies.
   The program first reads each inverted file. It then reads in the
   corresponding link file which gives parent-child links to build
   the hierarchy. It can then perform two different types of mergers:

   1) Simple merge in which a point of connection between the two
      thesauri is made wherever they have terms in common.

   2) Complex merge in which any two terms are connected if they have
      sufficiently (above a specified threshold) similar sets of
      parent and child terms.

   INPUT FILES REQUIRED:

   1) inverted file for 1st hierarchy
   2) links file for 1st hierarchy
   3) inverted file for 2nd hierarchy
   4) links file for 2nd hierarchy

   An inverted file consists of sequences of:
      term   document number   weight
   (multiple entries for any term should be grouped together)

   A link file consists of sequences of:
      parent term   child term

   PARAMETERS TO BE SET BY USER:

   1) MAXWORD:   which specifies the maximum size of a term
   2) THRESHOLD: which specifies the minimum similarity level
      for use in complex merge.

   COMMAND LINE:

   merge inverted_file_1 link_file_1 inverted_file_2 link_file_2 output_file

***********************************************************************/

#include <stdio.h>
#include <string.h>
#include <math.h>

#define MAXWORD 20      /* maximum size of a term */
#define THRESHOLD 0.6   /* similarity threshold for complex_merge */

struct doclist {                    /* sequences of document # and weight */
   int doc;                         /* document number */
   float weight;                    /* term weight in document */
   struct doclist *nextdoc;         /* ptr. to next doclist record */
} doclistfile;

struct parentlist {                 /* sequences of parent terms */
   char term[MAXWORD];              /* parent term */
   struct invert *parent;           /* ptr. to parent term in inverted file */
   struct parentlist *nextparent;   /* ptr. to next parentlist record */
} parentfile;

struct connections {                /* holds information about connected terms */
   struct invert *termadd;          /* address of connected term in inverted file */
   struct connections *next_connection;   /* ptr. to next connections record */
} connectlist;

struct childlist {                  /* sequences of child terms */
   char term[MAXWORD];              /* child term */
   struct invert *child;            /* ptr. to child term in inverted file */
   struct childlist *nextchild;     /* ptr. to next childlist record */
} childfile;

struct invert {                     /* inverted file */
   char term[MAXWORD];              /* term */
   struct doclist *doc;             /* sequences of document # and weight */
   struct parentlist *parents;      /* pointer to list of parent terms */
   struct childlist *children;      /* pointer to list of children terms */
   struct connections *connect;     /* pointer to connection in other hierarchy */
   struct invert *nextterm;         /* ptr. to next invert record */
} invfile;

static struct invert *startinv;     /* ptr. to first record in inverted file */
static struct invert *lastinv;      /* ptr. to last record in inverted file */
static struct doclist *lastdoc;     /* ptr. to last document in doclist */
static struct invert *start_inv1;   /* ptr. to the start of 1st inverted file */
static struct invert *start_inv2;   /* ptr. to the start of 2nd inverted file */
static FILE *input1;   /* first inverted file */
static FILE *input2;   /* first link file */
static FILE *input3;   /* second inverted file */
static FILE *input4;   /* second link file */
static FILE *output;   /* holds any outputs */
static char currentterm[MAXWORD];   /* tracks current term in inverted file */

static struct invert *get_mem_invert( );           /* these four get_mem functions  */
static struct doclist *get_mem_doclist( );         /* obtain memory dynamically for */
static struct parentlist *get_mem_parentlist( );   /* storing different types of    */
static struct childlist *get_mem_childlist( );     /* records.                      */
static struct connections *get_mem_connections( );
static struct invert *find_term( );   /* searches for term in inverted file and */
                                      /* returns address of the term */
static int compare( );
static void initialize( ),            /* initialize global variables */
   open_files( ),                     /* open files */
   read_invfile( ),                   /* read in the inverted file */
   add_invert( ),                     /* called within read_invfile( ) */
   read_links( ),                     /* read in the links information */
   add_link( ),                       /* called within read_links( ) */
   pr_invert( ),                      /* print the inverted file */
   simple_merge( ),                   /* simple merge between both hierarchies */
   complex_merge( );                  /* complex merge between hierarchies */

int main(argc, argv)
int argc;
char *argv[ ];
{
   char ch;

   if (argc != 6) {
      (void) printf("There is an error in the command line\n");
      (void) printf("Correct usage is:\n");
      (void) printf("merge inverted_file_1 link_file_1 inverted_file_2 link_file_2 output_file\n");
      exit(1);
   }
   start_inv1 = NULL;   /* initialize start of both inverted files */
   start_inv2 = NULL;
   initialize( );
   open_files(argv);
   (void) fprintf(output, "\nREADING FIRST INVERTED FILE\n");
   read_invfile(input1);
   start_inv1 = startinv;
   (void) fprintf(output, "\nPRINTING FIRST INVERTED FILE\n\n");
   pr_invert(start_inv1);
   (void) fprintf(output, "\nREADING FIRST LINK FILE\n");
   read_links(input2, start_inv1);
   /* re-initialize */
   initialize( );
   (void) fprintf(output, "\nREADING SECOND INVERTED FILE\n");
   read_invfile(input3);
   start_inv2 = startinv;
   (void) fprintf(output, "\nPRINTING SECOND INVERTED FILE\n\n");
   pr_invert(start_inv2);
   (void) fprintf(output, "\nREADING SECOND LINK FILE\n");
   read_links(input4, start_inv2);
   (void) printf("To use the simple_merge algorithm, enter 1\n");
   (void) printf("To use the complex_merge algorithm, enter 2\n");
   (void) printf("Make a selection\n");
   (void) printf("\nEnter selection: ");
   ch = getchar( );
   switch(ch) {
   case '1':
      (void) fprintf(output, "\nPERFORMING A SIMPLE MERGE OF THE TWO INVERTED FILES\n");
      simple_merge(start_inv1, start_inv2);
      (void) fprintf(output, "\nPRINTING FIRST INVERTED FILE AFTER SIMPLE MERGE\n\n");
      pr_invert(start_inv1);
      (void) fprintf(output, "\nPRINTING SECOND INVERTED FILE AFTER SIMPLE MERGE\n\n");
      pr_invert(start_inv2);
      break;
   case '2':
      (void) fprintf(output, "\nPERFORMING A COMPLEX MERGE OF THE TWO INVERTED FILES\n");
      complex_merge(start_inv1, start_inv2);
      (void) fprintf(output, "\nPRINTING FIRST INVERTED FILE AFTER COMPLEX MERGE\n\n");
      pr_invert(start_inv1);
      (void) fprintf(output, "\nPRINTING SECOND INVERTED FILE AFTER COMPLEX MERGE\n\n");
      pr_invert(start_inv2);
      break;
   }
   (void) fclose(input1);
   (void) fclose(input2);
   (void) fclose(input3);
   (void) fclose(input4);
   (void) fclose(output);
   return(0);
}
/***********************************************************************

   open_files(argv)

   Returns:  void
   Purpose:  Open all input & output files
**/

static void open_files(argv)
char *argv[ ];
{
   if ((input1 = fopen(argv[1], "r")) == NULL) {
      (void) printf("couldn't open file %s\n", argv[1]);
      exit(1);
   }
   /* inverted file for first thesaurus hierarchy */
   if ((input2 = fopen(argv[2], "r")) == NULL) {
      (void) printf("couldn't open file %s\n", argv[2]);
      exit(1);
   }
   /* link file for first thesaurus hierarchy */
   if ((input3 = fopen(argv[3], "r")) == NULL) {
      (void) printf("couldn't open file %s\n", argv[3]);
      exit(1);
   }
   /* inverted file for second thesaurus hierarchy */
   if ((input4 = fopen(argv[4], "r")) == NULL) {
      (void) printf("couldn't open file %s\n", argv[4]);
      exit(1);
   }
   /* link file for second thesaurus hierarchy */
   if ((output = fopen(argv[5], "w")) == NULL) {
      (void) printf("couldn't open file %s for output\n", argv[5]);
      exit(1);
   }
   /* output file */
}

/***********************************************************************

   initialize( )

   Returns:  void
   Purpose:  Initialize global variables
**/

static void initialize( )
{
   startinv = NULL;         /* start of inverted file */
   lastinv = NULL;          /* end of inverted file */
   lastdoc = NULL;          /* last document considered */
   currentterm[0] = '\0';   /* current term being considered */
}

/***********************************************************************

   read_invfile(input)

   Returns:  void
   Purpose:  Read the inverted file from a disk file
**/

static void read_invfile(input)
FILE *input;
{
   int docid;            /* holds current document number */
   char temp[MAXWORD];   /* holds current term */
   float weight;         /* holds current term weight */
   struct doclist *p;    /* structure to hold document number-weight pair */

   (void) fscanf(input, "%s%d%f", temp, &docid, &weight);   /* read next line */
   while (strlen(temp) > 0) {   /* while a legitimate line */
      if (!strncmp(currentterm, temp, MAXWORD)) {
         /* if temp is the same as current term then simply add next
            document-weight info */
         p = get_mem_doclist( );
         p->doc = docid;       /* assign doc. number */
         p->weight = weight;   /* assign doc. weight */
         p->nextdoc = NULL;
         if (lastdoc) lastdoc->nextdoc = p;   /* connect p to doclist chain for this term */
         lastdoc = p;          /* set this global variable */
      }
      else add_invert(docid, temp, weight);
         /* temp not the same as current term, hence */
         /* start a new entry in the inverted file   */
      temp[0] = '\0';
      (void) fscanf(input, "%s%d%f", temp, &docid, &weight);   /* read next input line */
   }
}

/***********************************************************************

   add_invert(docid, temp, weight)

   Returns:  void
   Purpose:  Called in read_invfile when a new term is being read from
             the file. Starts a new entry in the inverted file.
**/

static void add_invert(docid, temp, weight)
int docid;            /* in: document number */
char temp[MAXWORD];   /* in: new index term */
float weight;         /* in: index term weight */
{
   struct invert *p;   /* structure p will be attached to inv. file */

   p = get_mem_invert( );                    /* get memory for p */
   (void) strncpy(p->term, temp, MAXWORD);   /* copy over the term */
   p->parents = NULL;                        /* initially term has no parents */
   p->children = NULL;            /* also no children terms */
   p->doc = get_mem_doclist( );   /* get memory for a doclist structure */
   p->doc->doc = docid;           /* assign the document number */
   p->doc->weight = weight;       /* assign term weight */
   p->doc->nextdoc = NULL;
   p->connect = NULL;     /* initially this term not connected to any */
                          /* other in any other hierarchy */
   p->nextterm = NULL;
   if (startinv == NULL) startinv = p;   /* if this is the 1st term in inverted file */
   if (lastinv) lastinv->nextterm = p;
   lastinv = p;           /* always update lastinv pointer */
   lastdoc = p->doc;
   (void) strncpy(currentterm, temp, MAXWORD);
   /* update the value of currentterm to the */
   /* new term that has just been read */
}

/***********************************************************************

   read_links(input, startinv)

   Returns:  void
   Purpose:  Add the link information to the inverted file
**/
static void read_links(input, startinv)
FILE *input;               /* in: input file */
struct invert *startinv;   /* in: start of this inverted file */
{
   char parent[MAXWORD];   /* holds parent term */
   char child[MAXWORD];    /* holds child term */

   (void) fscanf(input, "%s%s", parent, child);   /* read first input line */
   while (strlen(parent) > 0 && strlen(child) > 0) {   /* while non-trivial input */
      if (!find_term(parent, startinv) || !find_term(child, startinv)) {
         (void) printf("Term %s or term %s is not in inverted file\n", parent, child);
         (void) printf("Please check your input files\n");
         exit(0);
      }
      add_link(parent, child, startinv);   /* this function makes links */
      parent[0] = '\0';   /* throw out old parent & child info */
      child[0] = '\0';
      (void) fscanf(input, "%s%s", parent, child);   /* read next line */
   }
}

/***********************************************************************

   add_link(parent, child, startinv)
   Returns:  void
   Purpose:  Function is used within read_links
**/

static void add_link(parent, child, startinv)
char parent[MAXWORD];      /* in: specify parent term */
char child[MAXWORD];       /* in: specify child term */
struct invert *startinv;   /* in: specify start of this inv. file */
{
   struct invert *p, *q;            /* holds adds. of both terms in inv. file */
   struct parentlist *new_parent;   /* structure to hold new parent info. */
   struct childlist *new_child;     /* structure to hold new child info. */

   p = find_term(parent, startinv);   /* find address of parent & child */
   q = find_term(child, startinv);    /* terms in the inverted file */
   /* first add parent links for the given child */
   new_parent = get_mem_parentlist( );   /* get memory for parent record */
   (void) strncpy(new_parent->term, parent, MAXWORD);   /* copy over parent term */
   new_parent->parent = p;   /* store addr. (in inverted file) of parent term */
   new_parent->nextparent = NULL;
   if (q->parents == NULL) {    /* i.e. no parents listed for this child yet */
      q->parents = new_parent;  /* first parent link made */
   }
   else {   /* at least 1 parent already listed for given child */
      new_parent->nextparent = q->parents;   /* attach new parent in front of list */
      q->parents = new_parent;
   }
   /* next add child links for given parent */
   new_child = get_mem_childlist( );   /* get memory for child record */
   (void) strncpy(new_child->term, child, MAXWORD);   /* copy over child term */
   new_child->child = q;   /* store addr. (in inverted file) of child term */
   new_child->nextchild = NULL;
   if (p->children == NULL) {    /* i.e. no child terms listed for this parent */
      p->children = new_child;   /* first child link made */
   }
   else {   /* at least 1 child already exists for given parent */
      new_child->nextchild = p->children;   /* attach new child to front of list */
      p->children = new_child;
   }
}

/***********************************************************************

   pr_invert(startinv)
   Returns:  void
   Purpose:  Print an inverted file. Prints each term, its associated
             document numbers, term-weights and parent and child terms.
**/

static void pr_invert(startinv)
struct invert *startinv;   /* in: specifies start of inverted file */
{
   struct invert *inv_addr;          /* tracks add. of current inv. record */
   struct doclist *doc_addr;         /* tracks add. of current doclist record */
   struct parentlist *parent_addr;   /* tracks add. of current parentlist record */
   struct childlist *child_addr;     /* tracks add. of current childlist record */
   struct connections *connect_term_add;   /* tracks connected terms */

   inv_addr = startinv;
   while (inv_addr) {   /* begin at top of inv. file */
      /* while a legitimate term */
      (void) fprintf(output, "TERM: %s\nPARENT TERMS: ", inv_addr->term);
      parent_addr = inv_addr->parents;   /* find addr. of first parent */
      while (parent_addr) {   /* printing all parents */
         (void) fprintf(output, "%s", parent_addr->term);
         parent_addr = parent_addr->nextparent;   /* loop through remaining parents */
         if (parent_addr) (void) fprintf(output, ", ");
      }
      (void) fprintf(output, "\nCHILD TERMS: ");
      child_addr = inv_addr->children;   /* find addr. of first child */
      while (child_addr) {   /* printing all children */
         (void) fprintf(output, "%s", child_addr->term);
         child_addr = child_addr->nextchild;   /* loop through remaining children */
         if (child_addr) (void) fprintf(output, ", ");
      }
      (void) fprintf(output, "\nDOCUMENT NUMBER               TERM WEIGHT\n");
      doc_addr = inv_addr->doc;   /* find addr. of first associated doc. */
      while (doc_addr) {   /* printing all documents */
         (void) fprintf(output, "%-30d", doc_addr->doc);
         (void) fprintf(output, "%-10.5f\n", doc_addr->weight);
         doc_addr = doc_addr->nextdoc;   /* loop through remaining docs. */
      }
      /* if the term is connected then print the term from the other hierarchy */
      (void) fprintf(output, "CONNECTIONS IN OTHER THESAURUS HIERARCHY:\n");
      connect_term_add = inv_addr->connect;
      while (connect_term_add) {
         (void) fprintf(output, " %s", connect_term_add->termadd->term);
         connect_term_add = connect_term_add->next_connection;
         if (connect_term_add) (void) fprintf(output, ", ");
      }
      (void) fprintf(output, "\n\n");
      inv_addr = inv_addr->nextterm;   /* loop to next term in inverted file */
   }
}

/***********************************************************************

   simple_merge(startinv1, startinv2)

   Returns:  void
   Purpose:  In this function, two terms in different hierarchies are
             merged if they are identical.
**/

static void simple_merge(startinv1, startinv2)
struct invert *startinv1;   /* in: specifies start of 1st inv. file */
struct invert *startinv2;   /* in: specifies start of 2nd inv. file */
{
   struct invert *inv_addr1, *inv_addr2;
   struct connections *r1, *r2;   /* storage to hold info. about connected terms */

   inv_addr1 = startinv1;   /* start with top of 1st inv. file */
   while (inv_addr1) {
      /* looking for this term in the other inv. file */
inv_addr2->connect = r2. inv_addr1->connect = r1. r1->termadd = inv_addr2. startinv2) file:///C|/E%20Drive%20Data/My%20Books/Algorithm/Dr. startinv2) Returns: Purpose: void In this routine any two terms in different hierarchies are merged if they have 'similar' parents and children.. } inv_addr1 = inv_addr1->nextterm. r2->termadd = inv_addr1. Similarity is computed and compared to a pre-fixed user specified THRESHOLD **/ static void complex_merge(startinv1. startinv2). } } /*********************************************************************** complex_merge(startinv1.htm (105 of 114)7/3/2004 4:20:35 PM .. if (inv_addr2) { /* if term was found then update connect */ r1 = get_mem_connections( ). r2-=>next_connection = inv_addr2->connect.Information Retrieval: CHAPTER 9: THESAURUS CONSTRUCTION inv_addr2 = find_term(inv_addr1->term. r1->next_connection = inv_addr1->connect. r2 = get_mem_connections( ).oks_Algorithms_Collection2ed/books/book5/chap09.
struct invert *startinv1;    /* in: specifies start of 1st inv. file */
struct invert *startinv2;    /* in: specifies start of 2nd inv. file */
{
    struct invert *inv1_addr;       /* tracks current term in 1st inv. file */
    struct invert *inv2_addr;       /* tracks current term in 2nd inv. file */
    struct connections *r1, *r2;    /* tracks connected terms */
    int compare();

    inv1_addr = startinv1;                      /* begin at top of 1st inv. file */
    while (inv1_addr) {                         /* loop through 1st inv. file */
        inv2_addr = startinv2;                  /* now begin at top of 2nd inv. file */
        while (inv2_addr) {                     /* loop through 2nd inv. file */
            if (compare(inv1_addr, inv2_addr)) {
                /* compare returns 1 if the two terms are similar enough;
                   then connect the two terms */
                r1 = get_mem_connections();
                r2 = get_mem_connections();
                r1->termadd = inv2_addr;
                r2->termadd = inv1_addr;
                r1->next_connection = inv1_addr->connect;
                r2->next_connection = inv2_addr->connect;
                inv1_addr->connect = r1;
                inv2_addr->connect = r2;
            }
            inv2_addr = inv2_addr->nextterm;
        }
        inv1_addr = inv1_addr->nextterm;
    }
}

/***********************************************************************
compare(p,q)
Returns:   int
Purpose:   Used to compare two terms for more than just equality.
           A similarity value is computed and if it is greater than
           a THRESHOLD then 1 is returned, else 0.
**/
static int compare(p, q)
struct invert *p, *q;    /* addresses of two terms to be compared */
{
    struct parentlist *parent1;    /* tracks parentlist of 1st term */
    struct parentlist *parent2;    /* tracks parentlist of 2nd term */
    struct childlist *child1;      /* tracks childlist of 1st term */
    struct childlist *child2;      /* tracks childlist of 2nd term */
    float count1;                  /* tracks # of parents + children of 1st term */
    float count2;                  /* tracks # of parents + children of 2nd term */
    float count;                   /* tracks # of common parents + children */
    count = 0.0;
    count1 = 0.0;
    count2 = 0.0;    /* initialize all counts */

    /* first check # of parents for p->term & the # of common parents */
    parent1 = p->parents;
    while (parent1) {                 /* loop through parents of 1st term */
        count1 = count1 + 1.0;
        parent2 = q->parents;
        while (parent2) {             /* loop through parents of 2nd term */
            if (!strncmp(parent1->term, parent2->term, MAXWORD))
                count = count + 1.0;
            parent2 = parent2->nextparent;
        }
        parent1 = parent1->nextparent;
    }

    /* next compute # of parents for q->term */
    parent2 = q->parents;
    while (parent2) {                 /* loop through parents of 2nd term again */
        count2 = count2 + 1.0;
        parent2 = parent2->nextparent;
    }

    /* now check # of children for p->term & the # of common children */
    child1 = p->children;
    while (child1) {                  /* loop through children of 1st term */
        count1 = count1 + 1.0;
        child2 = q->children;
        while (child2) {              /* loop through children of 2nd term */
            if (!strncmp(child1->term, child2->term, MAXWORD))
                count = count + 1.0;
            child2 = child2->nextchild;
        }
        child1 = child1->nextchild;
    }

    /* next compute # of children for q->term */
    child2 = q->children;
    while (child2) {                  /* loop through children of 2nd term */
        count2 = count2 + 1.0;
        child2 = child2->nextchild;
    }

    if (count != 0.0) {               /* if there is anything in common at all */
        if ((count / (sqrt(count1 * count2))) >= THRESHOLD) {
            /* printf("value is %f\n", (count/(sqrt(count1*count2)))); */
            return(1);
        }
        else return(0);
    }
    return(0);
}
/**********************************************************************
find_term(term,startinv)
Returns:   address of a struct invert record
Purpose:   Search for a specified term in the specified inverted file
           and return address of the corresponding record. If not
           found then returns NULL.
**/
static struct invert *find_term(term, startinv)
char term[MAXWORD];         /* in: term to be searched */
struct invert *startinv;    /* in: inverted file to search in */
{
    struct invert *inv_addr;

    inv_addr = startinv;                        /* begin at top of inverted file */
    while (inv_addr) {
        if (!strcmp(term, inv_addr->term)) return(inv_addr);
        inv_addr = inv_addr->nextterm;
    }
    return(NULL);
}

/**********************************************************************
get_mem_invert()
Returns:   address of a struct invert record
Purpose:   dynamically obtain enough memory to store 1 invert record
**/
static struct invert *get_mem_invert()
{
    struct invert *record;

    record = (struct invert *) malloc(sizeof(invfile));
    if (!record) {
        (void) fprintf(output, "\nout of memory\n");
        return(NULL);
    }
    return(record);
}

/***********************************************************************
get_mem_doclist()
Returns:   address of a struct doclist record
Purpose:   dynamically obtain enough memory to store 1 doclist record
**/
static struct doclist *get_mem_doclist()
{
    struct doclist *record;

    record = (struct doclist *) malloc(sizeof(doclistfile));
    if (!record) {
        (void) fprintf(output, "\nout of memory\n");
        return(NULL);
    }
    return(record);
}

/***********************************************************************
get_mem_parentlist()
Returns:   address of a struct parentlist record
Purpose:   dynamically obtain enough memory to store 1 parentlist record.
**/
static struct parentlist *get_mem_parentlist()
{
    struct parentlist *record;

    record = (struct parentlist *) malloc(sizeof(parentfile));
    if (!record) {
        (void) fprintf(output, "\nout of memory\n");
        return(NULL);
    }
    return(record);
}

/***********************************************************************
get_mem_childlist()
Returns:   address of a struct childlist record
Purpose:   dynamically obtain enough memory to store 1 childlist record.
**/
static struct childlist *get_mem_childlist()
{
    struct childlist *record;

    record = (struct childlist *) malloc(sizeof(childfile));
    if (!record) {
        (void) fprintf(output, "\nout of memory\n");
        return(NULL);
    }
    return(record);
}

/***********************************************************************
get_mem_connections()
Returns:   address of a struct connections record
Purpose:   dynamically obtain enough memory to store 1 connections record.
**/
static struct connections *get_mem_connections()
{
    struct connections *record;

    record = (struct connections *) malloc(sizeof(connectlist));
    if (!record) {
        (void) fprintf(output, "\nout of memory\n");
        return(NULL);
    }
    return(record);
}
CHAPTER 10: STRING SEARCHING ALGORITHMS

Ricardo A. Baeza-Yates

Department of Computer Science, University of Chile, Casilla 2777, Santiago, Chile

Abstract

We survey several algorithms for searching a string in a text. We include theoretical and empirical results, as well as the actual code of each algorithm. An extensive bibliography is also included.

10.1 INTRODUCTION

String searching is an important component of many problems, including text editing, data retrieval, and symbol manipulation. The string searching or string matching problem consists of finding all occurrences (or the first occurrence) of a pattern in a text, where the pattern and the text are strings over some alphabet. We are interested in reporting all the occurrences. Despite the use of indices for searching large amounts of text, string searching may help in an information retrieval system. For example, it may be used for filtering of potential matches or for searching retrieval terms that will be highlighted in the output.

We present the most important algorithms for string matching: the naive or brute force algorithm, the Knuth-Morris-Pratt (1977) algorithm, different variants of the Boyer-Moore (1977) algorithm, the shift-or algorithm from Baeza-Yates and Gonnet (1989), and the Karp-Rabin (1987) algorithm, which is probabilistic. We also survey the main theoretical results for each algorithm. Experimental results for random text and one sample of English text are included. Although we only cover string searching, references for related problems are given.

It is well known that to search for a pattern of length m in a text of length n (where n > m) the search time is O(n) in the worst case (for fixed m). Moreover, in the worst case, at least n - m + 1 characters must be inspected. This result is due to Rivest (1977). However, for different algorithms the constant in the linear term can be very different. For example, in the worst case, the constant multiple in the naive algorithm is m, whereas for the Knuth-Morris-Pratt (1977) algorithm it is two.

We use the C programming language described by Kernighan and Ritchie (1978) to present our algorithms.

10.2 PRELIMINARIES
We use the following notation:

n: the length of the text
m: the length of the pattern (string)
c: the size of the alphabet
Cn: the expected number of comparisons performed by an algorithm while searching the pattern in a text of length n

Theoretical results are given for the worst case number of comparisons, and the average number of comparisons between a character in the text and a character in the pattern (text-pattern comparisons) when finding all occurrences of the pattern in the text, where the average is taken uniformly with respect to strings of length n over a given finite alphabet.

The empirical data consist of results for two types of text: random text and English text. In each case, the two cost functions we measured were the number of comparisons performed between a character in the text and a character in the pattern, and the execution time.

In the case of random text, the text was of length 40,000, and both the text and the pattern were chosen uniformly and randomly from an alphabet of size c. Alphabets of size c = 4 (DNA bases) and c = 30 (approximately the number of lowercase English letters) were considered. For the case of English text we used a document of approximately 48,000 characters. The alphabet used was the set of lowercase letters, some digits, and punctuation symbols, giving 32 characters. For almost all algorithms, patterns of lengths 2 to 20 were considered, with the patterns chosen at random from words inside the text in such a way that a pattern was always a prefix of a word (typical searches).

To determine the number of comparisons, 100 runs were performed. The execution time was measured while searching 1,000 patterns. Unsuccessful searches were not considered, because we expect unsuccessful searches to be faster than successful searches (fewer comparisons on average). The results for English text are not statistically significant because only one text sample was used. However, they show the correlation of searching patterns extracted from the same text, and we expect that other English text samples will give similar results.

Quoting Knuth et al. (1977): "It might be argued that the average case taken over random strings is of little interest, since a user rarely searches for a random string." However, this model is a reasonable approximation when we consider those pieces of text that do not contain the pattern, and the algorithm obviously must compare every character of the text in those places where the pattern does occur. Our experimental results show that this is the case, and they agree with those presented by Davies and Bowsher (1986) and Smit (1982).
We define a random string of length l as a string built as the concatenation of l characters chosen independently and uniformly from the alphabet. That is, the probability of two characters being equal is 1/c. For example, the probability of finding a match between a random text of length m and a random pattern of length m is 1/c^m. The expected number of matches of a random pattern of length m in a random text of length n is therefore (n - m + 1)/c^m. Our random text model is similar to the one used in Knuth et al. (1977) and Schaback (1988).

10.3 THE NAIVE ALGORITHM

The naive, or brute force, algorithm is the simplest string matching method. The idea consists of trying to match any substring of length m in the text with the pattern (see Figure 10.1).

naivesearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int i, j, k, lim;

    lim = n-m+1;
    for( i = 1; i <= lim; i++ )    /* Search */
    {
        k = i;
        for( j=1; j<=m && text[k] == pat[j]; j++ ) k++;
        if( j > m ) Report_match_at_position( i-j+1 );
    }
}

Figure 10.1: The naive or brute force string matching algorithm

The expected number of text-pattern comparisons performed by the naive or brute force algorithm when searching with a pattern of length m in a text of length n (n > m) is given by Baeza-Yates (1989c) as

    Cn = (c/(c-1)) (1 - 1/c^m) (n - m + 1)

(each of the n - m + 1 alignments costs 1 + 1/c + ... + 1/c^(m-1) expected comparisons). This is O(n) with constant at most c/(c-1), drastically different from the worst case mn.

10.4 THE KNUTH-MORRIS-PRATT ALGORITHM

The classic Knuth, Morris, and Pratt (1977) algorithm, discovered in 1970, is the first algorithm for which the constant factor in the linear term, in the worst case, does not depend on the length of the pattern. It is based on preprocessing the pattern in time O(m). The expected number of comparisons performed by this algorithm (search time only) is also bounded by 2n + O(m).

The basic idea behind this algorithm is that each time a mismatch is detected, the "false start" consists of characters that we have already examined. We can take advantage of this information instead of repeating comparisons with the known characters. Moreover, it is always possible to arrange the algorithm so that the pointer in the text is never decremented. To accomplish this, the pattern is preprocessed to obtain a table that gives the next position in the pattern to be processed after a mismatch. In other words, we consider the maximal matching prefix of the pattern such that the next character in the pattern is different from the character of the pattern that caused the mismatch. The exact definition of this table (called next in Knuth et al. [1977]) is

    next[j] = max{i | (pattern[k] = pattern[j - i + k] for k = 1, ..., i - 1) and pattern[i] != pattern[j]}

for j = 1, ..., m. This algorithm is presented in Figure 10.2.

Example 1  The next table for the pattern abracadabra is

    pattern: a  b  r  a  c  a  d  a  b  r  a
    next[j]: 0  1  1  0  2  0  2  0  1  1  0    (next[12] = 5)

When the value in the next table is zero, we have to advance one position in the text and start comparing again from the beginning of the pattern. The last value of the next table (five) is used to continue the search after a match has been found. In the worst case, the number of comparisons is 2n + O(m). Further explanation of how to preprocess the pattern in time O(m) to obtain this table can be found in the original paper or in Sedgewick (1983); the code is shown in Figure 10.3.

kmpsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int j, k, resume;
    int next[MAX_PATTERN_SIZE];

    initnext( pat, m, next );    /* Preprocess pattern */
    resume = next[m+1];
    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    j = k = 1;
    do {    /* Search */
        if( j==0 || text[k]==pat[j] ) {
            k++; j++;
        }
        else j = next[j];
        if( j > m ) {
            Report_match_at_position( k-m );
            j = resume;
        }
    } while( k <= n );
}

Figure 10.2: The Knuth-Morris-Pratt algorithm
initnext( pat, m, next ) /* Preprocess pattern of length m */
char pat[];
int m, next[];
{
    int i, j;

    i = 1;
    j = next[1] = 0;
    pat[m+1] = END_OF_STRING;
    do {
        if( j == 0 || pat[i] == pat[j] ) {
            i++; j++;
            if( pat[i] != pat[j] )
                next[i] = j;
            else
                next[i] = next[j];
        }
        else j = next[j];
    } while( i <= m );
}

Figure 10.3: Pattern preprocessing in the Knuth-Morris-Pratt algorithm

An algorithm for searching a set of strings, similar to the KMP algorithm, was developed by Aho and Corasick (1975). However, the space used and the preprocessing time to search for one string is improved in the KMP algorithm. Variations that compute the next table "on the fly" are presented by Barth (1981) and Takaoka (1986). Variations for the Aho and Corasick algorithm are presented by Bailey and Dromey (1980) and Meyer (1985).

10.5 THE BOYER-MOORE ALGORITHM

Also in 1977, the other classic algorithm was published by Boyer and Moore (1977). Their main idea is to search from right to left in the pattern. With this scheme, searching is faster than average.

The Boyer-Moore (BM) algorithm positions the pattern over the leftmost characters in the text and attempts to match it from right to left. If no mismatch occurs, then the pattern has been found. Otherwise, the algorithm computes a shift; that is, an amount by which the pattern is moved to the right before a new matching attempt is undertaken. The shift can be computed using two heuristics: the match heuristic and the occurrence heuristic. The match heuristic is obtained by noting that when the pattern is moved to the right, it must

1. match all the characters previously matched, and
2. bring a different character to the position in the text that caused the mismatch.
The last condition is mentioned in the Boyer-Moore paper (1977), but was introduced into the algorithm by Knuth et al. (1977). Following the latter reference, we call the original shift table dd, and the improved version dd-hat; the formal definitions of both may be found in Knuth et al. (1977).

Example 2  The dd-hat table for the pattern abracadabra is

    pattern:   a  b  r  a  c  a  d  a  b  r  a
    ddhat[j]: 17 16 15 14 13 12 11 13 12  4  1

The occurrence heuristic is obtained by noting that we must align the position in the text that caused the mismatch with the first character of the pattern that matches it. Formally calling this table d, we have

    d[x] = min{s | s = m or (0 <= s < m and pattern[m - s] = x)}

for every symbol x in the alphabet. The Boyer-Moore algorithm is presented in Figure 10.4; see Figure 10.5 for the code to compute both tables (i.e., dd-hat and d).

bmsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int k, j, skip;
    int dd[MAX_PATTERN_SIZE], d[MAX_ALPHABET_SIZE];

    initd( pat, m, d );      /* Preprocess the pattern */
    initdd( pat, m, dd );
    skip = dd[1] + 1;
    k = m;
    while( k <= n )    /* Search */
    {
        j = m;
        while( j>0 && text[k] == pat[j] ) {
            j--; k--;
        }
        if( j == 0 ) {
            Report_match_at_position( k+1 );
            k += skip;
        }
        else k += max( d[text[k]], dd[j] );
    }
}

Figure 10.4: The Boyer-Moore algorithm

Example 3  The d table for the pattern abracadabra is

    d['a'] = 0   d['b'] = 2   d['c'] = 6   d['d'] = 4   d['r'] = 1

and the value for any other character is 11.

Given these two shift functions, the algorithm chooses the larger one. The same shift strategy can be applied after a match, as it is for the KMP algorithm. Both shifts can be precomputed based solely on the pattern and the alphabet. Hence, the space needed is m + c + O(1). In Knuth et al. (1977) the preprocessing of the pattern is shown to be linear in the size of the pattern. However, their algorithm is incorrect. The corrected version can be found in Rytter's paper (1980); see Figure 10.5.
initd( pat, m, d ) /* Preprocess pattern of length m : d table */
char pat[];
int m, d[];
{
    int k;

    for( k=0; k<MAX_ALPHABET_SIZE; k++ ) d[k] = m;
    for( k=1; k<=m; k++ ) d[pat[k]] = m-k;
}

initdd( pat, m, dd ) /* Preprocess pattern of length m : dd hat table */
char pat[];
int m, dd[];
{
    int j, k, t, t1, q, q1;
    int f[MAX_PATTERN_SIZE+1];

    for( k=1; k<=m; k++ ) dd[k] = 2*m-k;
    for( j=m, t=m+1; j > 0; j--, t-- )    /* setup the dd hat table */
    {
        f[j] = t;
        while( t <= m && pat[j] != pat[t] ) {
            dd[t] = min( dd[t], m-j );
            t = f[t];
        }
    }
    /* Rytter's correction */
    q = t; t = m + 1 - q; q1 = 1;
    t1 = 0;
    for( j=1; j <= t; j++ ) {
        f[j] = t1;
        while( t1 >= 1 && pat[j] != pat[t1] ) t1 = f[t1];
        t1++;
    }
    while( q < m ) {
        for( k=q1; k<=q; k++ ) dd[k] = min( dd[k], m+q-k );
        q1 = q + 1;
        q = q + t - f[t];
        t = f[t];
    }
}

Figure 10.5: Preprocessing of the pattern in the Boyer-Moore algorithm

Knuth et al. (1977) have shown that, in the worst case, the number of comparisons is O(n + rm), where r is the total number of matches. Hence, this algorithm can be as bad as the naive algorithm when we have many matches, namely, (n) matches. A simpler alternative proof can be found in a paper by Guibas and Odlyzko (1980). In the best case Cn = n/m. Our simulation results agree well with the empirical and theoretical results in the original Boyer-Moore paper (1977). Some experiments in a distributed environment are presented by Moller-Nielsen and Staunstrup (1984). A variant of the BM algorithm when m is similar to n is given by Iyengar and Alia (1980).

To improve the worst case, Galil (1979) modifies the algorithm so that it remembers how many overlapping characters it can have between two successive matches. That is, we compute the length of the longest proper prefix that is also a suffix of the pattern. Then, instead of going from m to 1 in the comparison loop, the algorithm goes from m to k, where k skips the known overlap if the last event was a match, or k = 1 otherwise. This algorithm is truly linear, with a worst case of O(n + m) comparisons. However, according to empirical results, it only improves the average case for small alphabets, at the cost of using more instructions. Recently, Cole (1991) proved that the exact worst case is 3n + O(m) comparisons. Apostolico and Giancarlo (1986) improved this algorithm to a worst case of 2n - m + 1 comparisons. Boyer-Moore type algorithms to search a set of strings are presented by Commentz-Walter (1979) and Baeza-Yates and Regnier (1990).

10.5.1 The Simplified Boyer-Moore Algorithm

A simplified version of the Boyer-Moore algorithm (simplified-Boyer-Moore, or SBM, algorithm) is obtained by using only the occurrence heuristic. The main reason behind this simplification is that, in practice, patterns are not periodic. Also, the extra space needed decreases from O(m + c) to O(c). That is, the space depends only on the size of the alphabet (almost always fixed) and not on the length of the pattern (variable). Of course, the worst case is now O(mn), but the algorithm will be faster on the average. For the same reason, it does not make sense to write a simplified version that uses Galil's improvement, because we need O(m) space to compute the length of the overlapping characters.

10.5.2 The Boyer-Moore-Horspool Algorithm

Horspool (1980) presented a simplification of the Boyer-Moore algorithm, and based on empirical results showed that this simpler version is as good as the original Boyer-Moore algorithm. Moreover, the same results show that for almost all pattern lengths this algorithm is better than algorithms that use a hardware instruction to find the occurrence of a designated character.

Horspool noted that when we know that the pattern either matches or does not, any of the characters from the text can be used to address the heuristic table. Based on this, Horspool improved the SBM algorithm by addressing the occurrence table with the character in the text corresponding to the last character of the pattern. To avoid a comparison when the value in the table is zero (the last character of the pattern), we define the initial value of the entry in the occurrence table, corresponding to the last character in the pattern, as m, and then we compute the occurrence heuristic table for only the first m - 1 characters of the pattern. We call this algorithm the Boyer-Moore-Horspool, or BMH, algorithm.
Formally,

    d[x] = min{s | s = m or (1 <= s < m and pattern[m - s] = x)}

Example 4  The d table for the pattern abracadabra is

    d['a'] = 3   d['b'] = 2   d['c'] = 6   d['d'] = 4   d['r'] = 1

and the value for any other character is 11.

In this algorithm, the order of the comparisons is not relevant, as noted by Baeza-Yates (1989c) and Sunday (1990). Thus, the algorithm compares the pattern from left to right. This algorithm also includes the idea of using the character of the text that corresponds to position m + 1 of the pattern, a modification due to Sunday (1990). Further improvements are due to Hume and Sunday (1990). The code for an efficient version of the Boyer-Moore-Horspool algorithm is extremely simple and is presented in Figure 10.6, where MAX_ALPHABET_SIZE is the size of the alphabet.

bmhsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;

    for( k=0; k<MAX_ALPHABET_SIZE; k++ ) d[k] = m+1;    /* Preprocessing */
    for( k=1; k<=m; k++ ) d[pat[k]] = m+1-k;
    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;    /* To avoid having code */
                                             /* for special case n-k+1=m */
    lim = n-m+1;
    for( k=1; k <= lim; k += d[text[k+m]] )    /* Searching */
    {
        i=k;
        for( j=1; text[i] == pat[j]; j++ ) i++;
        if( j == m+1 ) Report_match_at_position( k );
    }
    /* restore pat[m+1] if necessary */
}

Figure 10.6: The Boyer-Moore-Horspool-Sunday algorithm

Based on empirical and theoretical analysis, the BMH algorithm is simpler and faster than the SBM algorithm, and is as good as the BM algorithm for alphabets of size at least 10. Also, it is not difficult to prove that the expected shift is larger for the BMH algorithm, as expected. Improvements to the BMH algorithm for searching in English text are discussed by Baeza-Yates (1989b, 1989a) and Sunday (1990). A hybrid algorithm that combines the BMH and KMP algorithms is proposed by Baeza-Yates (1989c).

Figure 10.7 shows, for the algorithms studied up to this point, the expected number of comparisons per character for random text with c = 4. The codes used are the ones given in this chapter, except that the Knuth-Morris-Pratt algorithm was implemented as suggested by their authors. (The version given here is slower but simpler.)

Figure 10.7: Expected number of comparisons for random text (c = 4)

10.6 THE SHIFT-OR ALGORITHM

The main idea is to represent the state of the search as a number; it is due to Baeza-Yates and Gonnet (1989). Each search step costs a small number of arithmetic and logical operations, provided that the numbers are large enough to represent all possible states of the search. Hence, for small patterns we have an O(n) time algorithm using O(|alphabet|) extra space and O(m + |alphabet|) preprocessing time.

The main properties of the shift-or algorithm are:

Simplicity: the preprocessing and the search are very simple, and only bitwise logical operations, shifts, and additions are used.
No buffering: the text does not need to be stored. It is worth noting that the KMP algorithm is not a real time algorithm. . The definition of the table T is file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.. update the individual states according to the new character. . or equivalently.. . where the state is 0 if the last i characters have matched. . That match ends at the current position. this sets the initial state of s1 to be 0 by default. This algorithm is based on finite automata theory. we use a vector of m different states. . if state < 2m-1. For this we use a table T that is defined by preprocessing the pattern with one entry per alphabet symbol. or 1 if not.htm (15 of 27)7/3/2004 4:20:42 PM . we must: shift the vector state 1 bit to the left to reflect that we have advanced one position in the text. Instead of trying to represent the global state of the search as previous algorithms do. We report a match if sm is 0. i of the pattern and the positions ( j . j of the text. In practice.i + 1). as in the BM algorithm. and the BM algorithm needs to buffer the text. . . as the KMP algorithm. gives the new state. and the bitwise operator or that. where j is the current position in the text. To update the state after reading a new character in the text. given the old vector state and the table value.ooks_Algorithms_Collection2ed/books/book5/chap10.Information Retrieval: CHAPTER 10: STRING SEARCHING ALGORITHMS Real time: the time delay to process one text character is bounded by a constant. We can then represent the vector state efficiently as a number in base 2 by where the si are the individual states. All these properties indicate that this algorithm is suitable for hardware implementation. . where state i tells us the state of the search between the positions 1. We use one bit to represent each individual state. and also exploits the finiteness of the alphabet. 
for every symbol x of the alphabet, where delta(C) is 0 if the condition C is true, and 1 otherwise.

Each search step is then:

    state = (state << 1) or T[curr char]

where << denotes the shift left operation, and the bitwise operator or, given the old vector state and the table value, gives the new state. The complexity of the search time in both the worst and average case is O(n * ceil(m/w)), where ceil(m/w) is the time to compute a shift or other simple operation on numbers of m bits using a word size of w bits. In practice, for small patterns (word size 32 or 64 bits), we have O(n) time for the worst and the average case.

We set up the table by preprocessing the pattern before the search. This can be done in O(m + c) time. We need mc bits of extra memory; but if the word size is at least m, only c words are needed.

Example 5: Let {a, b, c, d} be the alphabet, and ababc the pattern. The entries for table T (one digit per position in the pattern) are then:

    T[a] = 11010
    T[b] = 10101
    T[c] = 01111
    T[d] = 11111

We finish the example by searching for the first occurrence of ababc in the text abdabababc. The initial state is 11111.

    text : a     b     d     a     b     a     b     a     b     c
    T[x] : 11010 10101 11111 11010 10101 11010 10101 11010 10101 01111
    state: 11110 11101 11111 11110 11101 11010 10101 11010 10101 01111

For example, the state 10101 means that in the current position we have two partial matches to the left, of lengths two and four, respectively. The match at the end of the text is indicated by the value 0 in the leftmost bit of the state of the search.

Figure 10.8 shows an efficient implementation of this algorithm. The programming is independent of the word size insofar as possible. We use the following symbolic constants:

    MAXSYM: size of the alphabet (128 for ASCII code)
    WORD:   word size in bits (32 in our case)
    B:      number of bits per individual state (one in this case)

sosearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
register char *text;
char pat[];
int n, m;
{
    register char *end;
    register unsigned int state, lim;
    unsigned int T[MAXSYM], i, j;
    char *start;

    if( m > WORD ) Abort( "Use pat size <= word size" );

    /* Preprocessing */
    for( i=0; i<MAXSYM; i++ ) T[i] = ~0;
    for( lim=0, j=1, i=1; i<=m; lim |= j, j <<= B, i++ )
        T[pat[i]] &= ~j;
    lim = ~(lim >> B);

    /* Search */
    state = ~0;                               /* Initial state */
    end = text + n + 1;
    for( start = text, text++; text < end; text++ ) {
        state = (state << B) | T[*text];      /* Next state */
        if( state < lim )
            Report_match_at_position( text - start - m + 2 );
    }
}
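To make the mechanics of Example 5 concrete, here is a small sketch of the shift-or search in modern C. It is our illustration, not the book's Figure 10.8: the identifiers are ours, and it tests the final-state bit directly instead of using the lim trick. It builds the table T for the pattern and scans the text, reporting where the bit for s_m first becomes 0.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Shift-or search sketch: returns the 1-based position in text where the
   first occurrence of pat ends, or 0 if there is none.
   Assumes strlen(pat) does not exceed the bits of unsigned long. */
int shift_or_end(const char *text, const char *pat)
{
    unsigned long T[UCHAR_MAX + 1];
    int m = (int)strlen(pat);
    unsigned long state = ~0UL;               /* every individual state is 1 */
    unsigned long match_bit = 1UL << (m - 1); /* bit of s_m, the last state  */

    /* Preprocessing: T[x] has a 0 bit in position i iff pat[i] == x */
    for (int x = 0; x <= UCHAR_MAX; x++)
        T[x] = ~0UL;
    for (int i = 0; i < m; i++)
        T[(unsigned char)pat[i]] &= ~(1UL << i);

    /* Search: one shift and one or per text character */
    for (int j = 0; text[j] != '\0'; j++) {
        state = (state << 1) | T[(unsigned char)text[j]];
        if ((state & match_bit) == 0)         /* s_m is 0: report a match */
            return j + 1;
    }
    return 0;
}
```

With the pattern ababc of Example 5, the table entries come out exactly as listed there (T[a] = 11010, and so on), and the occurrence in abdabababc is reported at position 10, where the trace reaches state 01111.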
Figure 10.8: Shift-Or algorithm for string matching (simpler version)

The changes needed for a more efficient implementation of the algorithm (that is, scan the text until we see the first character of the pattern) are shown in Figure 10.9.* The speed of this version depends on the frequency of the first letter of the pattern in the text.

    /* Search */
    initial = ~0;
    first = pat[1];
    do {
        while( text < end && *text != first )     /* Scan */
            text++;
        state = initial;
        do {
            state = (state << B) | T[*text];      /* Next state */
            if( state < lim )
                Report_match_at_position( text - start - m + 2 );
            text++;
        } while( state != initial );
    } while( text < end );

Figure 10.9: Shift-Or algorithm for string matching

*Based on the implementation of Knuth, Morris, and Pratt (1977).

The empirical results for this code are shown in Figures 10.11 and 10.12. Another implementation is possible using the bitwise operator and instead of the or operation, and complementing the value of T[x] for all x.

By just changing the definition of table T, we can search for patterns such that every pattern position is: a set of characters (for example, match a vowel), a "don't care" symbol (match any character), or the complement of a set of characters, using exactly the same searching time (see Baeza-Yates and Gonnet [1989]). Furthermore, by using large alphabets, we can have "don't care" symbols in the text, as shown in Gonnet and Baeza-Yates (1990). This idea has been recently extended to string searching with errors and other variants by Wu and Manber (1991).

10.7 THE KARP-RABIN ALGORITHM

A different approach to string searching is to use hashing techniques, as suggested by Harrison (1971). All that is necessary is to compute the signature function of each possible m-character substring in the text and check if it is equal to the signature function of the pattern.

Karp and Rabin (1987) found an easy way to compute these signature functions efficiently for the signature function h(k) = k mod q, where q is a large prime. Their method is based on computing the signature function for position i given the value for position i - 1. The signature function represents a string as a base-d number, where d = c is the number of possible characters. To obtain the signature value of the next position, only a constant number of operations is needed. By using a power of 2 for d (d >= c), the multiplications by d can be computed as shifts.

The algorithm requires time proportional to n + m in almost all cases, without using extra space. Note that this algorithm finds the positions in the text that have the same signature value as the pattern. To ensure that there is a match, we must make a direct comparison of the substring with the pattern. This algorithm is probabilistic, but using a large value for q makes collisions unlikely [the probability of a random collision is O(1/q)]. Theoretically, this algorithm may still require mn steps in the worst case, if we check each potential match and have too many matches or collisions.

The prime q is chosen as large as possible, such that (d + 1)q does not cause overflow. We also impose the condition that d is a primitive root mod q. This implies that the signature function has maximal cycle; that is, the period of the signature function is much larger than m for any practical case. In our empirical results we observed only 3 collisions in 10^7 computations of the signature function. The code for the case d = 128 (ASCII) and q = 16647133 with a word size of 32 bits, based on Sedgewick's exposition (1983), is given in Figure 10.10 (D = log2 d and Q = q).

rksearch( text, pat, n, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;                                     /* 0 < m <= n */
{
    int h1, h2, dM, i, j;

    dM = 1;
    for( i=1; i<m; i++ )
        dM = (dM << D) % Q;
    h1 = h2 = 0;
    for( i=1; i<=m; i++ ) {                   /* Compute the signature */
        h1 = ((h1 << D) + pat[i] ) % Q;       /* of the pattern and of */
        h2 = ((h2 << D) + text[i] ) % Q;      /* the beginning of the  */
    }                                         /* text */

    for( i = 1; i <= n-m+1; i++ ) {           /* Search */
        if( h1 == h2 ) {                      /* Potential match */
            for( j=1; j<=m && text[i-1+j] == pat[j]; j++ )   /* check */
                ;
            if( j > m )                       /* true match */
                Report_match_at_position( i );
        }
        h2 = (h2 + (Q << D) - text[i]*dM ) % Q;   /* update the signature */
        h2 = ((h2 << D) + text[i+m] ) % Q;        /* value of the text    */
    }
}

Figure 10.10: The Karp-Rabin algorithm

In practice, this algorithm is slow due to the multiplications and the modulus operations. However, it becomes competitive for long patterns. We can avoid the computation of the modulus function at every step by using the implicit modular arithmetic given by the hardware. That is, we use the maximum value of an integer (determined by the word size) for q, and overflow is ignored. The value of d is selected such that d^k mod 2^r has maximal cycle length (a cycle of length 2^(r-2)), where r is the size, in bits, of a word. For example, for r from 8 to 64, an adequate value for d is 31. With these changes, the evaluation of the signature at every step (see Figure 10.10) is

    h2 = h2*D - text[i]*dM + text[i+m];

In this way, we use two multiplications instead of one multiplication and two modulus operations.

10.8 CONCLUSIONS

We have presented the most important string searching algorithms. Based on the empirical results, it is clear that Horspool's variant is the best known algorithm for almost all pattern lengths and alphabet sizes. Figure 10.11 shows the execution time of searching 1000 random patterns in random text for all the algorithms considered, with c = 30. The results for the Karp-Rabin algorithm are not included because in all cases the time exceeds 300 seconds. For the shift-or algorithm, the given results are for the efficient version. Figure 10.12 shows the same empirical results as Figure 10.11, but for English text instead of random text. The results are similar: the Horspool version of the Boyer-Moore algorithm is the best algorithm, according to execution time, for almost all pattern lengths.

However, the main drawback of the Boyer-Moore type algorithms is the preprocessing time and the space required, which depends on the alphabet size and/or the pattern size. For this reason, if the pattern is small (1 to 3 characters long), it is better to use the naive algorithm. If the alphabet size is large, then the Knuth-Morris-Pratt algorithm is a good choice. In all the other cases, in particular for long texts, the Boyer-Moore algorithm is better.
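As an aside, the overflow-based variant of the Karp-Rabin signature described in section 10.7 can be sketched as follows. This is our illustration, not the book's Figure 10.10: the mod 2^r is supplied implicitly by unsigned wraparound, the base is d = 31 as suggested in the text, and every signature match is verified by a direct comparison.

```c
#include <assert.h>
#include <string.h>

#define D 31u   /* base from the text; q = 2^r is implicit in unsigned
                   wraparound, so no % operations are needed */

/* Karp-Rabin sketch: returns the 1-based position where the first
   occurrence of pat starts in text, or 0 if there is none. */
int rk_search(const char *text, const char *pat)
{
    int n = (int)strlen(text), m = (int)strlen(pat);
    unsigned h1 = 0, h2 = 0, dM = 1;

    if (m == 0 || m > n) return 0;
    for (int i = 0; i < m - 1; i++) dM *= D;   /* dM = D^(m-1) mod 2^r */
    for (int i = 0; i < m; i++) {              /* initial signatures */
        h1 = h1 * D + (unsigned char)pat[i];
        h2 = h2 * D + (unsigned char)text[i];
    }
    for (int i = 0; ; i++) {
        /* potential match: confirm it, since signatures may collide */
        if (h1 == h2 && memcmp(text + i, pat, (size_t)m) == 0)
            return i + 1;
        if (i + m >= n) return 0;
        /* constant-time update: drop text[i], append text[i+m] */
        h2 = (h2 - (unsigned char)text[i] * dM) * D
             + (unsigned char)text[i + m];
    }
}
```

Note that the per-step update here costs two multiplications and no modulus operations, exactly the trade made by the improved version discussed above.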
The shift-or algorithm has a running time similar to that of the KMP algorithm; only the preprocessing is different. However, the main advantage of this algorithm is that we can search for more general patterns ("don't care" symbols, complement of a character, etc.) using exactly the same searching time (see Baeza-Yates and Gonnet [1989]).

The linear time worst case algorithms presented in previous sections are optimal in the worst case with respect to the number of comparisons (see Rivest [1977]). However, they are not space optimal in the worst case, because they use space that depends on the size of the pattern, the size of the alphabet, or both. Galil and Seiferas (1980, 1983) show that it is possible to have linear time worst case algorithms using constant space. (See also Slisenko [1980, 1983].) They also show that the delay between reading two characters of the text is bounded by a constant, which is interesting for any real time searching algorithm (Galil 1981). Practical algorithms that achieve optimal worst case time and space are presented by Crochemore and Perrin (1988, 1989). (See also Berkman et al. [1989] and Kedem et al. [1989].)

If we allow preprocessing of the text, we can search a string in worst case time proportional to its length. This is achieved by using a Patricia tree (Morrison 1968) as an index. This solution needs O(n) extra space and O(n) preprocessing time, where n is the size of the text. See also Weiner (1973), McCreight (1976), Majster and Reiser (1980), Blumer et al. (1985, 1987), Gonnet (1983), and Kemp et al. (1987). For other kinds of indices for text, see Faloutsos (1985). For further references and problems, see Gonnet and Baeza-Yates (1991).

Figure 10.11: Simulation results for all the algorithms in random text (c = 30)

Figure 10.12: Simulation results for all the algorithms in English text

Many of the algorithms presented may be implemented with hardware (Haskin 1980; Hollaar 1979); for example, Aho and Corasick machines (see Aoe et al. [1985]; Cheng and Fu [1987]; and Wakabayashi et al. [1985]). Optimal parallel algorithms for string matching are presented by Galil (1985) and by Vishkin (1985).

REFERENCES

AHO, A. 1980. "Pattern Matching in Strings," in Formal Language Theory: Perspectives and Open Problems, ed. R. Book, pp. 325-47. London: Academic Press.

AHO, A., and M. CORASICK. 1975. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, 18, 333-40.
AOE, J., Y. YAMAMOTO, and R. SHIMADA. 1985. "An Efficient Implementation of Static String Pattern Matching Machines," in IEEE Int. Conf. on Supercomputing Systems, St. Petersburg, Fla., 491-98.

APOSTOLICO, A., and R. GIANCARLO. 1986. "The Boyer-Moore-Galil String Searching Strategies Revisited." SIAM J on Computing, 15, 98-105.

BAEZA-YATES, R. 1989a. Efficient Text Searching. Ph.D. thesis, Dept. of Computer Science, University of Waterloo. Also as Research Report CS-89-17.

BAEZA-YATES, R. 1989b. "Improved String Searching." Software-Practice and Experience, 19(3), 257-71.

BAEZA-YATES, R. 1989c. "String Searching Algorithms Revisited," in Workshop in Algorithms and Data Structures, Ottawa, Canada, eds. F. Dehne, J.-R. Sack, and N. Santoro, Springer Verlag Lecture Notes on Computer Science 382, 75-96.

BAEZA-YATES, R., and G. GONNET. 1989. "A New Approach to Text Searching," in Proc. of 12th ACM SIGIR, Cambridge, Mass., 168-75. (To appear in Communications of the ACM.)

BAEZA-YATES, R., and M. REGNIER. 1990. "Fast Algorithms for Two Dimensional and Multiple Pattern Matching," in 2nd Scandinavian Workshop in Algorithmic Theory, SWAT'90, eds. R. Karlsson and J. Gilbert, Bergen, Norway: Springer-Verlag Lecture Notes in Computer Science 447, 332-47.

BAILEY, T., and R. DROMEY. 1980. "Fast String Searching by Finding Subkeys in Subtext." Inf. Proc. Letters, 11, 130-33.

BARTH, G. 1981. "An Alternative for the Implementation of Knuth-Morris-Pratt Algorithm." Inf. Proc. Letters, 13, 134-37.

BERKMAN, O., D. BRESLAUER, Z. GALIL, B. SCHIEBER, and U. VISHKIN. 1989. "Highly Parallelizable Problems," in Proc. 20th ACM Symp. on Theory of Computing, Seattle, Washington, 309-19.

BLUMER, A., J. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, M. CHEN, and J. SEIFERAS. 1985. "The Smallest Automaton Recognizing the Subwords of a Text." Theoretical Computer Science, 40, 31-55.

BLUMER, A., J. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, and R. MCCONNELL. 1987. "Complete Inverted Files for Efficient Text Retrieval and Analysis." JACM, 34, 578-95.

BOYER, R., and S. MOORE. 1977. "A Fast String Searching Algorithm." CACM, 20, 762-72.

CHENG, H., and K. FU. 1987. "VLSI Architectures for String Matching and Pattern Matching." Pattern Recognition, 20, 125-41.

COLE, R. 1991. "Tight Bounds on the Complexity of the Boyer-Moore String Matching Algorithm," in Proc. 2nd Symp. on Discrete Algorithms, SIAM, San Francisco, Cal., 224-33.

COMMENTZ-WALTER, B. 1979. "A String Matching Algorithm Fast on the Average," in ICALP, Lecture Notes in Computer Science 71, Springer-Verlag, 118-32.

CROCHEMORE, M. 1988. "String Matching with Constraints," in Mathematical Foundations of Computer Science, Carlsbad, Czechoslovakia, Lecture Notes in Computer Science 324, Springer-Verlag.

CROCHEMORE, M., and D. PERRIN. 1989. "Two Way Pattern Matching." Technical Report 98-8, University Paris 7 (submitted for publication).

DAVIES, G., and S. BOWSHER. 1986. "Algorithms for Pattern Matching." Software-Practice and Experience, 16, 575-601.

FALOUTSOS, C. 1985. "Access Methods for Text." ACM C. Surveys, 17, 49-74.

GALIL, Z. 1979. "On Improving the Worst Case Running Time of the Boyer-Moore String Matching Algorithm." CACM, 22, 505-08.

GALIL, Z. 1981. "String Matching in Real Time." JACM, 28, 134-49.

GALIL, Z. 1985. "Optimal Parallel Algorithms for String Matching." Information and Control, 67, 144-57.

GALIL, Z., and J. SEIFERAS. 1980. "Saving Space in Fast String-Matching." SIAM J on Computing, 9, 417-38.

GALIL, Z., and J. SEIFERAS. 1983. "Time-Space-Optimal String Matching." JCSS, 26, 280-94.

GONNET, G. 1983. "Unstructured Data Bases or Very Efficient Text Searching," in ACM PODS, vol. 2, Atlanta, Ga., 117-24.

GONNET, G., and R. BAEZA-YATES. 1990. "An Analysis of the Karp-Rabin String Matching Algorithm." Information Processing Letters, 34, 271-74.

GONNET, G., and R. BAEZA-YATES. 1991. Handbook of Algorithms and Data Structures, 2nd ed. Addison-Wesley.

GUIBAS, L., and A. ODLYZKO. 1980. "A New Proof of the Linearity of the Boyer-Moore String Searching Algorithm." SIAM J on Computing, 9, 672-82.

HARRISON, M. 1971. "Implementation of the Substring Test by Hashing." CACM, 14, 777-79.

HASKIN, R. 1980. "Hardware for Searching Very Large Text Databases," in Workshop Computer Architecture for Non-Numeric Processing, California, 49-56.

HOLLAAR, L. 1979. "Text Retrieval Computers." IEEE Computer, 12, 40-50.

HORSPOOL, R. 1980. "Practical Fast Searching in Strings." Software-Practice and Experience, 10, 501-06.

HUME, A., and D. SUNDAY. 1991. "Fast String Searching." AT&T Bell Labs Computing Science Technical Report No. 156. To appear in Software-Practice and Experience.

IYENGAR, S., and V. ALIA. 1980. "A String Search Algorithm." Appl. Math. Comput., 6, 123-31.

KARP, R., and M. RABIN. 1987. "Efficient Randomized Pattern-Matching Algorithms." IBM J Res. Development, 31, 249-60.

KEDEM, Z., G. LANDAU, and K. PALEM. 1989. "Optimal Parallel Suffix-Prefix Matching Algorithm and Applications," in SPAA'89, Santa Fe, New Mexico.

KEMP, M., R. BAYER, and U. GUNTZER. 1987. "Time Optimal Left to Right Construction of Position Trees." Acta Informatica, 24, 461-74.

KERNIGHAN, B., and D. RITCHIE. 1978. The C Programming Language. Englewood Cliffs, N.J.: Prentice Hall.

KNUTH, D., J. MORRIS, and V. PRATT. 1977. "Fast Pattern Matching in Strings." SIAM J on Computing, 6, 323-50.

MAJSTER, M., and A. REISER. 1980. "Efficient On-Line Construction and Correction of Position Trees." SIAM J on Computing, 9, 785-807.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." JACM, 23, 262-72.

MEYER, B. 1985. "Incremental String Matching." Inf. Proc. Letters, 21, 219-27.

MOLLER-NIELSEN, P., and J. STAUNSTRUP. 1984. "Experiments with a Fast String Searching Algorithm." Inf. Proc. Letters, 18, 129-35.

MORRIS, J., and V. PRATT. 1970. "A Linear Pattern Matching Algorithm." Technical Report 40, Computing Center, University of California, Berkeley.

MORRISON, D. 1968. "PATRICIA-Practical Algorithm to Retrieve Information Coded in Alphanumeric." JACM, 15, 514-34.

RIVEST, R. 1977. "On the Worst-Case Behavior of String-Searching Algorithms." SIAM J on Computing, 6, 669-74.

RYTTER, W. 1980. "A Correct Preprocessing Algorithm for Boyer-Moore String-Searching." SIAM J on Computing, 9, 509-12.

SCHABACK, R. 1988. "On the Expected Sublinearity of the Boyer-Moore Algorithm." SIAM J on Computing, 17, 648-58.

SEDGEWICK, R. 1983. Algorithms. Reading, Mass.: Addison-Wesley.

SLISENKO, A. 1980. "Determination in Real Time of All the Periodicities in a Word." Sov. Math. Dokl., 21, 392-95.

SLISENKO, A. 1983. "Detection of Periodicities and String-Matching in Real Time." J. Sov. Math., 22, 1316-86.

SMIT, G. 1982. "A Comparison of Three String Matching Algorithms." Software-Practice and Experience, 12, 57-66.

SUNDAY, D. 1990. "A Very Fast Substring Search Algorithm." Communications of the ACM, 33(8), 132-42.

TAKAOKA, T. 1986. "An On-Line Pattern Matching Algorithm." Inf. Proc. Letters, 22, 329-30.

VISHKIN, U. 1985. "Optimal Parallel Pattern Matching in Strings." Information and Control, 67, 91-113.

WAKABAYASHI, S., T. KIKUNO, and N. YOSHIDA. 1985. "Design of Hardware Algorithms by Recurrence Relations." Systems and Computers in Japan, 8, 10-17.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in FOCS, vol. 14, 1-11.

WU, S., and U. MANBER. 1991. "Fast Text Searching With Errors." Technical Report TR-91-11, Department of Computer Science, University of Arizona. To appear in Proceedings of USENIX Winter 1992 Conference, San Francisco, January 1992.
Information Retrieval: CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY
CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY MODIFICATION TECHNIQUES
Donna Harman
National Institute of Standards and Technology

Abstract

This chapter presents a survey of relevance feedback techniques that have been used in past research, recommends various query modification approaches for use in different retrieval systems, and gives some guidelines for the efficient design of the relevance feedback component of a retrieval system.
11.1 INTRODUCTION
Even the best of the information retrieval systems have a limited recall; users may retrieve a few relevant documents in response to their queries, but almost never all the relevant documents. In many cases this is not important to users, but in those cases where high recall is critical, users seldom have many ways to retrieve more relevant documents. As a first choice they can "expand" their search by broadening a narrow Boolean query or by looking further down a ranked list of retrieved documents. Often this is wasted effort: a broad Boolean search pulls in too many unrelated documents and the tail of a ranked list of documents contains documents matching mostly less discriminating query terms. The second choice for these users is to modify the original query. Very often, however, this becomes a random operation in that users have already made their best effort at a statement of the problem in the original query and are uncertain as to what modification(s) may be useful. Users often input queries containing terms that do not match the terms used to index the majority of the relevant documents (either controlled or full text indexing) and almost always some of the unretrieved relevant documents are indexed by a different set of terms than those in the query or in most of the other relevant documents. This problem has long been recognized as a major difficulty in information retrieval systems (Lancaster 1969). More recently, van Rijsbergen (1986) spoke of the limits of providing increasingly better ranked results based solely on the initial query, and indicated a need to modify that query to further increase performance. The vast majority of research in the area of relevance feedback and automatic query modification has dealt with a ranking model of retrieval systems, although some of this work has been adapted for the Boolean model. 
This chapter will deal almost exclusively with feedback and query modification based on the ranking model, and readers are referred to a special issue of Information Processing and Management (Radecki 1988) for discussion of relevance feedback and query modification in Boolean
models. Because relevance feedback is strongly related to ranking, readers unfamiliar with basic ranking models should read Chapter 14 before this chapter. Two components of relevance feedback have evolved in research. First, extensive work has been done in the reweighting of query terms based on the distribution of these terms in the relevant and nonrelevant documents retrieved in response to those queries. This work forms the basis of the probabilistic model for ranking (see Chapter 14 on Ranking Algorithms for a description of this model). Specifically it has been shown that query terms appearing in relevant documents should have increased term weights for later searching (and conversely terms in nonrelevant documents should have decreased term weights). The details of this reweighting appear later in this chapter. Note that this is a very similar idea to traditional feedback in cybernetics or in biologically based systems such as neural networks. A second component of relevance feedback or query modification is based on changing the actual terms in the query. Whereas query reweighting will increase the ranks of unretrieved documents that contain the reweighted terms, it provides no help for unretrieved relevant documents containing terms that do not match the original query terms. Various methods of query expansion have been tried and these methods are discussed later in the chapter. This chapter on relevance feedback is organized as follows. Section 11.2 is a discussion of the research in the area of relevance feedback and query modification, starting with early research in the SMART environment, discussing problems in evaluation, and then reviewing research using the two different components of relevance feedback: the reweighting of query terms and the addition of new query terms. The final part of section 11.2 describes research experiments involving other types of relevance feedback. 
Section 11.3 reviews the use of relevance feedback and query modification in operational retrieval systems, mostly in experimental prototypes. Section 11.4 contains recommendations for the use of relevance feedback in Boolean systems, in systems based on vector space search models, and in systems based on probabilistic indexing or ad hoc combinations of term-weighting schemes. Section 11.5 presents some thoughts on constructing efficient relevance feedback operations, an area that has received little actual experimental work, and section 11.6 summarizes the chapter.
11.2 RESEARCH IN RELEVANCE FEEDBACK AND QUERY MODIFICATION
11.2.1 Early Research
In their forward-looking paper published in 1960, Maron and Kuhns mentioned query modification, suggesting that terms closely related to the original query terms can be added to the query and thus retrieve more relevant documents. Relevance feedback was the subject of much experimentation in the early SMART system (Salton 1971). In 1965 Rocchio published (republished as Rocchio 1971) some experiments in query modification that combined term reweighting and query expansion. Based on what
is now known as the vector space model (for more on the vector space model, see Chapter 14 on Ranking Algorithms), he defined the modified query to be
    Q1 = Q0 + (1/n1) * (R1 + ... + Rn1) - (1/n2) * (S1 + ... + Sn2)

where

    Q0 = the vector for the initial query
    Ri = the vector for relevant document i
    Si = the vector for nonrelevant document i
    n1 = the number of relevant documents
    n2 = the number of nonrelevant documents
Q1 therefore is the vector sum of the original query plus the vectors of the relevant and nonrelevant documents. Using this as the basic relation, he suggested possible additional constraints such as extra weighting for the original query, or limits to the amount of feedback from nonrelevant documents. He ran a variation of the basic formula on a very small test collection, constraining the feedback by only allowing terms to be in Q1 if they either were in Q0 or occurred in at least half the relevant documents and in more relevant than nonrelevant documents. The results of these experiments were very positive. In 1969 Ide published a thesis based on a series of experiments extending Rocchio's work, again using the SMART system (Ide 1971). She not only verified the positive results of Rocchio, but developed three particular strategies that seemed to be the most successful. The first was the basic Rocchio formula, minus the normalization for the number of relevant and nonrelevant documents. The second strategy was similar, but allowed only feedback from relevant documents, and the third strategy (Ide dec hi) allowed limited negative feedback from only the highest-ranked nonrelevant document. She found no major significant difference between performance of the three strategies on average, but found that the relevant only strategies worked best for some queries, and other queries did better using negative feedback in addition. Although the early SMART work was done with small collections, it provided clear evidence of the potential power of this technique. Additionally it defined some of the major problems in the area of evaluation (Hall and Weiderman 1967; Ide 1969). These are discussed in the next section because the evaluation problem has continued to plague feedback research, with the use of inappropriate evaluation
procedures often indicating performance gains using feedback that are unrealistic.
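The Rocchio update at the heart of these early experiments can be sketched in code as follows. This is our illustration over a toy four-term vocabulary, with identifiers of our choosing; the clipping of negative weights to zero is a common practical constraint, not part of Rocchio's formula.

```c
#include <assert.h>

#define NTERMS 4   /* toy vocabulary size for the illustration */

/* Rocchio query modification sketch:
   q1 = q0 + (1/n1) * sum of relevant document vectors
           - (1/n2) * sum of nonrelevant document vectors,
   with negative weights clipped to 0. */
void rocchio(double q1[NTERMS], const double q0[NTERMS],
             int n1, double rel[][NTERMS],
             int n2, double nonrel[][NTERMS])
{
    for (int t = 0; t < NTERMS; t++) {
        double pos = 0.0, neg = 0.0;
        for (int i = 0; i < n1; i++) pos += rel[i][t];
        for (int i = 0; i < n2; i++) neg += nonrel[i][t];
        q1[t] = q0[t] + (n1 ? pos / n1 : 0.0) - (n2 ? neg / n2 : 0.0);
        if (q1[t] < 0.0) q1[t] = 0.0;   /* no negative query weights */
    }
}

/* Worked example: a one-term query, two relevant documents and one
   nonrelevant document; returns component t of the modified query. */
double rocchio_demo(int t)
{
    double q0[NTERMS]        = {1, 0, 0, 0};
    double rel[2][NTERMS]    = {{1, 1, 0, 0}, {0, 1, 0, 0}};
    double nonrel[1][NTERMS] = {{0, 0, 1, 0}};
    double q1[NTERMS];

    rocchio(q1, q0, 2, rel, 1, nonrel);
    return q1[t];
}
```

In the worked example, the original query term is reinforced, a term that occurs only in the relevant documents is added to the query (query expansion), and a term seen only in the nonrelevant document is clipped to zero. Ide's "relevant only" strategy corresponds to passing n2 = 0, and her "dec hi" strategy to passing only the highest-ranked nonrelevant document.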
11.2.2 Evaluation of Relevance Feedback
Standard evaluation in information retrieval (Salton and McGill 1983) compares recall-precision figures generated from averaging the performance in individual queries and comparing this averaged performance across different retrieval techniques. If this evaluation method is used in a simplistic manner in comparing the results after one iteration of feedback against those using no feedback, the results generally show spectacular improvement. Unfortunately, a significant part of this improvement results from the relevant documents used to reweight the query terms moving to higher ranks (e.g., documents initially ranked at 1, 4, and 8 moving to ranks 1, 2, and 4). Not only is this an unrealistic evaluation because presumably the user has already seen these documents, but it also masks any real improvement in relevant document ranks below those initially shown the user. There were several more realistic methods of evaluation tried (Salton 1970) but a de facto standard (the residual collection method) has since been used for most research. In this method the initial run is made and the user is shown the top x documents. These documents are then used for relevance feedback purposes. The evaluation of the results compares only the residual collections, that is, the initial run is remade minus the documents previously shown the user and this is compared with the feedback run minus the same documents. This method provides an unbiased and more realistic evaluation of feedback. However, because highly ranked relevant documents have been removed from the residual collection, the recall-precision figures are generally lower than those for standard evaluation methods and cannot be directly compared with performance as measured by the standard evaluation method.
11.2.3 Research in Term Reweighting without Query Expansion
The probabilistic model proposed by Robertson and Sparck Jones (1976) is based on the distribution of query terms in relevant and nonrelevant documents (for more on the probabilistic model, see Chapter 14). They developed a formula for query term-weighting that relies entirely on this distribution.
    Wij = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

where

Wij = the term weight for term i in query j
r = the number of relevant documents for query j having term i
R = the total number of relevant documents for query j
n = the number of documents in the collection having term i
N = the number of documents in the collection

Note that this weighting method assumes that all relevant documents are known before a query is submitted, a situation not in itself realistic, but suggestive of a method of relevance feedback after some knowledge is gained about relevance (especially for use in SDI services). Robertson and Sparck Jones used this formula to get significantly better retrieval performance for the manually indexed Cranfield 1400 collection, employing an acceptable alternative method of evaluation (the test and control method). Sparck Jones (1979a) extended this experiment to a larger collection and again showed significant improvements. She also performed an experiment (1979b) to simulate the use of this relevance weighting formula in an operational relevance feedback situation in which a user sees only a few relevant documents in the initial set of retrieved documents, and those few documents are the only ones available to the weighting scheme. These experiments required adding a constant to the above formula to handle situations in which query terms appeared in none of the retrieved relevant documents. The results from this reweighting with only a few relevant documents still showed significant performance improvements over weighting using only the IDF measure (Sparck Jones 1972; see Chapter 14 on Ranking Algorithms for a definition of IDF), indicating that the probabilistic weighting schemes provide a useful method for relevance feedback, especially in the area of term reweighting.

Croft (1983) extended this weighting scheme both by suggesting effective initial search methods (Croft and Harper 1979) using probabilistic indexing and by adapting the weighting using the probabilistic model to handle within-document frequency weights.
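As a sketch, the Robertson-Sparck Jones weight can be computed directly from the four counts. The function name is illustrative, and the additive constant k here stands in for the adjustment Sparck Jones added for terms appearing in no retrieved relevant documents:

```python
from math import log

def rsj_weight(r, R, n, N, k=0.5):
    """Robertson-Sparck Jones relevance weight for one query term.
    r: relevant documents containing the term, R: relevant documents,
    n: collection documents containing the term, N: collection size.
    The additive constant k keeps the weight defined when the term
    occurs in none (or all) of the known relevant documents."""
    return log(((r + k) / (R - r + k)) /
               ((n - r + k) / (N - n - R + r + k)))
```

A term concentrated in the relevant set gets a large positive weight; a term common in the collection but absent from the relevant documents is pushed negative.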
For the initial search the weight takes the form

    Wijk = ( C + IDFi ) * fik

and, once relevance information is available,

    Wijk = ( C + log [ pij (1 - qij) / ( qij (1 - pij) ) ] ) * fik

where

Wijk = the term weight for term i in query j and document k
IDFi = the IDF weight for term i in the entire collection
pij = the probability that term i is assigned within the set of relevant documents for query j
qij = the probability that term i is assigned within the set of nonrelevant documents for query j
r = the number of relevant documents for query j having term i
R = the total number of relevant documents for query j
n = the number of documents in the collection having term i
N = the number of documents in the collection

with pij and qij estimated from r, R, n, and N, and fik the normalized within-document frequency defined below.
    fik = K + (1 - K) * ( freqik / maxfreqk )

where

freqik = the frequency of term i in document k
maxfreqk = the maximum frequency of any term in document k

Two constants, C and K, allow this scheme to be adjusted to handle different types of data. Croft and Harper (1979) found that for the manually indexed Cranfield collection, C needed to dominate the initial search because the mere assignment of an index term to a document implied the importance of that term to that document. For automatically indexed collections, the importance of a term is better measured by the IDF alone, with C set to 0 for the initial search. Croft (1983) found that the best value for K was 0.3 for the automatically indexed Cranfield collection, and 0.5 for the NPL collection, confirming that within-document term frequency plays a much smaller role in the NPL collection with its short documents having few repeating terms. When K is used in the subsequent feedback runs, the optimum value increases to 0.5 for Cranfield and 0.7 for NPL. This could reflect the improved weighting for term importance within an entire collection using the relevance weighting instead of the IDF, requiring a lesser role for within-document frequency weighting.

The results of this new weighting scheme with optimal C and K settings showed improvements in the automatically indexed Cranfield collection of up to 35 percent over weighting using only the term distribution in the initial set of relevant and nonrelevant documents, although this improvement was somewhat less using the NPL collection where documents are shorter.

It is interesting to note the parallel between the best methods of initial searching and the results of adding within-document frequency weighting to the probabilistic weighting. In the initial searching it was shown (see Chapter 14) that it is important in most situations to weight the terms both by a measure of their importance in a collection (such as using the IDF measure) and by a measure of their importance
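The within-document component and its combination with a global weight can be sketched as follows (function names are illustrative; the multiplicative combination assumes the form described above):

```python
def ndf(freq_ik, maxfreq_k, K):
    # normalized within-document frequency: K acts as a floor, so even
    # a single occurrence contributes; K near 1 mutes frequency entirely
    return K + (1.0 - K) * (freq_ik / maxfreq_k)

def combined_weight(global_weight, freq_ik, maxfreq_k, C=0.0, K=0.3):
    # global component (IDF or relevance weight, plus the constant C)
    # scaled by the within-document frequency component
    return (C + global_weight) * ndf(freq_ik, maxfreq_k, K)
```

With K = 0.3 a term at the document's maximum frequency gets the full global weight while a single occurrence still contributes 30 percent of it; raising K, as for the short NPL documents, flattens the frequency effect.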
within a given document. Harman (1986) suggested that these two measures are complementary and that combining them provided improvements roughly analogous to adding the improvements found by using each measure separately. Probabilistic weighting affects only the importance of a term within a collection for a given query; the importance of that term within a given document can only be measured by some function of its frequency within that document, and therefore it should be expected that including both weighting components produces the best results. Other models of probabilistic indexing have been proposed (or re-evaluated) recently; in particular see Fuhr (1989) and Wong and Yao (1990).
11.2.4 Query Expansion without Term Reweighting
If a query has retrieved no relevant documents or if the terms in the query do not match the terms in the relevant documents not yet retrieved, it becomes critical to expand the query. The early SMART experiments both expanded the query and reweighted the query terms by adding the vectors of the relevant and nonrelevant documents. However, it is possible to expand a query without reweighting terms. Ideally, query expansion should be done using a thesaurus that adds synonyms, broader terms, and other appropriate words. The manually constructed thesaurus needed for this is seldom available, however, and many attempts have been made to automatically create one. Most of these involve term-term associations or clustering techniques. As these techniques are discussed in Chapter 16 on clustering, only the results will be briefly described here. Lesk (1969) tried several variations of term-term clustering using the SMART system and had little success. The fact that term co-occurrence is used to generate these thesauri implies that most terms entering the thesaurus occur in the same documents as the seed or initial term (especially in small collections), and this causes the same set of documents to be retrieved using these related terms, only with a higher rank. Not only does this cause little overall improvement, but that improvement is only in the precision of the search rather than in increasing recall. Sparck Jones and Barber (1971) also tried term-term clustering. Again they did not get much improvement and found that it was critical to limit the query expansion to only those terms strongly connected to initial query terms, with no expansion of high-frequency terms, to avoid any degradation in performance. These results were confirmed by Minker, Wilson, and Zimmerman (1972) who found no significant improvement in relevant document retrieval, with degradations in performance often occurring. 
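The raw statistic behind these term-term association methods is a co-occurrence count over documents; a minimal sketch (illustrative, document-level co-occurrence only):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(docs):
    """Count, for each unordered term pair, the number of documents in
    which both terms occur -- the basic statistic behind term-term
    association and clustering thesauri."""
    co = defaultdict(int)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            co[(a, b)] += 1
    return co
```

The failure mode Lesk observed is visible here: a term's strongest associates are the terms from the same documents, so adding them mostly re-retrieves the same document set.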
Harman (1988) used a variation of term-term association (the nearest neighbor technique) and found similar results, with no improvement and with less degradation when adding only the top connected terms and no high-frequency terms. She suggested that users could "filter" the new query terms, that is, show users a list of suggested additional query terms and ask them to select appropriate ones. In a
simulation assuming a perfect user selection, however, the improvements were only 8.7 percent using the automatically indexed Cranfield collection. By comparison, applying a similar user filtering process with a selection of terms from relevant documents provided an improvement of over 16 percent, almost twice that for term-term clustering techniques.

Because the area of automatic thesaurus building is so intuitively appealing, work still continues. Crouch (1988) presented preliminary results of another term-term correlation study, and Guntzer et al. (1989) are currently working on a semi-automatic method to construct a thesaurus based on actual term usage in queries. Two additional promising methods involve the use of highly structured networks. Belew (1989) described a rich connectionist representation which was modified based on retrieval experience, and Dumais (1990) used an elaborate factor analysis method called latent semantic indexing to expand queries.

It is also possible to provide query expansion using terms from the relevant documents without term reweighting. As mentioned earlier, Harman (1988) produced lists of terms from relevant documents using several methods of ranking those terms. These terms were then automatically added to the query. She found significant differences between the ranking methods, with techniques involving a combination of the normalized noise measure (similar to the IDF; see Chapter 14 on Ranking Algorithms) with the total frequency of the term within the set of relevant documents outperforming simple sorts based only on the number of relevant documents containing the term. She also found that adding too many terms from a sorted list decreased performance, with the best performance for the Cranfield collection occurring after adding about 20 terms, and slightly worse performance both for more than 20 terms and for only 10 terms.
As a third part of the experiment, it was shown that using a simulated perfect user selection from the top 20 terms produced a 31 percent improvement over methods with no user selection. Several on-line retrieval systems have also used this approach of showing users sorted term lists (see section 11.3).
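A sketch of this kind of term sort follows; the names and the exact scoring combination are illustrative, standing in for the combination of a noise/IDF-like collection measure with total frequency in the relevant set described above:

```python
from collections import Counter
from math import log

def expansion_candidates(relevant_docs, query_terms, N, df, top_k=20):
    """Rank candidate expansion terms from the relevant documents by
    (total frequency in the relevant set) * (IDF-like collection
    weight), excluding terms already in the query.
    relevant_docs: list of token lists; df: collection document
    frequency per term; N: collection size."""
    freq = Counter(t for doc in relevant_docs for t in doc)
    scores = {t: f * log(N / df.get(t, 1))
              for t, f in freq.items() if t not in query_terms}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In an interactive setting the list would be shown to the user for filtering rather than added wholesale; in the experiments above, adding many more than about 20 terms hurt performance.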
11.2.5 Query Expansion with Term Reweighting
The vast majority of relevance feedback and query expansion research has used both query expansion and term reweighting. The early SMART experiments (see section 11.2.1) added term vectors to effectively reweight the query terms and to expand the query. Researchers using probabilistic indexing techniques for reweighting also tried some specific experiments in query expansion. Harper and van Rijsbergen (1978) used the probabilistic model to reweight query terms, using all relevant document information for weighting, and expanded the query by adding all terms directly connected to a query term by a maximum spanning tree (MST) technique of term-term clustering. Additionally, they developed a new relevance weighting scheme, the EMIM weighting measure (see Harper 1980 for implementation details of EMIM). This weighting scheme better handles the parameter estimation problem seen in the Robertson-Sparck Jones measure (1976) when a query term appears in no retrieved relevant documents, although it is not as theoretically optimal as the Robertson-Sparck Jones measure. The EMIM weighting scheme with query expansion by MST showed significant improvements
(using all relevant documents) over the formula used by Robertson and Sparck Jones. Using only 10 or 20 documents for feedback, the EMIM reweighting with expansion still showed significant improvements over the old formula. It should be noted, however, that two effects are compounded in this experiment: reweighting using an improved weighting scheme (EMIM) and query expansion using a MST technique. Harper's thesis (1980) elaborated on the area of query expansion and reweighting. Using the EMIM relevance weighting scheme, he tried the same MST expansion on the manually indexed Cranfield 1400 collection, a Cranfield 1400 collection with only titles for documents, and the larger UKCIS2 collection. The improvements for query expansion were much less with the Cranfield titles and the UKCIS2 collection than for the manually indexed Cranfield collection, largely because the MST had much shorter documents to use for expansion. In addition to the expansion using the MST, Harper tried expanding queries using a selection of terms from retrieved relevant documents. He selected these terms by ranking a union of all terms in the retrieved relevant documents using the EMIM measure, and then selecting a given number from the top of this list. He found significant performance improvements for the manually indexed Cranfield collection over that using the MST expansion, but no improvement for the UKCIS2 collection. Performance improvements for query expansion seem to be heavily dependent on the test collection being used. The Cranfield collection, either as manually indexed, or with automatically indexed abstracts, consistently shows performance improvements for query expansion, either using the MST method or expanding by terms from relevant documents. Collections with short documents such as the Cranfield title collection or the UKCIS collection probably do not provide enough terms to make expansion effective. 
A further example of this effect is shown in Smeaton and van Rijsbergen (1983), where an extensive set of different query expansion methods showed no improvements when run using the NPL test collection, a collection whose documents are indexed by only 20 terms apiece on the average.

Wu and Salton (1981) experimented with term relevance weighting (also known as term precision weighting), a method for reweighting using relevance feedback that is very similar to the Robertson-Sparck Jones formula. They tried reweighting and expanding the query by all the terms from relevant documents. They found a 27 percent improvement in average precision for the small (424 document) Cranfield collection using reweighting alone, with an increase up to 32.7 percent when query expansion was added to reweighting. There was more improvement with the Medlars 450 collection, with 35.6 percent improvement for reweighting alone and 61.4 percent for both reweighting and expansion.

A recent paper by Salton and Buckley (1990) compared twelve different feedback procedures, involving two different levels of query expansion, across six different test collections. The standard three-vector feedback methods were used, in addition to other probabilistic feedback methods. Concentrating only on the vector feedback methods, the three they used were as follows.
    Ide (regular):  Q1 = Q0 + sum(i = 1 to n1) Ri - sum(i = 1 to n2) Si

    Ide (dec-hi):   Q1 = Q0 + sum(i = 1 to n1) Ri - S1

    Rocchio:        Q1 = alpha Q0 + beta (1/n1) sum(i = 1 to n1) Ri - gamma (1/n2) sum(i = 1 to n2) Si

where

Q0 = the vector for the initial query
Q1 = the vector for the modified query
Ri = the vector for relevant document i
Si = the vector for nonrelevant document i (S1 being the top-ranked nonrelevant document)
n1 = the number of relevant documents
n2 = the number of nonrelevant documents

The basic operational procedure in these three methods is the merging of document vectors and original query vectors. This automatically reweights query terms by adding the weights from the actual occurrence of those query terms in the relevant documents, and subtracting the weights of those terms occurring in the nonrelevant documents. Queries are automatically expanded by adding all the terms not in the original query that are in the relevant documents and nonrelevant documents, using both positive and negative weights based on whether the terms are coming from relevant or nonrelevant documents (no new terms are actually added with negative weights; the contribution of nonrelevant document terms is to modify the weighting of new terms coming from relevant documents).

The Ide dec-hi method only uses the top nonrelevant document for feedback, instead of all nonrelevant documents retrieved within the first set shown the user. The Rocchio method both allows adjustment for the relative input of relevant and nonrelevant documents and uses a weighting scheme based on a normalized version of the document weights rather than the actual document weights themselves.

The results favored the Ide dec-hi method for all six collections, although the other methods follow closely behind. Setting gamma = 0.25 and beta = 0.75 for the Rocchio method created the best results, limiting the effects of negative feedback. These results confirm earlier work on the smaller collections by the SMART project in which Ide found no major significant difference between performance of the three strategies on average, but found that the "relevant only" strategies worked best for some queries, and other queries did better using negative feedback in addition.
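On sparse term-weight vectors, represented here as Python dicts (an illustrative choice, not how SMART stored vectors), the Ide dec-hi and Rocchio updates might look like:

```python
def _axpy(q, v, scale):
    # q <- q + scale * v, over sparse dict vectors
    for t, w in v.items():
        q[t] = q.get(t, 0.0) + scale * w
    return q

def _prune(q, q0):
    # new terms whose net weight is not positive are dropped: negative
    # evidence only moderates terms contributed by relevant documents
    return {t: w for t, w in q.items() if t in q0 or w > 0}

def ide_dec_hi(q0, rel, top_nonrel):
    """Q1 = Q0 + sum of relevant vectors - top-ranked nonrelevant vector."""
    q = dict(q0)
    for r in rel:
        _axpy(q, r, 1.0)
    _axpy(q, top_nonrel, -1.0)
    return _prune(q, q0)

def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.25):
    """Q1 = alpha*Q0 + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    q = {t: alpha * w for t, w in q0.items()}
    for r in rel:
        _axpy(q, r, beta / len(rel))
    for s in nonrel:
        _axpy(q, s, -gamma / len(nonrel))
    return _prune(q, q0)
```

The beta = 0.75, gamma = 0.25 defaults mirror the settings reported as best above.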
Two types of document weighting were used in these experiments: a binary weighting and weighting done using the best SMART weighting schemes involving a combination of within-document frequencies of terms and the IDF measure for these terms. It was concluded that using appropriate term weighting for the documents (as opposed to binary weighting schemes) is important for most collections. This agrees with Croft's results (1983) using within-document frequencies in addition to reweighting by term distribution in relevant and nonrelevant documents. They also found differences in performance across test collections and suggested that the types of data to be used in retrieval should be examined in
selecting feedback and query expansion mechanisms (see section 11.4 on Recommendations for Use of Relevance Feedback).
11.2.6 Other Research in Relevance Feedback
Noreault (1979) tried an entirely different approach to relevance feedback. After users had selected one or more relevant documents, he used those documents as a "new query," effectively ranking (using the cosine correlation as implemented in the SIRE system) all the documents in the collection against them. The top 30 of these retrieved documents were then added to those initially selected by the original query. He found on average that this added 5.1 new relevant documents per query. This feedback method would work with any type of retrieval system, such as Boolean systems, provided some method existed for selecting related documents, as the query itself is not modified during the retrieval process.

Attar and Fraenkel (1981) used an approach based on local feedback only. They produced an ordered list of the terms in all the documents retrieved in a first iteration search, with the list ordered by the frequency of a term in that set and by its "distance" from the initial query. These terms were shown to a user for selection as new query terms or sifted by an automatic procedure for addition to the query. They found that both expert and nonexpert users could correctly select terms from the suggested list, but that their automatic sifting mechanism could not do this well. Note that terms are selected from all retrieved documents, not just relevant ones, so this technique could be used as a feedback technique for queries retrieving no relevant documents on the first iteration.

Dillon et al. (1983) used a hybrid approach to provide relevance feedback in a Boolean environment. They devised a new query based only on terms from previously retrieved documents, with terms weighted by a formula similar to the term precision formula used by Salton (but with significant differences). The weighted terms were not used, however, to produce a ranked list of retrieved documents, but were used to automatically construct a revised Boolean query.
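Noreault's step of treating the selected documents as a new query can be sketched as follows (illustrative names; cosine correlation as in the SIRE description above):

```python
from math import sqrt

def cosine(u, v):
    # cosine correlation between two sparse dict vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def docs_as_query(selected_docs, collection, top_n=30):
    """Merge the user-selected relevant documents into a single vector
    and rank the whole collection against it; the original query is
    never modified, so this can sit in front of any retrieval system."""
    merged = {}
    for d in selected_docs:
        for t, w in d.items():
            merged[t] = merged.get(t, 0.0) + w
    return sorted(collection,
                  key=lambda d: cosine(merged, d), reverse=True)[:top_n]
```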
Results suggest that this method can be effective if very careful attention is paid to the construction of the revised Boolean query.
11.3 USE OF RELEVANCE FEEDBACK IN ONLINE RETRIEVAL SYSTEMS
One of the earliest (and few) applications of relevance feedback in an operational system was by Vernimb (1977) in the European Nuclear Documentation System (ENDS). This system has an automatic procedure based on a user/system dialogue, removing the need for a user to deal with a Boolean retrieval system. The starting position for this approach is at least two relevant documents, either obtained by a manual search, or by any type of search using a Boolean or ranking retrieval system. The ENDS system then automatically constructs a Boolean query using an elaborate procedure based on the co-occurrence of terms in these known relevant documents, and the user is shown a new set of documents for another
iteration of relevance judgments and automatic query adjustments. This procedure is further modified by a term-weighting scheme based on the SMART model that allows the list of documents shown the user to be ranked based on occurrences of terms in the set of relevant documents versus the set of nonrelevant documents. The users of this system found it very satisfactory, and some limited experimental results showed that its performance was considerably better than traditional procedures (such as traditional Boolean retrieval).

Another type of man-machine dialogue using relevance feedback was developed by Oddy (1977) in his THOMAS system. Although this system was purely experimental, it was designed to be a prototype system for the end-user and therefore resembles an on-line retrieval system more than a system designed for retrieval experiments. In this system the user inputs a natural language phrase or sentence and the system returns a single reference based on a search in the concept network (a network built by the MEDUSA system at the University of Newcastle for access to MEDLARS records). This reference contains both a list of authors of the reference and a list of index terms associated with that reference. The user is asked to not only judge the relevance of that reference, but to select index terms that are relevant (or optionally insert new index terms). The query is appropriately modified by the system and a new reference is sought. This type of query modification allows users to browse the references and, although it is not as automatic as the ENDS system, it gives a user more explicit control over the browsing process.

The CUPID system (Porter 1982) implemented the probabilistic retrieval model at the University of Cambridge. This system takes a natural language query, returns a ranked list of titles, and asks the user for relevance judgments.
These judgments are used for two distinct purposes: (1) to reweight the original query terms using the Robertson-Sparck Jones reweighting formula, and (2) to select a ranked set of terms from the relevant documents to submit to the user for possible addition to the query. This process can be iterated repeatedly until the user has sufficient information. An improved version of CUPID, the MUSCAT system, is currently running at the Scott Polar Research Institute, Cambridge (Porter and Galpin 1988).

Doszkocs's CITE system (1982) was designed as an interface to MEDLINE and was used as an on-line catalog system (Doszkocs 1983). It combined a ranking retrieval method (see Chapter 14 for more on the CITE system) with an effective search strategy for the huge document files and the controlled and full text vocabulary used in MEDLINE. CITE returned a list of ranked references based on a user's natural language query, and asked for relevance judgments from the user. The system used these judgments to display a ranked list of medical subject headings (ranked by a frequency analysis of their use in the retrieved documents). The user then selected terms for adding to the query and the search was repeated.

Cirt, a front end to a standard Boolean retrieval system, uses term-weighting, ranking, and relevance feedback (Robertson et al. 1986). The system accepts lists of terms without Boolean syntax and converts these terms into alternative Boolean searches for searching on the Boolean system. The user is then shown sets of documents, defined by Boolean combinations of terms, in rank order as specified by a special weighting and ranking algorithm (Bovey and Robertson 1984). If relevance judgments are made,
Relevance feedback plays a crucial role in the Connection Machine retrieval system (Stanfill and Kahle 1986). and displays these in order of highest frequency of occurrence for user selection. even though the users thought they had retrieved most of the relevant documents (and felt they needed all the relevant documents). Some users may never be interested in a high-recall search. The first approach. It analyzes the frequency of single words. An experiment by Fox (1987) showed little use of relevance feedback in a student experiment.1 Relevance Feedback Methodologies It should be stated again that relevance feedback and/or query modification are not necessary for many queries. only about 20 percent of the relevant documents were retrieved. the Associative Interactive Dictionary on the TOXLINE system (Doszkocs 1978). and may possibly be of marginal use in some operational systems. and displays a list of terms ranked on these statistics. the ZOOM facility of ESA-IRS Online search (Ingwersen 1984). file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD.Information Retrieval: CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY feedback takes the form of appropriately reweighting terms to produce different ranked orders of documents. If a decision is made to offer relevance feedback or some type of query expansion technique. causing very large queries that can take advantage of the speed offered by the massively parallel architecture.4 RECOMMENDATIONS FOR USE OF RELEVANCE FEEDBACK 11. Here a query is automatically expanded by adding all the terms in the relevant documents. others may not be willing to spend any extra effort in retrieving more relevant documents or may be unaware of the low recall of their initial search. In a test of a large operational full-text retrieval system (Blair and Maron 1985).. The second.4.ooks_Algorithms_Collection2ed/books/book5/chap11. 11. 
This system relies heavily on co-occurrence of terms and therefore needs large numbers of query terms to work effectively. or codes appearing in the set of documents. phrases. or in a single good bibliographic reference for introduction to a new area of interest. Some situations call for high recall. however. then the appropriate method depends on two issues: the type of retrieval system being used for initial searching. Often users are only interested in an "answer" to their question (such as a paragraph in an online manual). generalizes the first approach to deal with a complete set of retrieved documents. Two approaches for query expansion produce lists of related terms in ranked order for user selection and addition to the query. does a statistical analysis of the terms appearing in relevant documents in comparison to their appearance in all documents. are of marginal use to many users.. probably because it was "not explained adequately" and because the nine test queries did not need further modification.htm (13 of 24)7/3/2004 4:20:48 PM .
Vernimb.. and systems based on ranking using either an ad hoc combination of term-weighting schemes or using the probabilistic indexing methods.ooks_Algorithms_Collection2ed/books/book5/chap11. and the second method would help a user zero in on new terms by narrowing the term list to only those terms in the relevant documents. systems based on ranking using a vector space model. The second option would be to produce ranked lists of terms for user selection (such as the ZOOM feature on ERA-IRS [Ingwersen 1984]). they offer the greatest improvement in performance. This option leaves the Boolean mechanism intact. [Radecki 1988]) require major modifications to the entire system and can be difficult to tune. Experiments showed that the Ide dec-hi method is the best general purpose feedback technique for the vector space model. The first method would be useful in helping a user construct a better query by pointing out the term distribution in the entire retrieved document set (relevant and nonrelevant). Front-end construction or specially modified Boolean systems (see again the special issue of Information Processing and Management. two options seem feasible. then the feedback methods developed in the SMART project are appropriate. but improves the query by suggesting alternative terms and by showing the user the term distribution in the retrieved set of documents. Feedback in Boolean retrieval systems If a Boolean retrieval system is being used. The ranking of the list could be done by term frequency within the entire retrieved set (the simplest method) or by more complex methods taking account of term distribution within only the relevant documents (such as Doszkocs's AID system [Doszkocs 1978]). However. or to graft a ranking system onto the Boolean system to handle all but the initial query (such as done by Noreault). where file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. the vector space model automatically combines the two components of feedback. 
two characteristics have been found to be important experimentally: the length of the documents (short or not short). making implementation of relevance feedback more straightforward using this method but somewhat less flexible. Feedback in retrieval systems based on the vector space model If the retrieval system is based on the vector space model using a cosine correlation as a ranking measure. and the type of indexing (controlled or full text). The most difficult option would be to modify the system to specially handle relevance feedback either by designing frontends to automatically construct modified Boolean queries (such as was done by Dillon. Note that the effects of the type of data being used for retrieval are minimal for this type of feedback as user selection is providing the term filter rather than automatic term distribution statistics. term reweighting and query expansion.. and the Cirt system). In terms of data characteristics. As discussed earlier. It uses minimal negative feedback and can be implemented without concern for deciding parameter settings. Possibly other factors exist. Three basic retrieval systems are being addressed here: Booleanbased systems. but there is a lack of direct experimentation about these issues.htm (14 of 24)7/3/2004 4:20:48 PM .Information Retrieval: CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY and the type of data that is being used.
As noted earlier, the terms to be added to the query are automatically pulled from a sorted list of new terms taken from relevant documents, with no specific relationship to the original query terms. These terms are sorted by their total frequency within all retrieved relevant documents, and the number of terms added is the average number of terms in the retrieved relevant documents. A possible alternative to this selection method would be to substitute user selection from the top of the list rather than adding a fixed number of terms. The Ide dec-hi reweighting is

Q1 = Q0 + sum of Ri - S

where

Q0 = the vector for the initial query
Ri = the vector for relevant document i
S = the vector for top nonrelevant document

The use of normalized vector space document weighting is highly recommended for most data rather than binary weighting schemes (see section 14.2.5 of Chapter 14 for details on these weighting schemes), and those weights should be chosen based on the criteria presented there.

Feedback in retrieval systems based on other types of statistical ranking

If the retrieval system is based on either the probabilistic indexing model or an ad hoc combination of term-weighting schemes, then the term-weighting and query expansion can be viewed as two separate components of feedback. The term-weighting schemes should be carefully examined to determine the correct method for implementing reweighting based on relevance feedback. This means that whereas all query terms will be reweighted during query modification, the reweighting schemes only affect the weighting based on the importance of a term within an entire collection (global importance). Term-weighting based on term importance within a given document (local importance, such as those based on normalized within-document frequencies) should not be changed, nor should term-weightings based on document structure or other such considerations.
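The Ide dec-hi update above can be sketched over sparse term-weight vectors. This is a minimal illustration under the definitions just given; the variable names are our own.

```python
def ide_dec_hi(q0, relevant, top_nonrelevant):
    """Ide dec-hi feedback: Q1 = Q0 + sum of relevant document vectors
    minus the single top-ranked nonrelevant document vector.

    Vectors are dicts mapping term -> weight (sparse representation).
    """
    q1 = dict(q0)
    for r in relevant:
        for term, w in r.items():
            q1[term] = q1.get(term, 0.0) + w
    for term, w in top_nonrelevant.items():
        q1[term] = q1.get(term, 0.0) - w
    return q1

# Hypothetical query and document vectors:
q0 = {"information": 1.0, "retrieval": 1.0}
rel_vectors = [{"information": 0.5, "feedback": 0.8}]
nonrel_vector = {"retrieval": 0.3, "noise": 0.4}
q1 = ide_dec_hi(q0, rel_vectors, nonrel_vector)
```

Note how query expansion happens automatically: terms from the relevant documents ("feedback" here) enter the query vector with positive weight, while terms from the top nonrelevant document are pushed negative.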
The approach taken by Croft (1983) and described earlier in section 11.3 is the recommended model to follow in reweighting methodology, in which he simply replaced the portion of the term-weighting based on global term importance (the IDF in this case) with a revised version of the Robertson-Jones weighting formula:

Wijk = (C + ntfik) x wtij

wtij = log [ pij (1 - qij) / ( qij (1 - pij) ) ], with pij estimated by r/R and qij by (n - r)/(N - R)

ntfik = K + (1 - K) x freqik / maxfreqk

where

Wijk = the term weight for term i in query j and document k
IDFi = the IDF weight for term i in the entire collection
pij = the probability that term i is assigned within the set of relevant documents for query j
qij = the probability that term i is assigned within the set of nonrelevant documents for query j
r = the number of relevant documents for query j having term i
R = the total number of relevant documents for query j
n = the number of documents in the collection having term i
N = the number of documents in the collection
freqik = the frequency of term i in document k
maxfreqk = the maximum frequency of any term in document k

On the initial search, wtij is simply IDFi; after feedback, the relevance weighting above is used, allowing the IDF or the relevance weighting to be the dominant factor. Two adjustable constants, C and K, provide methods for adjusting the weighting scheme for different types of data. Based on the experiments previously reported, a reasonable estimate for C would be 0 for automatically indexed collections or for feedback searching. Only for manually indexed collections should C be set higher, to allow the mere existence of a term within a document to carry more weight. The setting of the constant K should be around 0.3 for the initial search of regular length documents (i.e., documents having many multiple occurrences of a term), rising to around 0.5 for feedback searches. This allows the within-document frequency to play a large role in the initial search, and a somewhat diminished role in subsequent searches.

The only decision to be made using this type of relevance feedback is how much query expansion to allow. Since results suggest that query expansion using all terms from the retrieved relevant documents may be only slightly better than a selection of those terms (Salton and Buckley 1990), it is recommended that only a limited number of terms be added (mainly to improve response time).
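The weighting scheme above can be sketched as follows. This is an illustrative reading of the formulas, not the chapter's own code; in particular, the relevance weight below uses the common 0.5-corrected Robertson-Sparck Jones estimate, which is an assumption on our part.

```python
import math

def ntf(freq_ik, maxfreq_k, K):
    """Normalized within-document frequency, damped by constant K."""
    return K + (1.0 - K) * freq_ik / maxfreq_k

def relevance_weight(r, R, n, N):
    """Relevance weight from feedback counts, using the usual
    0.5 corrections to avoid zero probabilities (an assumption here)."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

def term_weight(freq_ik, maxfreq_k, global_wt, C, K):
    """W_ijk: a global weight (IDF on the initial search, the
    relevance weight after feedback) scaled by C plus the normalized
    within-document frequency."""
    return (C + ntf(freq_ik, maxfreq_k, K)) * global_wt
```

With C = 0 the mere presence of a term contributes nothing beyond its frequency component; raising K shrinks the influence of within-document frequency, as the text recommends for feedback searches.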
In subsequent searches it is assumed that the relevance weighting becomes more and more important. For short documents, such as titles or very short abstracts, the within-document frequency plays a minimal role and should be either removed from the scheme (such as suggested by Salton and Buckley [1990]) or downweighted by setting K from 0.5 to 0.7. If other weighting schemes based on criteria such as document structure have been used, then it is likely that the overall importance of these schemes in the total weighting needs to be modified in a manner similar to the downweighting of document frequency weights using the constant K. This will depend on the relative importance of these additional weighting schemes with respect to the importance of matching a given keyword, and will require some experimentation in order to properly introduce relevance feedback to these schemes.

The weighting scheme above only affects the ranks of documents containing terms originally in the query. To be truly effective, some type of query expansion needs to be done. Two alternative sources exist for new query terms. First, the query could be expanded by maximal spanning tree or nearest neighbor techniques (or other automatic thesaurus methods), offering users a selection of terms that are the terms most closely related to the individual query terms. These terms should be in a single list, grouped by expanded terms for each query term, with users selecting appropriate terms for addition to the query. Although this expansion method works even in situations with no relevant documents retrieved, it has been shown that query expansion by related terms is not as effective as expansion by terms from relevant documents.

The second alternative method is therefore to present users with a sorted list of terms from the relevant documents, with the list sorted on the total frequency of the given term within the entire set of retrieved relevant documents (possibly modified by the IDF or normalized noise measure of the term), with users selecting appropriate terms for addition to the query. Care should be taken when using either of the two query expansion methods on collections of short documents, such as ones with less than 25 index terms, as it has been shown in the past that these expansion methods are unreliable for short documents. User selection becomes even more critical in this situation.

Whereas user selection is strongly recommended, it would be possible to use either of these query expansion techniques automatically, adding either the top 20 or so terms from a sorted list of terms from the relevant documents, or adding the closest related term using the automatic thesaurus method for each input query term (see Harman 1992).

A somewhat different approach to user term selection is to get more initial input from the user, such as possible related terms, relative importance of query terms, and important grouping of those terms (phrases). Croft and Das (1990) showed that using this technique in combination with relevance feedback with user-selected terms improved results even more than feedback alone, with only a 35 percent overlap in terms found in relevant documents with those suggested by the users initially.
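The second source of expansion terms, ranked by total frequency in the relevant documents and optionally modified by IDF, can be sketched as follows. Function and parameter names are illustrative only.

```python
import math
from collections import Counter

def expansion_candidates(relevant_docs, query_terms, N, df, use_idf=False):
    """Sort candidate expansion terms by total frequency within the
    retrieved relevant documents, optionally modified by IDF.

    relevant_docs: list of term lists, one per relevant document
    query_terms:   terms already in the query (excluded as candidates)
    N:             collection size; df: dict of term -> document frequency
    """
    totals = Counter()
    for terms in relevant_docs:
        totals.update(terms)
    scored = []
    for term, tf in totals.items():
        if term in query_terms:
            continue  # only new terms are expansion candidates
        score = tf * math.log(N / df[term]) if use_idf else tf
        scored.append((score, term))
    scored.sort(reverse=True)
    return [term for _, term in scored]

# Hypothetical relevant documents and collection statistics:
cands = expansion_candidates(
    [["feedback", "term", "feedback"], ["term", "query"]],
    query_terms={"query"}, N=100,
    df={"feedback": 5, "term": 50, "query": 10}, use_idf=True)
```

With the IDF modification, a rare term like "feedback" outranks an equally frequent but common term, which matches the intent of weighting the list by term specificity.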
11.5 SOME THOUGHTS ON THE EFFICIENT IMPLEMENTATION OF RELEVANCE FEEDBACK OR QUERY MODIFICATION

Because of the lack of use of relevance feedback in operational systems in general, little work has been done in achieving efficient implementations for large data sets of the recommended feedback algorithms discussed in section 11.4. This section therefore contains only guidelines for efficient implementations, rather than descriptions of actual implementations that have been fully tested. The first part of the section lists the data and data structure needs for relevance feedback, with discussions of alternative methods of meeting these needs, and the second part of the section contains a proposal expanding the basic ranking system described in section 14.6 of Chapter 14 to include relevance feedback and other query modification.

11.5.1 Data and Structure Requirements for Relevance Feedback and Query Modification

The major data needed by relevance feedback and other query modification techniques is a list of the terms contained in each retrieved document, either through past storage or through on-line parsing. Unfortunately, most of the retrieval systems described in this chapter and Chapter 14 involve the use of inverted files for fast response time. This means that the index lists the documents containing each term, but not the terms contained in each document. For large data sets, this inverted file is the only supplemental file stored because of space issues; the input text is kept for display purposes only, with no lists being kept of all the significant words in each document.

For small collections, lists of the terms within each document can be kept (see Porter [1988] for a description of the necessary structures used in the MUSCAT system). For larger data sets, an alternative method would be to parse the retrieved documents in the background while users are looking at the document titles. Although this may be time consuming, the prototype experience of Harman and Candela (1990) using a background secondary string search on retrieved records indicates that this additional work can be accomplished with minimal effect on response time. Even a slight response time delay for feedback may be preferable to the high overhead of storage needed for the lists of terms in a per-document order.

Assuming that these term lists are available, these terms become input to structures designed for reweighting of query terms and for collecting lists of terms for query expansion. A structure needs to be built to merge a list of all terms in relevant documents, keeping track of the total number of postings (documents containing one or more occurrences of the term) and the total frequency of terms within all relevant documents. If negative feedback is used, such as in Rocchio's algorithm, then an additional list needs to be kept for nonrelevant documents. These lists could be implemented as sets of sorted lists using some type of binary search and insert algorithm, or using an efficient binary tree method such as the right-threaded tree used in the indexing method described in Chapter 3 on Inverted Files.
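The merged list just described, kept sorted and updated by binary search and insert, can be sketched in Python. This is a minimal illustration of the data structure's bookkeeping, not the book's implementation; class and field names are our own.

```python
import bisect

class MergedTermList:
    """Sorted term list built by binary search and insert, tracking for
    each term its postings count (number of merged documents containing
    it) and its total frequency across all merged documents."""

    def __init__(self):
        self.terms = []     # sorted term strings
        self.postings = []  # parallel list: documents containing the term
        self.freq = []      # parallel list: total occurrences

    def add_document(self, term_frequencies):
        """Merge one document's term -> frequency map into the list."""
        for term, f in term_frequencies.items():
            i = bisect.bisect_left(self.terms, term)
            if i < len(self.terms) and self.terms[i] == term:
                self.postings[i] += 1
                self.freq[i] += f
            else:
                self.terms.insert(i, term)
                self.postings.insert(i, 1)
                self.freq.insert(i, f)

# Merge two hypothetical relevant documents:
merged = MergedTermList()
merged.add_document({"feedback": 2, "query": 1})
merged.add_document({"feedback": 1, "ranking": 3})
```

A second instance of the same structure would hold the nonrelevant documents when Rocchio-style negative feedback is used.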
This structure is the major data structure needed for query term reweighting and for production of lists of sorted terms for query expansion. Suggestions for the use of this structure are given in section 11.5.2.

If some type of automatic thesaurus method is to be used in query expansion, then additional storage may be needed. This storage could be minimized by attaching the closest related term (or the closest two or three terms) to each term in the dictionary portion of the inverted file; alternatively, there could be extensive structures needed for more sophisticated methods using complex interrelationships.

11.5.2 A Proposal for an Efficient Implementation of Relevance Feedback

This proposal is based on the implementation of the ranking system as described in section 14.6 of Chapter 14. It should be made clear that the modifications for feedback suggested in this section have not been implemented, and therefore should be used only as an illustration of implementations rather than as thoroughly tested techniques. Nevertheless, the issues discussed in this section apply to any efficient implementation of relevance feedback.

The inverted file would be constructed exactly as described in section 14.5 of that chapter, expanded to include related terms. The inverted file as shown in Figure 11.1 forms a natural extension of Figure 14.3 in Chapter 14, with two modifications: the terms are complete words rather than stems, and there are two additional fields holding closely related terms, with the two related terms being added using some type of term-term relating algorithm (see Chapter 16 on clustering for details of these algorithms). These modifications allow query expansion using some type of automatic thesaurus, such as a nearest neighbor algorithm.

Figure 11.1: Dictionary and Postings File with Related Terms

The use of the full terms (as compared to stems only) is necessary because of the addition of the related terms. Stems could be used if some extension to the basic retrieval system to permit the use of both stems and full terms were in effect (see section 14.1 of Chapter 14 for an example of such a system).

The searching itself would proceed as described in the basic retrieval system. After a ranked list of documents is retrieved, users would have an option of doing relevance feedback rather than submitting a new query. If the user selects to start relevance feedback, he or she would be asked to mark retrieved documents that are relevant and to indicate to the system the last document scanned during retrieval.
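A dictionary record of the kind sketched in Figure 11.1 might be laid out as below. This is purely illustrative; the field names, weights, and document names are hypothetical, not taken from the figure.

```python
# Hypothetical layout for the dictionary entries of Figure 11.1: each
# full (unstemmed) term carries its postings plus two closely related
# terms precomputed by a term-term relating algorithm.
dictionary = {
    "feedback": {
        "postings": [("doc1", 0.41), ("doc7", 0.33)],  # (doc, weight)
        "related": ("relevance", "query"),             # two closest terms
    },
    "query": {
        "postings": [("doc2", 0.52)],
        "related": ("search", "request"),
    },
}

def related_term_suggestions(query_terms):
    """Term list shown when no relevant documents are marked: the
    related terms previously stored in the inverted file."""
    suggestions = []
    for t in query_terms:
        if t in dictionary:
            suggestions.extend(dictionary[t]["related"])
    return suggestions
```

Because the related terms live in the dictionary itself, suggesting them requires no extra file accesses beyond the dictionary lookup the search already performs.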
The system would start building the merged list of terms previously described in the background as soon as the marking starts. After the user indicates that all relevant documents have been marked, the system recomputes the weights of the original query terms based on the reweighting schemes recommended in section 11.4. Additionally, the query is expanded by this merged term list, either by automatically adding the terms with the highest "rank" in the list (again see section 11.4 for recommended methods of sorting the list) or, preferably, by showing the user the sorted list and adding only those terms selected by the user. If no relevant documents are indicated, no reweighting is done, and the term list shown to the user is the list of related terms based on those previously stored in the inverted file. The expanded and reweighted query is run as if it were an initial query, but all documents marked as scanned by the user do not appear in the revised retrieved list.

Two further modifications are necessary to the basic searching routine to allow feedback. First, the weights stored in the postings must be only the normalized document frequency weights (option 2 in section 14.1 of Chapter 14) rather than the combined weighting using the normalized document frequency weights and the IDF. This permits the IDF to be pulled from the inverted file on the initial search and combined with the normalized document frequency weight, but allows the new relevance weights to be combined with the normalized document frequency weights on subsequent searches. The second modification is not strictly necessary but probably would ensure adequate response time. The basic retrieval algorithm is time-dependent on the number of query terms. As the number of query terms grows large, the number of retrieved documents also grows large, and the final sort for the list of documents becomes very time-consuming. The pruning modifications described in section 14.5 of Chapter 14 become critical to good performance for long queries. Some of the more elaborate pruning methods described in that section may be even more appropriate, as they eliminate even the need to retrieve some documents.

11.6 SUMMARY

This chapter has presented a survey of relevance feedback techniques both for reweighting query terms and for adding new query terms. Recommendations and guidelines for building relevance feedback components have been given. More experimentation in the use of relevance feedback in operational systems is needed to verify the guidelines in this proposal. Relevance feedback has clearly proven itself in the laboratory, and now needs only efficient implementations to hasten its use in operational systems.

REFERENCES

ATTAR, R., and A. S. FRAENKEL. 1981. "Experiments in Local Metrical Feedback in Full-Text Retrieval Systems." Information Processing and Management, 17(3), 115-26.

BLAIR, D. C., and M. E. MARON. 1985. "An Evaluation of the Retrieval Effectiveness of a Full-Text Document-Retrieval System." Communications of the ACM, 28(3), 289-311.
BELEW, R. K. 1989. "Adaptive Information Retrieval: Using a Connectionist Representation to Retrieve and Learn About Documents." Paper presented at ACM Conference on Research and Development in Information Retrieval, Boston, Mass.

BOVEY, J. D., and S. E. ROBERTSON. 1984. "An Algorithm for Weighted Searching on a Boolean System." Information Technology: Research and Development, 3(1), 84-87.

CROFT, W. B. 1983. "Experiments with Representation in a Document Retrieval System." Information Technology: Research and Development, 2(1), 1-21.

CROFT, W. B., and R. DAS. 1990. "Experiments in Query Acquisition and Use in Document Retrieval Systems." Paper presented at 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium.

CROFT, W. B., and D. J. HARPER. 1979. "Using Probabilistic Models of Document Retrieval Without Relevance Information." Journal of Documentation, 35(4), 285-95.

CROUCH, C. J. 1988. "A Cluster-Based Approach to Thesaurus Construction." Paper presented at ACM Conference on Research and Development in Information Retrieval, Grenoble, France.

DILLON, M., J. ULMSCHNEIDER, and J. DESPER. 1983. "A Prevalence Formula for Automatic Relevance Feedback in Boolean Systems." Information Processing and Management, 19(1), 27-36.

DOSZKOCS, T. E. 1978. "AID, an Associative Interactive Dictionary for Online Searching." Online Review, 2(2), 163-74.

DOSZKOCS, T. E. 1982. "From Research to Application: The CITE Natural Language Information Retrieval System," in Research and Development in Information Retrieval, eds. G. Salton and H. Schneider, pp. 251-62. Berlin: Springer.

DUMAIS, S. T. 1990. "Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval." Unpublished manuscript.

FOX, E. A. 1987. "Testing the Applicability of Intelligent Methods for Information Retrieval." Information Services & Use, 7, 119-38.

FUHR, N. 1989. "Models for Retrieval with Probabilistic Indexing." Information Processing and Management, 25(1), 55-72.

GUNTZER, U., G. JUTTNER, G. SEEGMULLER, and F. SARRE. 1989. "Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions." Information Processing and Management, 25(3), 265-73.

HALL, H., and N. WEIDERMAN. 1967. "The Evaluation Problem in Relevance Feedback." Report No. ISR-12 to the National Science Foundation from Department of Computer Science, Cornell University.

HARMAN, D. 1986. "An Experimental Study of Factors Important in Document Ranking." Paper presented at ACM Conference on Research and Development in Information Retrieval, Pisa, Italy.

HARMAN, D. 1988. "Towards Interactive Query Expansion." Paper presented at ACM Conference on Research and Development in Information Retrieval, Grenoble, France.

HARMAN, D. 1992. "Relevance Feedback Revisited." Paper presented at ACM Conference on Research and Development in Information Retrieval, Copenhagen, Denmark.

HARMAN, D., and G. CANDELA. 1990. "Retrieving Records from a Gigabyte of Text on a Minicomputer Using Statistical Ranking." Journal of the American Society for Information Science, 41(8), 581-89.

HARPER, D. J. 1980. Relevance Feedback in Document Retrieval Systems: An Evaluation of Probabilistic Strategies. Doctoral dissertation, Jesus College, Cambridge, England.

HARPER, D. J., and C. J. VAN RIJSBERGEN. 1978. "An Evaluation of Feedback in Document Retrieval Using Co-Occurrence Data." Journal of Documentation, 34(3), 189-216.

IDE, E. 1969. "Relevance Feedback in an Automatic Document Retrieval System." Report No. ISR-15 to the National Science Foundation from Department of Computer Science, Cornell University.

IDE, E. 1971. "New Experiments in Relevance Feedback," in The SMART Retrieval System, ed. G. Salton, pp. 337-54. Englewood Cliffs, N.J.: Prentice Hall.

INGWERSEN, P. 1984. "A Cognitive View of Three Selected Online Search Facilities." Online Review, 8(5), 465-92.

LANCASTER, F. W. 1969. "MEDLARS: Report on the Evaluation of Its Operating Efficiency." American Documentation, 20(2), 119-48.

LESK, M. E. 1969. "Word-Word Associations in Document Retrieval Systems." American Documentation, 20(1), 8-36.

MARON, M. E., and J. L. KUHNS. 1960. "On Relevance, Probabilistic Indexing and Information Retrieval." Journal of the Association for Computing Machinery, 7(3), 216-44.

MINKER, J., G. A. WILSON, and B. ZIMMERMAN. 1972. "An Evaluation of Query Expansion by the Addition of Clustered Terms for a Document Retrieval System." Information Storage and Retrieval, 8(6), 329-48.

NOREAULT, T. 1979. User Directed Relevance Feedback. Doctoral dissertation, School of Information Studies, Syracuse University.

ODDY, R. N. 1977. "Information Retrieval Through Man-Machine Dialogue." Journal of Documentation, 33(1), 1-14.

PORTER, M. F. 1982. "Implementing a Probabilistic Information Retrieval System." Information Technology: Research and Development, 1(2), 131-156.

PORTER, M. F., and V. GALPIN. 1988. "Relevance Feedback in a Public Access Catalogue for a Research Library: Muscat at the Scott Polar Research Institute." Program, 22(1), 1-20.

RADECKI, T. 1988. "Improvements to Conventional Boolean Retrieval Systems." Information Processing and Management, 24(3), 513-23.

ROBERTSON, S. E., and K. SPARCK JONES. 1976. "Relevance Weighting of Search Terms." Journal of the American Society for Information Science, 27(3), 129-46.

ROBERTSON, S. E., C. L. THOMPSON, M. J. MACASKILL, and J. D. BOVEY. 1986. "Weighting, Ranking and Relevance Feedback in a Front-End System." Journal of Information Science, 12, 71-5.

ROCCHIO, J. J. 1971. "Relevance Feedback in Information Retrieval," in The SMART Retrieval System, ed. G. Salton, pp. 313-23. Englewood Cliffs, N.J.: Prentice Hall.

SALTON, G. 1970. "Evaluation Problems in Interactive Information Retrieval." Information Storage and Retrieval, 6(1), 29-44.

SALTON, G. (Ed.). 1971. The SMART Retrieval System. Englewood Cliffs, N.J.: Prentice Hall.

SALTON, G., and C. BUCKLEY. 1990. "Improving Retrieval Performance by Relevance Feedback." Journal of the American Society for Information Science, 41(4), 288-97.

SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SMEATON, A. F., and C. J. VAN RIJSBERGEN. 1983. "The Retrieval Effects of Query Expansion on a Feedback Document Retrieval System." The Computer Journal, 26(3), 239-46.

SPARCK JONES, K., and E. BARBER. 1971. "What Makes an Automatic Keyword Classification Effective." Journal of the American Society for Information Science, 22(3), 166-75.

SPARCK JONES, K. 1972. "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." Journal of Documentation, 28(1), 11-21.

SPARCK JONES, K. 1979a. "Experiments in Relevance Weighting of Search Terms." Information Processing and Management, 15(3), 133-44.

SPARCK JONES, K. 1979b. "Search Term Relevance Weighting Given Little Relevance Information." Journal of Documentation, 35(1), 30-48.

STANFILL, C., and B. KAHLE. 1986. "Parallel Free-Text Search on the Connection Machine System." Communications of the ACM, 29(12), 1229-39.

VAN RIJSBERGEN, C. J. 1986. "A New Theoretical Framework for Information Retrieval." Paper presented at ACM Conference on Research and Development in Information Retrieval, Pisa, Italy.

VERNIMB, C. O. 1977. "Automatic Query Adjustment in Document Retrieval." Information Processing and Management, 13(6), 339-53.

WONG, S. K. M., and Y. Y. YAO. 1990. "Query Formulation in Linear Retrieval Models." Journal of the American Society for Information Science, 41(5), 324-29.

WU, H., and G. SALTON. 1981. "The Estimation of Term Relevance Weights Using Relevance Feedback." Journal of Documentation, 37(4), 194-214.
CHAPTER 12: BOOLEAN OPERATIONS

Steven Wartik

Software Productivity Consortium

Abstract

This chapter presents an overview of Boolean operations, which are one means of expressing queries in information retrieval systems. The concepts of Boolean operations are introduced, and two implementations based on sets are given. One implementation uses bit vectors; the other, hashing. The relative performance characteristics of the approaches are shown.

12.1 INTRODUCTION

This chapter explores one aspect of how information retrieval systems efficiently process user requests. Information retrieval systems manage documents. These differ from the data stored in traditional database management systems in that they have much less structure. The data are not organized into neat little units with uniform content, where an integer may be expected at a particular location or a string at another; a document has no particular structure save units derived from natural languages, such as words, sentences, or paragraphs. Little may be reliably determined beyond the existence of a document and its boundaries.

This creates a problem for information retrieval systems: queries tend to be vague, especially when compared to database management systems. The latter, with their precise characterization of data, permit a broad range of queries that can exploit known properties of the data. Thus, while a database management system user might issue a query such as:

select Name, Age from Person where Age > 25

the user of an information retrieval system has far less flexibility. He or she sees only a document. The information retrieval system cannot hope to determine what sequence of characters in the documents under consideration represent age. It has no knowledge of document structure and therefore cannot make inferences about portions of a document. Of course, information may be derived from a document (such as inverted indices--see Chapter 3), but such information is invisible to the user of an information retrieval system. Queries on documents in information retrieval systems, then, are limited to operations involving character strings. A typical query might be, "Give me the names of all documents containing sentences with the words 'computer' and 'science'." Despite its seeming simplicity, such a query can be very costly. In a database management system, one might optimize this query by working from sorted data and using search algorithms that capitalize on such order. An information retrieval system, however, cannot reorder a document's text. A search for a word within a document may therefore take time linearly proportional to the size of the document, at best with some knowledge of sentence structure. In reality, the situation can be improved through the use of indices, but the use of a partial index can limit the flexibility of an information retrieval system, and a full index is often prohibitively expensive.

Even if indexing is used, the data sets will be prohibitively large. Information retrieval systems manipulate huge amounts of information, often orders of magnitude more than is found in database management systems. Because the size of the data is often enormous, simple-minded algorithms to process queries will take unacceptable amounts of time and space.

As mentioned, user requests are typically phrased in terms of Boolean operations, and Boolean operations are the key to describing queries. Most systems rely heavily on the ability to perform Boolean operations, and their efficient implementation is essential. Hence, the ability to translate user requests into a form that may be speedily evaluated is of paramount importance. Techniques for doing so form the subject of this chapter. The typical technique is to use sets. This chapter covers how to implement Boolean operations as sets, and also how to efficiently implement set operations. Section 12.2 discusses Boolean operations in more detail, describing precisely their relationship to sets. Section 12.3 presents an abstraction of sets; it emphasizes sets that are expected to contain large numbers of elements. Sections 12.5 and 12.6 present various implementations of this abstraction. The final section analyzes the run-time properties of the implementation approaches.

An information retrieval system accesses some set of documents. The system must be able to uniquely identify each document. In this chapter, it will suffice to have documents named doc1, doc2, . . ., docn. In more sophisticated systems, an addressing table would probably be associated with the names of documents, linking each name to some disk location, but this detail can be safely ignored for now. In any case, it is assumed that the name "doci" provides enough information to locate the information associated with document i.

12.2 BOOLEAN EXPRESSIONS AND THEIR REPRESENTATION AS SETS

Boolean expressions are formed from user queries. In some information retrieval systems, the user enters them directly; in others, the user enters natural language, which the system converts into Boolean expressions. In any event, the Boolean expressions are then evaluated to determine the results of the query. This section will show how Boolean expressions are built from queries, and how those expressions are translated into sets.
Boolean expressions are formed from queries. They represent a request to determine what documents contain (or do not contain) a given set of keywords. By definition, a query searches a set of documents to determine their content. For example:

Find all documents containing "information".

is a query that, when evaluated, should yield a (possibly empty) set of documents, each of which contains the word "information" somewhere within its body. Most of the words in the sentence are noise. The above is therefore usually represented as the Boolean expression:

information

which means, "A set whose elements are the names of all documents containing the pattern "information"."

Some queries attempt to find documents that do not contain a particular pattern, a fact that the corresponding Boolean expression must indicate. This is done using a "not" operator, prefixed to the expression. For example, the query:

Find all documents that do not contain "information".

is represented as the Boolean expression:

not information

Most queries involve searching more than one term. For example, a user might say any of the following:

Find all documents containing "information" and "retrieval".

Find all documents containing "information" or "retrieval" (or both).

Find all documents containing "information" or "retrieval", but not both.

Only documents containing both "information" and "retrieval" will satisfy the first query, whereas the second will be satisfied by a document that contains either of the two words. Documents satisfying the third query are a subset of those that satisfy the second. In fact, those documents in the second query are the union of the documents in the first and third.
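The subset and union relationships among the three queries can be checked directly with set operations. A small Python sketch, using hypothetical document names:

```python
# Documents containing each term (hypothetical names):
d_information = {"doc1", "doc2", "doc3"}
d_retrieval = {"doc2", "doc3", "doc4"}

both = d_information & d_retrieval         # "information" and "retrieval"
either = d_information | d_retrieval       # "information" or "retrieval" (or both)
exactly_one = d_information ^ d_retrieval  # one or the other, but not both

# The second query's result is the union of the first and third:
assert either == both | exactly_one
```

The third result is a subset of the second, and the second is exactly the union of the first and third, as the text observes.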
Each of the above queries illustrates a particular concept that may form a Boolean expression. These concepts are, respectively, conjunction, disjunction, and exclusive disjunction. They are represented in Boolean expressions using the operators and, or, and xor. The queries above would therefore be:

    information and retrieval
    information or retrieval
    information xor retrieval

Boolean expressions may be formed from other Boolean expressions to yield rather complex structures. Consider the following query:

    Find all documents containing "information" and "retrieval",
    or not containing both "retrieval" and "science".

This translates into the following Boolean expression:

    (information and retrieval) or not (retrieval and science)

Parentheses are often helpful to avoid ambiguity.

The reader may be wondering how certain complex queries are handled. For instance, how is "all documents containing 'information' and 'retrieval' in the same sentence" represented as a Boolean expression? This question will be answered in other chapters. The concern of this chapter is not to study techniques for determining if a document contains a pattern. Rather, given sets of documents known to contain the patterns, the issue at hand is how to combine those sets efficiently to answer a larger query. That is, the assumption is that the respective sets are known; other chapters will discuss how to determine which documents contain "information," and which contain "retrieval."

Combining the terms of Boolean expressions is conceptually quite simple. It involves sequences of familiar set operations. Suppose "all documents containing 'information' " yields a set of documents D1, and "all documents containing 'retrieval' " yields some set D2. The information retrieval system must combine these two sets to yield the set D3 that contains only documents with both "information" and "retrieval"; the issue is how to combine those sets to determine what documents might contain both. Each portion of a Boolean expression yields a set of documents. These portions are evaluated separately,
then simplified using techniques that will be discussed in this chapter. Let U represent the names of all documents stored, and let D1 and D2 represent the
names of those documents that contain patterns P1 and P2, respectively. The following list defines how to evaluate Boolean expression operators in terms of the sets:

1. D1 ∩ D2 is the set of all documents containing both P1 and P2 (and).

2. D1 ∪ D2 is the set of all documents containing either P1 or P2, or both (or).

3. D1 ⊕ D2 is the set of all documents containing either P1 or P2, but not both (xor).

4. U − D1 is the set of all documents not containing P1 (not).

The following example illustrates these expressions. Consider a set of five documents, and suppose that they contain the terms shown in Figure 12.1. The expressions given above will now be evaluated using the data in Figure 12.1. For instance, the Boolean expression "information" is the names of all documents containing the term "information":

    {doc1, doc3}

The expression "information and retrieval" is:

    {doc1, doc3} ∩ {doc1, doc2, doc4} = {doc1}

whereas "information or retrieval" is:

    {doc1, doc3} ∪ {doc1, doc2, doc4} = {doc1, doc2, doc3, doc4}

As a more complex example, "(information and retrieval) or not (retrieval and science)" is:

    ({doc1, doc3} ∩ {doc1, doc2, doc4}) ∪ (U − ({doc1, doc2, doc4} ∩ {doc2, doc3, doc4, doc5}))
    = {doc1} ∪ ({doc1, doc2, doc3, doc4, doc5} − {doc2, doc4})
    = {doc1} ∪ {doc1, doc3, doc5}
    = {doc1, doc3, doc5}
Document     Terms
---------------------------------------------------
doc1         algorithm, information, retrieval
doc2         retrieval, science
doc3         algorithm, information, science
doc4         pattern, retrieval, science
doc5         science, algorithm

Figure 12.1: Example documents and their terms

12.3 OPERATIONS ON SETS

Before discussing how to implement sets, it is necessary to define them more precisely. Sets are a familiar concept, and implementations for them may be found in many data structures textbooks. However, no two textbooks seem to have precisely the same operations, and the semantics of the operations vary from book to book (Does an operation modify its inputs? Is it a procedure or a function?). A little time spent discussing the meaning of a set as used in this chapter will avoid confusion.

12.3.1 Set Interface

A set is a homogeneous, unordered collection of elements. In programming language terms, a set has an associated "element data type," and all elements of the set must be of this type. Each element data type has a key: for a given set, no two elements ever simultaneously possess the same key. An element data type may also specify other data that will be included along with the element. These data values need not be unique among all elements of a set. However, if the key constitutes the entire value of a set element, then by definition all values in a set differ.

From the discussion in the previous section, it should be no surprise that the familiar set operations of union, intersection, and difference will be required. However, certain supporting operations will be necessary to provide a satisfactory algorithmic implementation of sets.
The presentation will be done using the information hiding principle advanced by Parnas (1972), wherein a set is presented in two parts--a collection of access programs that define the operations a program may legally perform on a set, and an implementation for those programs.

Figure 12.2 shows the operations that constitute the interface to a set. A few conventions will be used that should be noted now:

1. All operations use pointers to values of type set. In C, the programming language used here, pointer-valued parameters are generally used for parameters that a routine is to modify, or for efficiency. Here, the reason is usually efficiency. An operation does not modify its parameter unless so noted.

2. The existence of a Boolean data type is assumed. This type may be easily declared using the facilities of the C preprocessor:

    #define TRUE  1
    #define FALSE 0

    typedef int boolean;

(In some C implementations, short would be the best choice.)

The first order of business is to declare a data type for a set, and for elements of the set:

    typedef <...> elementType;
    typedef <...> set;

The structure of a set will be defined later, when the implementation is given. To specify a set, one must provide an element data type; this specifies the types of elements that may populate a given instance of a set. Some programming languages, such as Ada, permit programmers to enforce the constraint that a set contain homogeneous data. In C, this constraint is unenforceable. However, only a little care is required to ensure that sets are in fact homogeneous, and creating type-specific operations is usually better practice. The definition for element types will depend on the type of data to be stored in the set. For now,
it suffices to know that such a type is available.

3. The system using the sets will not modify element values once they have been inserted into a set. In C, it is generally possible to use the type "char *" to store data of any type; moreover, one implementation for sets given here cannot easily represent pointer data. This point will be discussed in more detail in each implementation. Consider the following:
    v = "abc";
    Insert(s, v);
    v[1] = 'x';

Unless Insert has stored a copy of the string, the value of v will change inside s, since s will contain a pointer to the string rather than a unique value. This problem might arise in C if a set contained strings. It can be solved easily in two ways:

a. The application will agree not to modify values that are used in sets.

b. The specification of a set can include a function that creates a copy of a datum. The system will call it whenever it must store an element.

Case a turns out to suffice for information retrieval. Since it is simpler, it will be used in this chapter.

4. Unless explicitly stated otherwise, a set variable may not appear twice in a parameter list. That is, neither of the following is legal:

    Copy(s, s);
    Unite(a, b, b);

This restriction is actually not necessary for all routines, but occurs frequently enough to warrant its inclusion.

void Clear(set *s);                /* Make s contain zero elements.  */

void Insert(set *s,                /* Modify s to contain e, if it   */
     elementType e);               /* does not already.              */

void Delete(set *s,                /* Remove e from s.               */
     elementType e);

void Unite(set *s1, set *s2,       /* Make s3 the union of sets      */
     set *s3);                     /* s1 and s2.                     */

void Intersect(set *s1,            /* Make s3 the intersection       */
     set *s2, set *s3);            /* of sets s1 and s2.             */
void Subtract(set *s1,             /* Make s3 the difference         */
     set *s2, set *s3);            /* of sets s1 and s2.             */

boolean Empty(set *s);             /* Return TRUE iff s has          */
                                   /* zero elements.                 */

boolean Member(set *s,             /* Return TRUE iff e is a         */
     elementType e);               /* member of s.                   */

void Copy(set *source,             /* Make "destination" the         */
     set *destination);            /* same set as "source".          */

void Iterate(set *s,               /* Call f(e) once for every       */
     boolean (*f)());              /* element e in set s.            */

boolean Error_Occurred();          /* Return TRUE iff the last       */
                                   /* operation caused an error.     */

Figure 12.2: Set interface operations

Most of the operations are self-explanatory, as they implement familiar set functions. The Iterate operation may require some explanation. Once a Boolean expression has been evaluated, the information retrieval system will need to determine all elements in the set. The system should not depend on a particular implementation of sets, so it must have some other means to access all elements. This is exactly what Iterate provides: the application supplies as a parameter a C routine that accomplishes some desired function, and Iterate will execute this function once for every element in the set, supplying the element to the routine as a parameter. For example, suppose a developer writes the following C procedure:

boolean PrintElement(i)
elementType i;
{
    printf("The value is %d\n", i);
    return TRUE;
}

The following statement would then print all elements of a set S of integers:

    Iterate(S, PrintElement);

No qualifications are made about the order of the elements in the iteration; it is completely random so far as the routine calling Iterate is concerned. The only guarantee is that each element in the set is passed to f() exactly once.

It will be possible to pass parameters to the operations that would, if accepted, result in an erroneous set. Some error-handling mechanism is therefore needed. This mechanism must allow an information retrieval system to detect when an error has occurred. The error handling will be accomplished such that a set will always be valid according to the rules of its implementation, even if an error occurred. Applications may therefore attempt corrective action and try again. The Error_Occurred() routine, which can be used to check the status of an operation, achieves this. Its value is TRUE if an error occurred during the last executed set operation, and FALSE if no error occurred. A system that wished to check the validity of each operation might contain code such as the following:

    Intersect(s1, s2, s3);
    if ( Error_Occurred() ) {
        . . .
    }

12.3.2 Packaging

The interface given in Figure 12.2 can provide an information retrieval system with operations on sets of various types. These operations may be used by many parts of the system. Information retrieval systems are large and consist of many subsystems. The set operations must be packaged such that each subsystem--that is, set of files--needing access to sets may have them in the simplest possible way. Their packaging must make
this expedient. The usual C technique for doing this is to package the interface in an "include file" (see Kernighan and Ritchie [1988]). An implementation of a set would therefore be packaged as follows:

1. An include file, which might be named "set.h". It would contain all of Figure 12.2, plus the type definition and accompanying constants for the set data type. This file contains specifications of the types,
constants, and procedures that make up the set package, but omits any executable code. (The one exception is routines implemented as macros, rather than procedures; their expansion, which includes executable code, appears in the include file. Ideally the definition of the type would not be in the interface. While this is possible, C discourages it, and subsequent analysis will demonstrate that complete independence is not desirable.)

2. A file containing implementations of the routines, which might be named "set.c". Since set.h contains certain type definitions needed in the implementation, it begins with the line:

    #include "set.h"

12.3.3 Example

The routines' use will be illustrated by using them to solve the query given at the end of section 12.2. Figure 12.3 gives the code. Doing so requires the use of six sets--three for the patterns, plus three to hold intermediate values and the results. Note that the code has been given independently from the implementation. The implementations include routines that improve run-time performance by supplying implementation-specific details. The issue of how the sets are initialized is ignored here; that would involve calls to the Insert routine, one for each document to be inserted in a set. Document names are maintained in inverted indices, the subject of Chapter 3.

#include "set.h"

SolveQuery()
{
    set info_docs, retr_docs, sci_docs;
    set t1, t2, t3;

    Clear(&info_docs);
    Clear(&retr_docs);
    Clear(&sci_docs);
    Clear(&t1);
    Clear(&t2);
    Clear(&t3);

    /* Find the documents containing "information", "retrieval" and
       "science", and put their names into info_docs, retr_docs, and
       sci_docs, respectively; put the names of all documents
       into t3 . . . */

    Intersect(&retr_docs, &sci_docs, &t1);  /* retrieval and science       */
    Subtract(&t3, &t1, &t2);                /* not (retrieval and science) */
    Intersect(&info_docs, &retr_docs, &t3); /* information and retrieval   */
                                            /* (t3 may now be reused)      */
    Unite(&t2, &t3, &t1);

    /* Set t1 contains the result of the query. */
}

Figure 12.3: Solving a simple set query

12.4 GENERAL CONSIDERATIONS FOR IMPLEMENTING SETS
Before considering specific techniques for implementing sets, it is worth mentioning the trade-offs one may encounter. Space and time considerations are as applicable in set algorithms as in any other area. Of course, the amount of information on one's system seldom shrinks; indeed, it tends to grow by many orders of magnitude. The ever-increasing amount of knowledge has made gigabyte information bases commonplace, and terabyte systems are not uncommon. At present, large rather than small sets are likely to be the exception rather than the rule; however, they will probably be the rule as global networks proliferate. Every ounce of computing power must therefore be squeezed out of the algorithms that implement Boolean operations. The ones given here attempt to do so. They sometimes sacrifice clarity for speed, but a reader familiar with the C programming language should be able to figure them out without undue effort.

One implementation approach to sets might not be enough. Two implementation strategies are given in this chapter. The two approaches are sufficiently diverse as to be representative of most major trade-offs, and therefore will help the reader understand the benefits of other approaches he or she may come across. Certain external factors may drive which ones are practical. For instance, one technique (bit vectors) is very fast but relies on representing document names as unique integers. Doing so is certainly feasible, but--unless one's documents really are named doc1, doc2, . . .--potentially time-consuming, and practical only for precomputed indices. Document names are maintained in inverted indices; if the indices are sorted, the names can be mapped much more rapidly than if they are unsorted. Then again, sorting the names is an expensive operation. Whether performing the mapping for each set will result in time savings depends on the document names and the hardware available. The desire to use this approach may also drive other design decisions in an information retrieval system; such decisions may influence the architecture of the entire system. This discussion illustrates the complexity that arises from a seemingly simple choice.
Planning for a small, stable information base is unwise, and will lead to unacceptable performance. Furthermore, the sets used in information retrieval can vary greatly in size. Small documents, or ones being queried in the context of a particular domain, may have only a few terms; some documents will have tens of thousands of unique index terms on which users may wish to search. The same information retrieval system may well access both types of documents. Sets for the former may easily be kept in memory and manipulated using fixed-space algorithms; sets for the latter may require secondary memory implementation strategies. The implementor of an information retrieval system should carefully consider the relationship of the two approaches.

This only scratches the surface of possibilities. The technique of balanced trees is worthy of consideration; its performance is, in most respects, intermediate with respect to the approaches in this chapter. See Sedgewick (1990) for an excellent treatment of the topic.

12.5 BIT VECTORS
Bit vectors are a simple approach to implementing sets. They provide an extremely fast implementation that is particularly suited to sets where the element type has a small domain of possible values. The technique takes full advantage of computer hardware to achieve efficiency of operations. The representation is not compact, but may be appropriate if the number of documents is small.

The approach using bit vectors is to represent a set as a sequence of bits, with one bit for each possible element in the domain of the element type. It is assumed that each value in the element's domain maps to a unique bit. If a bit's value is 1, then the element represented by that bit is in the set; if 0, the element is not in the set. These values map to the TRUE and FALSE constants of the Boolean data type, which provides a clearer understanding of the bit's purpose (TRUE means an element is in a set, FALSE means it is not) than do the digits.

Consider an example. Suppose there are 20 documents, and suppose further that they may be uniquely identified by numbering them from 1 to 20, inclusive. The first task is to determine the mapping between the numbers and bits. A logical approach is to have document i map to bit i. However, in C the first bit has index 0, not 1, an assumption made here. Therefore, document i actually maps to bit i - 1. Suppose a search reveals that, of the 20 documents, only documents 3, 8, 11, and 12 contain patterns of interest.
A string of 20 bits will be used to represent sets of these documents. The bit vector set to represent this search result consists of 0's at all positions except those four:

    001000010011000000000000

The alert reader will notice that this string actually consists of 24 bits, not 20. The reason is that computers are not usually capable of storing exactly 20 bits; most store information in multiples of 8 bits. Padding the string with zeros is therefore necessary.

Most computers do not understand bit strings of arbitrary lengths. A program must reference a minimum or maximum number of bits in a given operation--usually corresponding to a byte, word, or the like. It is therefore necessary to arrange the bit string such that it can be treated as a whole, or referenced bit-by-bit. The usual approach is to declare a bit string as a sequence of contiguous storage. Access to individual bits is achieved by first accessing the byte containing the desired bit(s), and then applying a "mask-and-shift" sequence (illustrated below) to recover specific bits. Most computers have instructions to perform and, or, xor, and not operations on bit strings, as appropriate. The C programming language, which provides access to bit-level operations, is well suited to bit vector implementations.

Suppose the system now needs to know whether document i is in the set. There is no C operation that
allows bit i - 1 to be retrieved directly. However, it can be done by extracting the byte containing bit i - 1 and then setting all other bits in that byte to zero. If the resulting byte is nonzero, then bit i - 1 must have been 1; if it is zero, then bit i - 1 must have been 0.

The first step, then, is to compute the byte containing bit i - 1. Since there are 8 bits in a byte, each byte stores information on up to 8 elements of a set, and bit i - 1 is in byte (i - 1)/8 (as with bits, the first byte is indexed by 0, not 1). The next step is to isolate the individual bit. Modulo arithmetic is used to do so: the bit is at location (i - 1) mod 8 within the byte. Shifting the bit to the first position of the byte, and then and'ing the byte with 1, yields a byte that contains 1 if the bit is 1, and 0 if the bit is 0. The complete C expression to access the bit is therefore:

    ( vector[(i-1)/8] >> (i-1)%8 ) & 01

The inverse operation--setting a particular bit, that is, adding an element to a set--is performed in essentially the reverse manner.

An understanding of bit-manipulation concepts makes the implementation of the remaining operations straightforward. For instance, the intersection of two sets as bit vectors is just a loop that uses C's & operation:

    for ( i = 0; i < length(s1); i++ )
        s3[i] = s1[i] & s2[i];

assuming that length(s) returns the number of bytes needed to represent set s.

The drawback of using bit vectors arises when the domain of the element type is large. Representing a set of integers on a 32-bit computer requires bit vectors of length 2^31/8 = 268,435,456 bytes, an amount beyond most computers' capacities. The situation is even worse if the element type is character strings, such as might be used to hold document names. A 10-element character string restricted to upper- and lowercase letters has 52^10 possible values. Clearly, bit strings will be impractical for such situations. Indeed, many compilers that use bit vectors to implement sets restrict the domain to that of the maximum number of bits in a single word--typically 60 or 64. This usually limits their utility to sets of single characters. However, there are enough small domains to warrant the consideration of bit vectors as a set implementation technique.

A set implemented using the above mapping still requires the full amount of space to store the two-element set containing -2^31 and 2^31 - 1. If large bit vectors are impractical, there must be a way to specify the domain of elements of a set. This will require adding a new function to the interface given in Figure 12.2.
The function will create a new set capable of storing elements from a given range of values. (Knowing that the set will only contain a few of these elements at any time does not help: the space required depends on the domain, not on the number of elements present.) In other words, all sets will be treated as sets of integers between lower and upper, inclusive. The set operations will return sets of the same domain as their parameters, and will refuse to operate on sets of two different domains.

Since set operations need to know the domain of elements, a set will need to include information on its elements' bounds. The implementation for sets, then, will contain three pieces of information: the lower and upper bounds of the set, and the bit string defining the set's elements at any time. The following structure presents this information:

#define MAX_ELEMENTS 256
#define WORDSIZE 8

typedef struct {
    int  lower;
    int  upper;
    char bits[MAX_ELEMENTS/WORDSIZE];
} set;

void Create(int lower, int upper, set *s);

This implementation rather arbitrarily restricts the maximum number of elements that a set may contain to 256. The implementation also defines the word size--that is, the number of bits per word in a value of type char--at 8. Most C implementations have 8 bits in a datum of type char, and 16 or 32 in an int. It is worth noting that using an int array might prove faster, since using an int might make the operations that manipulate entire bytes more efficient. The decision will depend on the underlying hardware, and the types of accesses to sets that occur most frequently. Empirical analysis is the best way to resolve the question.

It is possible to provide more flexibility, by making bits a pointer and allocating space for it at run time, but doing so defeats the point of using bit strings for fast access--the system must worry about garbage collection issues. Such flexibility is better achieved using other representations.

The implementation for Create is:

void Create(lower, upper, s)
int lower, upper;
set *s;
{
    if ( lower > upper || (upper - lower) >= MAX_ELEMENTS ) {
        error_occurred = TRUE;
        return;
    }
    s->lower = lower;
    s->upper = upper;
    error_occurred = FALSE;
}

The routine simply initializes the bounds of the set. The first few lines, which deal with errors, bear explanation. As mentioned above, errors will be flagged through the use of a function Error_Occurred. This routine will be implemented as a reference to a "hidden" variable, which would appear in the set's implementation file:

static boolean error_occurred;

Each set operation has the responsibility to set this variable to either TRUE or FALSE. The implementation of Error_Occurred is then:

boolean Error_Occurred()
{
    return error_occurred;
}
Note that Create does not initialize the set's contents; the set is not guaranteed empty. Clearing the contents is the job of Clear:

void Clear(s)
set *s;
{
    register int i, Number_Of_Bytes;

    Number_Of_Bytes = (s->upper - s->lower)/WORDSIZE + 1;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        s->bits[i] = 0;
    error_occurred = FALSE;
}

Since a zero indicates that an element is not in a set, and since s's bits are all zeros, set s contains no elements.

Insertion and deletion of elements are handled by setting the appropriate bit to 1 and 0, respectively. C's bit-oriented operators make this simple. The routines mask the byte containing the appropriate bit with a bit string that turns the bit on or off. The bit string for Insert consists of a string of 0's, with a 1 at the appropriate bit; or'ing this string with a byte will result in a byte unchanged except for (possibly) the bit at the location of the 1 in the bit string. The bit string for Delete consists of a string of 1's, with a 0 at the appropriate bit; and'ing this string with a byte will result in a byte unchanged except for (possibly) the bit at the location of the 0. Figure 12.4 shows the code.

The code for routines that test properties of a set was given above, in simplified form; Figure 12.5 shows the complete C routines. Note that Member is defined to be erroneous if e is not of the correct domain, that is, if it is outside the lower and upper bounds. An alternate approach would be to indicate simply that the element is not in the set, without also flagging an error; however, this weakens the abstraction. Since Member is a function, it must return some value even when an error occurs; it still returns FALSE in that case.

void Insert(s, e)
set *s;
elementType e;
{
    if ( e < s->lower || e > s->upper ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    s->bits[(e - s->lower)/WORDSIZE] |= 01 << ((e - s->lower)%WORDSIZE);
}

void Delete(s, e)
set *s;
elementType e;
{
    if ( e < s->lower || e > s->upper ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    s->bits[(e - s->lower)/WORDSIZE] &= ~(01 << ((e - s->lower)%WORDSIZE));
}

Figure 12.4: Inserting and deleting elements in a bit-vector set
boolean Empty(s)
set *s;
{
    register int i, Number_Of_Bytes;

    Number_Of_Bytes = (s->upper - s->lower)/WORDSIZE + 1;
    error_occurred = FALSE;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        if ( s->bits[i] )
            return FALSE;
    return TRUE;
}

boolean Member(s, e)
set *s;
elementType e;
{
    if ( error_occurred = (e < s->lower || e > s->upper) )
        return FALSE;
    return (s->bits[(e - s->lower)/WORDSIZE] >> (e - s->lower)%WORDSIZE) & 01;
}
Figure 12.5: Empty and Member bit-vector routines

The Intersect, Unite and Subtract routines are next. The operations will be restricted such that all three parameters must have the same bounds. (Extending the operations to accommodate sets with other bounds is not difficult. It is left as an exercise.) This makes the bit-vector approach simple, but is somewhat harder for systems to use. Two sets thus constrained will have the same mapping function, so bit i of one set will represent the same element as bit i of another. This means that entire bytes can be combined using C's bit operators. The equivalence of two sets is tested using the macro:

#define equivalent(s1, s2) \
    ((s1)->lower==(s2)->lower && (s1)->upper==(s2)->upper)

Figures 12.6-12.8 show the code.

void Unite(s1, s2, s3)
set *s1, *s2, *s3;
{
    register int i, Number_Of_Bytes;

    if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        s3->bits[i] = (s1->bits[i] | s2->bits[i]);
}
Applications sometimes need to copy sets. Using bit vectors, this is straightforward. The bits array must be copied byte by byte, since C does not permit assignment of entire arrays:

void Copy(source, destination)
set             *source, *destination;
{
    register int    i;

    if ( ! equivalent(source, destination) ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    for ( i = 0; i < MAX_ELEMENTS/WORDSIZE; i++ )
        destination->bits[i] = source->bits[i];
}

void Intersect(s1, s2, s3)
set             *s1, *s2, *s3;
{
    register int    i, Number_Of_Bytes;

    if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        s3->bits[i] = (s1->bits[i] & s2->bits[i]);
}

Figure 12.7: Intersect bit-vector routine

void Subtract(s1, s2, s3)
set             *s1, *s2, *s3;
{
    register int    i, Number_Of_Bytes;

    if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {
        error_occurred = TRUE;
        return;
    }
    error_occurred = FALSE;
    Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        s3->bits[i] = s1->bits[i] & ~(s2->bits[i]);
}

Figure 12.8: Subtract bit-vector routine

The final operation is iteration. This operation uses C's dereferencing features to execute an application-specified function. The function f receives each element--not the bit, but the actual value. This requires converting the bit position into the value, the inverse mapping from that required for other operations. The function is expected to return a Boolean value. If this value is TRUE, another element will be passed to f; if FALSE, the iteration will cease. This lets applications exert some control over the iteration. Figure 12.9 shows the code.

void Iterate(s, f)
set             *s;
boolean         (*f)();
{
    register int    i, j, Number_Of_Bytes;

    error_occurred = FALSE;
    Number_Of_Bytes = (s->upper - s->lower)/WORDSIZE + 1;
    for ( i = 0; i < Number_Of_Bytes; i++ )
        for ( j = 0; j < WORDSIZE; j++ )
            if ( (s->bits[i] >> j) % 2 )
                if ( ! (*f)(j + i*WORDSIZE + s->lower) )
                    return;
}

Figure 12.9: Bit-vector iterator routine
12.6 HASHING

If a large number of documents must be searched, or if representing their names using integers is not convenient, then bit vectors are unacceptable. This section will show how to use hashing, another common set implementation technique. Hashing is actually useful in implementing many important IR operations, as explained later. Chapter 13 gives an in-depth presentation of hashing; readers not familiar with hashing may wish to study that chapter before reading any more of this section.

The concepts underlying the hashing-based implementation are straightforward. Each set is implemented using a single hash table. An element is inserted in a set by inserting it into the set's hash table. Its presence is verified by examining the appropriate bucket. The approach is not as fast as bit vectors, but it makes far more efficient use of space. Hashing permits sets of arbitrary size, whereas bit vectors explicitly restrict the maximum number of elements. The trade-off is in speed: performance is inversely proportional to a set's cardinality. However, good performance characteristics can be achieved if one has some idea in advance of the average size of the sets that will be manipulated, a quantity that is usually available.

The application must provide some hints to the set operation routines about the nature of the data. The information will be passed as part of the process of creating a set. It will supply the number of buckets and a hashing function ĥ(v), mapping from the domain of the set elements to integers. The implementation of the set, then, will need to maintain this information. This lets applications achieve much better performance.

The implementation will use chaining, not open addressing. It is possible, but risky, to use open addressing: a hash table must be resized when full, which is a difficult operation.

This implementation will be less restrictive about typing than the implementation given for bit vectors. No assumptions will be made about the type of data; in fact, it will be possible to store any type of data--not just integers. There will be some enforcement of set types, however: the application using the hash table will be responsible for assuring that the elements are all homogeneously typed. The results will be unpredictable if this requirement is violated.

A plausible implementation of the set type is:
typedef char        *elementType;

typedef struct be_str {
    elementType         datum;
    struct be_str       *next_datum;
} bucket_element;

typedef struct {
    int                 Number_Of_Buckets;
    int                 (*hashing_function)();
    boolean             (*comparator)();
    bucket_element      **buckets;
} set;

The first type definition gives the data type for set elements. Each datum is of type "char *", a C convention for a variable to contain information of any type; only a pointer is included in the definition. The second definition is used to construct the linked lists that form the buckets. The third definition contains the information needed for a set. Note that its interface is different from the bit-vector approach, reflecting the different information needed to maintain and manipulate the sets. There are no bounds that need to be identical, as in bit vectors; the restrictions on set equivalence can be greatly relaxed for hash sets. The appropriate hashing function must be applied--that is, the hashing function from one table should not be used on another--but adhering to this rule is simple, since the function is included in the structure.

A "Comparator" routine must also be provided. This routine, given two values of type elementType, should return TRUE if they are equal, and FALSE if they are not. Such a routine is needed for hash tables containing strings or dynamically allocated data, since C's "==" operator does not understand such structures.

The creation routine is given in Figure 12.10. Note that the buckets are not preallocated; the array of buckets is allocated when the set is created.
void Create(Number_Of_Buckets, Hashing_Function, Comparator, s)
int         Number_Of_Buckets;
int         (*Hashing_Function)();
boolean     (*Comparator)();
set         *s;
{
    register unsigned int   Bucket_Array_Size;
    register int            i;

    if ( Number_Of_Buckets <= 0 ) {
        error_occurred = TRUE;
        return;
    }
    Bucket_Array_Size = sizeof(bucket_element) * Number_Of_Buckets;
    s->buckets = (bucket_element **)malloc(Bucket_Array_Size);
    if ( error_occurred = (s->buckets == NULL) )
        return;
    s->Number_Of_Buckets = Number_Of_Buckets;
    s->hashing_function = Hashing_Function;
    s->comparator = Comparator;
    for ( i = 0; i < Number_Of_Buckets; i++ )
        s->buckets[i] = (bucket_element *)NULL;
}

Figure 12.10: Hashing version of Create
Create leaves the set in an uninitialized state, needing to be cleared before use. C's global memory allocation routine, malloc, is used to provide space for the buckets. The use of dynamic memory requires some care: it must be freed when no longer needed, lest the application eventually run out of space. This will require the introduction of a "Destroy" operation, whose purpose is opposite of Create. Its implementation is left as an exercise to the reader.

In the bit-vector implementation, Create leaves the set "cleared." That, however, is not feasible in hashing. The reason is that clearing the set involves freeing dynamically allocated space; in C, the presence of such storage is indicated by a non-null pointer. Clear might become confused if Create did not set all buckets to NULL. This does leave Clear able to de-allocate space:

void Clear(s)
set         *s;
{
    register int                i;
    register bucket_element     *b, *next_b;

    for ( i = 0; i < s->Number_Of_Buckets; i++ )
        if ( s->buckets[i] != NULL ) {
            b = s->buckets[i];
            while ( b != NULL ) {
                next_b = b->next_datum;
                free( (char *)b );
                b = next_b;
            }
            s->buckets[i] = NULL;
        }
    error_occurred = FALSE;
}

Note that if the datum component of a node points to dynamically allocated memory, that memory will be lost unless some other node points to it. This situation may be rectified by adding a Free_datum function parameter to Create, an extension left to the reader.

Almost all the operations on sets will use the hashing function. Rather than repeat it in each routine, the following C macro will be used:

#define hash(s,e) (abs((*((s)->hashing_function))(e)) % (s)->Number_Of_Buckets)

Note that the application-supplied hashing function is equivalent to ĥ(v), not h(v); that is, its range is all integers, rather than integers between 0 and the number of buckets. This simplifies dynamic resizing of hash tables (an improvement discussed in Chapter 13).

Insertion and deletion into a hash table involve linked-list operations. All elements that hash to the same bucket will be in the same list. Figures 12.11 and 12.12 contain one implementation. This version of Insert() inserts an element by traversing the list associated with the bucket to which it maps. If the element is not already in the list, it is inserted at the list's head. Insertion at the head is common because, in practice, elements added recently tend to be accessed most often. Data sets for which this does not hold could use an implementation where the element is added at the end of the list.

Deleting an element is done using a standard linked-list technique: the list is searched for the element while a pointer is kept to the previous node. If the element is found, the pointer to the previous node is used to make that node point to the one following the element to be deleted. The storage for the deleted node is then freed.

Testing whether the set is empty involves testing if there are any pointers to elements, done using linked-list traversal. Figure 12.13 contains the code for this. It also contains the implementation of Member.

Constructing the union of two sets implemented using hashing is somewhat more involved than is the operation with bit vectors. No one-to-one correspondence can be established between elements: the respective lists may be ordered differently, since a list's order depends on the order in which elements were inserted. The approach used is to add each element of set s1 to s3, then scan through each element of s2, adding the element to s3 if it is not already there. Figure 12.14 shows the implementation.
void Insert(s, e)
set             *s;
elementType     e;
{
    register bucket_element     *b, *New_Element;
    register int                bucket;

    error_occurred = FALSE;
    bucket = hash(s, e);
    for ( b = s->buckets[bucket]; b != NULL; b = b->next_datum )
        if ( (*(s->comparator))(b->datum, e) )
            return;
    New_Element = (bucket_element *)malloc(sizeof(bucket_element));
    if ( New_Element == NULL ) {
        error_occurred = TRUE;
        return;
    }
    New_Element->datum = e;
    New_Element->next_datum = s->buckets[bucket];
    s->buckets[bucket] = New_Element;
}

Figure 12.11: Inserting elements in a hash table

void Delete(s, e)
set             *s;
elementType     e;
{
    register bucket_element     *b, *previous;
    register int                bucket;

    error_occurred = FALSE;
    bucket = hash(s, e);
    if ( (b = s->buckets[bucket]) == NULL )
        return;
    if ( (*(s->comparator))(b->datum, e) )
        s->buckets[bucket] = b->next_datum;
    else {
        previous = b;
        b = b->next_datum;
        while ( b != NULL ) {
            if ( (*(s->comparator))(b->datum, e) )
                break;
            previous = b;
            b = b->next_datum;
        }
        if ( b == NULL )
            return;
        previous->next_datum = b->next_datum;
    }
    free( (char *)b );
}

Figure 12.12: Deleting elements from a hash table

boolean Empty(s)
set             *s;
{
    register int    i;

    error_occurred = FALSE;
    for ( i = 0; i < s->Number_Of_Buckets; i++ )
        if ( s->buckets[i] != NULL )
            return FALSE;
    return TRUE;
}

boolean Member(s, e)
set             *s;
elementType     e;
{
    register bucket_element     *b;
    register int                bucket;

    error_occurred = FALSE;
    bucket = hash(s, e);
    for ( b = s->buckets[bucket]; b != NULL; b = b->next_datum )
        if ( (*(s->comparator))(b->datum, e) )
            return TRUE;
    return FALSE;
}

Figure 12.13: Empty and Member hashing routines
The concepts behind the implementations of intersection and subtraction are similar. Intersecting involves traversing through s1 and adding to s3 those elements that are also in s2; subtracting works by adding to s3 those elements that are in s1 but not s2. They are shown in Figures 12.15 and 12.16.

void Unite(s1, s2, s3)
set             *s1, *s2, *s3;
{
    register int                i;
    register bucket_element     *b;

    Copy(s1, s3);
    if ( Error_Occurred() )
        return;
    for ( i = 0; i < s2->Number_Of_Buckets; i++ ) {
        if ( s2->buckets[i] == NULL )
            continue;
        for ( b = s2->buckets[i]; b != NULL; b = b->next_datum )
            if ( ! Member(s3, b->datum) ) {
                Insert(s3, b->datum);
                if ( Error_Occurred() )
                    return;
            }
    }
    error_occurred = FALSE;
}

Figure 12.14: Unite hashing routine

void Intersect(s1, s2, s3)
set             *s1, *s2, *s3;
{
    register int                i;
    register bucket_element     *b;

    Clear(s3);
    if ( Error_Occurred() )
        return;
    for ( i = 0; i < s1->Number_Of_Buckets; i++ ) {
        if ( s1->buckets[i] == NULL )
            continue;
        for ( b = s1->buckets[i]; b != NULL; b = b->next_datum )
            if ( Member(s2, b->datum) ) {
                Insert(s3, b->datum);
                if ( Error_Occurred() )
                    return;
            }
    }
    error_occurred = FALSE;
}

Figure 12.15: Intersect hashing routine

void Subtract(s1, s2, s3)
set             *s1, *s2, *s3;
{
    register int                i;
    register bucket_element     *b;

    Clear(s3);
    if ( Error_Occurred() )
        return;
    for ( i = 0; i < s1->Number_Of_Buckets; i++ ) {
        if ( s1->buckets[i] == NULL )
            continue;
        for ( b = s1->buckets[i]; b != NULL; b = b->next_datum )
            if ( ! Member(s2, b->datum) ) {
                Insert(s3, b->datum);
                if ( Error_Occurred() )
                    return;
            }
    }
    error_occurred = FALSE;
}

Figure 12.16: Subtract hashing routine

Copying a hash table involves copying the bucket array, plus all lists in the buckets. The routine does not simply copy the lists, since the two tables may not have identical bucket array lengths or hash functions; the data from one must be rehashed into the other. Even if the tables are equivalent in these respects, they will not be identical after the copying operation: the order of elements within the buckets is reversed. This will only show up during iteration; since the order of elements during iteration is not defined, the difference is irrelevant. The code to do so is in Figure 12.17.

The final operation is iteration. The technique to be used is similar to that of Copy: traverse through each list in each bucket, passing each datum to the function provided to Iterate. Unlike bit vectors, this traversal will almost certainly not yield the elements in any particular order. The order will depend on both the hashing function and the order of insertion. It is shown in Figure 12.18.

12.7 ANALYSIS

To aid the reader in understanding the relative merits of the two implementations, this section presents an analysis of the time and space required for the two set implementation techniques. The analysis for each is straightforward. The operations of most concern are Insert, Unite, Intersect, and Subtract. For the sake of completeness, however, all routines will be analyzed, since it is presumed that most applications will spend the majority of their time manipulating sets rather than creating them.
void Copy(source, destination)
set             *source, *destination;
{
    register int                i, h;
    register bucket_element     *b, *e;

    Clear(destination);
    for ( i = 0; i < source->Number_Of_Buckets; i++ ) {
        if ( source->buckets[i] == NULL )
            continue;
        for ( b = source->buckets[i]; b != NULL; b = b->next_datum ) {
            e = (bucket_element *)malloc(sizeof(bucket_element));
            if ( e == NULL ) {
                error_occurred = TRUE;
                return;
            }
            h = hash(destination, b->datum);
            e->datum = b->datum;
            e->next_datum = destination->buckets[h];
            destination->buckets[h] = e;
        }
    }
    error_occurred = FALSE;
}

Figure 12.17: Copying a hash table

void Iterate(s, f)
set             *s;
boolean         (*f)();
{
    register int                i;
    register bucket_element     *b;

    error_occurred = FALSE;
    for ( i = 0; i < s->Number_Of_Buckets; i++ )
        for ( b = s->buckets[i]; b != NULL; b = b->next_datum )
            if ( ! (*f)(b->datum) )
                return;
}

Figure 12.18: Hashing iterator routine

For bit vectors, the insertion and deletion operations are O(C); that is, their running time is constant, no matter how many keys are in the table. The exact time will depend on the speed of the division operation on the underlying hardware; the division is the fundamental bottleneck in the implementation (bit operations such as &= and |= are usually much quicker).
Membership tests are also constant. The other operations, which iterate across the set, have running time that depends on the set domain. Assuming this quantity to be N, they require O(N) steps. The actual value will be closer to N/WORDSIZE, since the operations are able to access entire words rather than individual bits. The exception is Iterate(), which must scan every bit.

The behavior of hashing algorithms is not quite so predictable. It depends on the randomness of the hashing function on the data being stored. Worst-case and expected behavior are not difficult to derive, however. Worst-case will be presented first. At worst, all elements will be in a single bucket. The hashing function is usually chosen to be O(C) (nonlinear functions defeat the advantages of hashing). The time required once the hashing function is computed will be proportional to the number of elements in the bucket; thus, insertion, deletion, and membership tests all require O(N), where N is the number of elements in the set.

The union, intersection, and difference routines have significantly poorer worst-case behavior. Consider the Unite operation on two sets that each have all elements in a single bucket. The Member() test requires O(N) steps for a single invocation. Since the inner for-loop iterates through N elements, the complexity of the loop is O(N^2). This logic also applies to Intersect and Subtract.

The expected running time is significantly better. Assuming that the hashing function is "reasonably" random, and if the number of elements N is less than the number of buckets B, then the expected number of elements in any bucket is less than one. If the elements are distributed randomly, then insertion, deletion, and membership testing would be constant, and the expected time for uniting, intersecting, or subtracting two sets would be O(N). Of course, N cannot easily be predicted in advance, so N may very well exceed B. If so, the number of buckets can also be increased at any time to exceed N, lowering the expected complexity to O(C) for insertion, deletion, and membership testing, and making it closer to O(N) for union, intersection, and difference. Even so, N must reach B^2 before the performance degrades to O(N^2) for union, intersection, and difference. Table 12.1 summarizes this data for both approaches.

Table 12.1: Relative Algorithmic Complexity for Different Set Implementations

                                    Hashing                  Bit Vectors
                              Worst-Case   Average      Worst-Case   Average
-----------------------------------------------------------------------------
Insert/Delete/Member            O(N)        O(C)          O(C)        O(C)
Unite/Intersect/Subtract        O(N^2)      O(N)          O(W)        O(W)
Create/Empty                    O(B)        O(B)          O(W)        O(W)
Copy                            O(N)        O(N)          O(W)        O(W)

N = number of elements
C = constant
B = number of buckets in hash table
W = number of words in bit string

The space requirements for bit vectors are somewhat worse. In the implementation given here, MAX_ELEMENTS/WORDSIZE + 2 sizeof(int) bytes will be needed to store any bit-vector set. Since the number of elements in a set is usually very small compared to the number of elements in a domain, this can easily be rather large.

Hash tables require considerably less space than bit vectors: the size is determined more by the number of elements than the size of the domain. This is likely to be a great improvement over bit vectors. The exact formula for the space depends on the number of buckets as well. Assuming that P is the space required to store a pointer, and that each element requires E units of space, a hash table of N elements requires BP + NP + NE units.

Table 12.2 contrasts the space requirements for bit vectors and hash tables. The sets are representing integers; it is assumed that the minimum possible size will be used (i.e., int, short, or long). Bytes are assumed to contain 8 bits, pointers 16. The hash table has 503 buckets, implying that any hash table requires at least 1,006 bytes. The tables show how the advantages of bit vectors rapidly diminish in proportion to the size of the number of elements in the domain.

Table 12.2: Relative Space Requirements for Different Set Implementations

                 Bit Vectors                     Hash Tables
Domain        Number of Elements             Number of Elements
Size          64       1024     2^15         64       1024      2^15
-----------------------------------------------------------------------
256           32       ---      ---          1,198    4,078     99,310
2^15          4,096    4,096    4,096        1,262    5,102     132,078
2^31          2^28     2^28     2^28         1,390    7,150     197,614

REFERENCES

KERNIGHAN, B. W., and D. M. RITCHIE. 1988. The C Programming Language, 2nd ed. Englewood Cliffs, N.J.: Prentice Hall.

PARNAS, D. L. 1972. "On the Criteria to be Used in Decomposing Systems into Modules." Communications of the ACM, 15(12), 1053-58.

SEDGEWICK, R. 1990. Algorithms in C. Reading, Mass.: Addison-Wesley.
CHAPTER 13: HASHING ALGORITHMS

Steven Wartik, Software Productivity Consortium

Edward Fox, Lenwood Heath, and Qi-Fan Chen, Virginia Polytechnic Institute and State University

Abstract

This chapter discusses hashing, an information storage and retrieval technique useful for implementing many of the other structures in this book. The concepts underlying hashing are presented, along with two implementation strategies. The chapter also contains an extensive discussion of perfect hashing, an important optimization in information retrieval, and an O(n) algorithm to find minimal perfect hash functions for a set of keys.

13.1 INTRODUCTION

Accessing information based on a key is a central operation in information retrieval: an information retrieval system must determine, given a particular value, the location (or locations) of information that has been decided to be relevant to that value. Other chapters in this book have addressed the issue. Chapters 3-6, in particular, have dealt with useful techniques for organizing files for efficient access. However, these chapters have been concerned with file-level concerns; they have not covered the fundamental underlying algorithms for organizing the indices to these files. This chapter discusses hashing, a ubiquitous information retrieval strategy for providing efficient access to information based on a key, and how it can be implemented.

Under many conditions, hashing is effective both in time and space. Information can usually be accessed in constant time (although the worst-case performance can be quite poor), and space use is not optimal but is at least acceptable for most circumstances. Hashing does have some drawbacks. They can be summarized by mentioning that it pays to know something about the data being stored; lack of such knowledge can lead to large performance fluctuations. Relevant factors include some knowledge of the domain (English prose vs. technical text, for instance), information regarding the number of keys that will be stored, and stability of data. If these factors can be predicted with some reliability--and in an information retrieval system, they usually can--then hashing can be used to advantage over most other types of retrieval algorithms.

The material in this chapter stresses the most important concepts of hashing. Some important theory for hashing is presented, but without derivation; Knuth (1973) provides a fuller treatment, and also gives a history and bibliography of many of the concepts in this chapter.

13.2 CONCEPTS OF HASHING
The problem at hand is to define and implement a mapping from a domain of keys to a domain of locations. The domain of keys can be any data type--strings, integers, and so on. The domain of locations is usually the m integers between 0 and m - 1, inclusive. Assume the domain of keys has N possible values, and let n be the number of keys actually stored. The mapping between these two domains should be both quick to compute and compact to represent.

Consider the problem first from the performance standpoint. Defining a quick mapping is easy; it requires only a little knowledge of the representation of the key domain. For example, if keys are consecutive integers in the range (N1, N2), then m = N2 - N1 + 1, and the mapping for a key k is k - N1. If keys are two-character strings of lowercase letters, then m = 26^2, and the mapping (using C character manipulation for ASCII) is 26 * (k[0] - 'a') + (k[1] - 'a'). These two mappings can be computed in constant time, and are therefore ideal from a performance standpoint. If no keys collide, then locating the information associated with a key is simply the process of determining the key's location. The best performance is therefore achieved by having N = m, that is, by using a 1:1 mapping between keys and locations.

Now consider compactness. The mapping that is best suited to performance is wasteful of space. The problem is that the mapping is defined over all N values, that is, over all possible keys in a very large domain. If keys are drawn from the set of strings of letters up to 10 characters long--too short to be of use in information retrieval systems--then N exceeds 26^10. An implementation of a mapping with m = N is impossible: no existing computer contains enough memory. In practice, no application ever stores all keys in a domain simultaneously unless the size of the domain is small, so only a few of the many possible keys are in use at any time. From the standpoint of compactness, the ideal representation would have m = 1. All keys collide, since the number of values exceeds the number of locations in which they can be stored.

The concept underlying hashing is to choose m to be approximately the maximum expected value of n. (The exact value depends on the implementation technique.) Since n is small relative to N, n is usually sufficiently small that memory consumption is not excessive. A collision occurs when two or more keys map to the same location. Collisions are always possible whenever N > m; in theory many collisions are possible, and indeed are expected in practice. Collisions degrade performance: whenever a collision occurs, some extra computation is necessary to further determine a unique location for a key, and a strategy must exist to resolve collisions. Any such strategy will require extra computation. The goal is to avoid collisions; that is, an attempt is made to choose a mapping that randomly spreads keys throughout the locations, lowering the probability of collision.

The mapping involved in hashing thus has two facets of performance: number of collisions and amount of unused storage. Optimization of one occurs at the expense of the other. The approach behind hashing is to optimize both, that is, to tune both simultaneously so as to achieve a reasonably low number of collisions together with a reasonably small amount of unused space.
as sections 13. Each bucket is indexed by an integer between 0 and m . and 64 have been inserted. how to choose a hash function. and whose range is between 0 and m . meaning that it corresponds to no key in use.) The mapping between a key and a bucket is called the hash function.3 discusses hash functions. Retrieving information requires using the hash function to compute the bucket that contains it. (Whether "not empty" is equivalent to "full" depends on the implementation. Section 13. The space required to store the elements is that required for an array of m elements. A hash table is best thought of as an array of m locations. making performance hard to predict. This is a function whose domain is that of the keys. Section 13. that information has been placed in it. Mapping a key to a bucket is quick. however. 1 is stored in bucket 1 because 1 mod 5 = 1. Even with fewer than m keys. Figure 13. where k is an integer.Information Retrieval: CHAPTER 13: HASHING ALGORITHMS The information to be retrieved is stored in a hash table. or not empty.. The ideal function.4. suppose a hash table is to be implemented with 5 buckets. Retrieval times become nonuniform. rather than resolving.. choosing a hash function that distributes keys uniformly is sufficiently difficult that collisions can be expected to start occurring long before the hash table fills up. inclusive. Figure 13. and a hash function h(k) = k mod 5.4 presents two implementation techniques and the collision resolution strategies they employ. and so on. would distribute all elements across the buckets such that no collisions ever occurred. Indeed. and can be calculated in constant time. a technique for avoiding. collisions. This scheme works well as long as there are no collisions. what to do when a collision occurs. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. called buckets. In practice. Typically.1 shows the hash table after the keys 1. 
Figure 13.1: Visual illustration of a hash table

13.3 HASHING FUNCTIONS

The choice of hash function is extremely important. As discussed in the previous section, the hash function is used to map keys into the table, and the time to store and retrieve data is proportional to the time to compute the hash function; a function that can be calculated quickly is therefore desirable for performance reasons. Recall also that once a bucket is not empty--meaning that it does correspond to a key in use--subsequent attempts to place information in it based on another key will result in a collision. Unless m is very large, the space occupied by the table itself should not be a problem.
Perfect hash functions are extremely hard to achieve, and are generally only possible in specialized circumstances, such as when the set of elements to hash is known in advance. (Another such circumstance was illustrated in Chapter 12: the bit-vector implementation of a set illustrates a perfect hash function, since every possible value has a unique bucket.) In practice, collisions must be expected, and a hash table is still only as good as its hash function.

The most important consideration in choosing a hash function is the type of data to be hashed. A hash function's range must be an integer between 0 and m - 1, where m is the number of buckets. This suggests that the hash function should be of the form:

h(v) = f(v) mod m

since modulo arithmetic yields a value in the desired range. The function f(v) should account for nuances in the data. Data sets are often biased in some way, and a failure to account for this bias can ruin the effectiveness of the hash table. For example, suppose a hash table is to be used to store last names. A 26-bucket hash table using a hash function that maps the first letter of the last name to the buckets would be a poor choice: names are much more likely to begin with certain letters than others (very few last names begin with "X"). As another example, suppose the hash table in Figure 13.1 is storing integers that are all powers of two. Then h(k) = k mod 5 is a poor choice: no key will ever map to bucket 0.

Bucket quantity must be considered carefully. Figure 13.1 shows that making m a prime number is not always desirable, though in general the more buckets a hash table contains, the better the performance is likely to be. Knuth (1973) suggests using as the value for m a prime number such that r^k ≠ ±a (mod m) for small values of k and a, where r is the radix of the character representation. Empirical evidence has found that this choice gives good performance in many situations. Even with a large number of buckets, however, a hash table is still only as good as its hash function.

If v is not an integer, then the internal integer representation of its bits is used. For example, f(v) might be the integer value of the first byte; f(v) for a nonempty character string might be the ASCII character-code value of the last character of the string. Such simple-minded schemes are usually not acceptable. The problems of using the first character of a string have already been discussed; the last character is better, but not much so. In many computers, the first byte of a floating-point word is the exponent, so using it for a four-byte floating-point number would hash all numbers of the same magnitude to the same bucket. It is usually better to treat v as a sequence of bytes and do one of the following for f(v):

1. Sum or multiply all the bytes, or the first portion thereof. Overflow can be ignored.

2. Use the last (or middle) byte instead of the first.

3. Use the square of a few of the middle bytes.

The last method, called the "mid-square" method, can be computed quickly and generally produces satisfactory results.
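The three byte-oriented choices for f(v) might look as follows in C, assuming v is a character string (an illustrative sketch; all function names are invented):

```c
#include <assert.h>
#include <string.h>

/* 1. Sum all the bytes; overflow, if any, is simply ignored. */
unsigned f_sum(const char *v)
{
    unsigned s = 0;
    while (*v) s += (unsigned char)*v++;
    return s;
}

/* 2. Use the last byte instead of the first. */
unsigned f_last(const char *v)
{
    size_t n = strlen(v);
    return n ? (unsigned char)v[n - 1] : 0;
}

/* 3. "Mid-square": square a value built from a middle byte. */
unsigned f_midsquare(const char *v)
{
    size_t n = strlen(v);
    unsigned mid = (unsigned char)v[n / 2];
    return mid * mid;
}

/* The hash function is then h(v) = f(v) mod m. */
unsigned hash_string(const char *v, unsigned m) { return f_sum(v) % m; }
```

Any of the three f(v) variants can be plugged into the final modulo step; which one distributes best depends on the bias of the particular key set.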
13.4 IMPLEMENTING HASHING

Once the hash function and table size have been determined, it remains to choose the scheme by which they shall be realized. All techniques use the same general approach: the hash table is implemented as an array of m buckets, numbered from 0 to m - 1. The following operations are usually provided by an implementation of hashing:

1. Initialization: indicate that the hash table contains no elements.

2. Insertion: insert information, indexed by a key k, into a hash table. If the hash table already contains k, then it cannot be inserted. (Some implementations do allow such insertion, to permit replacing existing information.)

3. Retrieval: given a key k, retrieve the information associated with it.

4. Deletion: remove the information associated with key k from a hash table. New information indexed by k may subsequently be placed in the table.

Compare this definition with the one given in Chapter 12. There, the "information" associated with a key is simply the presence or absence of that key. This chapter considers the more general case, where other information (e.g., document names) is associated with a key.

Figure 13.2 shows a C-language interface that realizes the above operations. The routines all operate on three data types: hashTable, key, and information. The data types of the key and the information are application-dependent; the hash function, of course, must be targeted toward the key's data type. The definition of the hashTable type will be left to the following sections, since particular implementation techniques require different structures.

void Initialize(hashTable *h);        /* Make h empty. */

void Insert(hashTable *h, key k,      /* Insert i into h, keyed */
            information i);           /* by k. */

void Delete(hashTable *h, key k);     /* Delete from h the */
                                      /* information associated */
                                      /* with k. */
information Retrieve(hashTable *h,    /* Retrieve from h the */
                     key k);          /* information associated */
                                      /* with k. */

#define OKAY          0               /* These values are returned */
#define DUPLICATE_KEY 1               /* by op_status(), indicating */
#define NO_SUCH_KEY   2               /* the result of the operation */
#define TABLE_FULL    3               /* last executed. */

int op_status();

Figure 13.2: Interface to a hashing implementation

13.4.1 Chained Hashing

The first implementation strategy is called chained hashing. It is so named because each bucket stores a linked list--that is, a chain--of key-information pairs. In this scheme, a bucket may have any number of elements, rather than a single one, so a hash table with m buckets may store more than m keys.

The solution to a collision, then, is straightforward. If a key maps to a bucket containing no information, it is placed in the list associated with that bucket. If a key maps to a bucket that already has information, it is placed at the head of the list for that bucket. (Usually it is placed at the head of the list on the assumption that information recently stored is likely to be accessed sooner than older information; this is also simplest to implement.) Figure 13.3 shows how the hash table from Figure 13.1 would appear if it were a chained table and all powers of two from 1 to 64 had been stored in it.

Figure 13.3: Visual illustration of a chained hash table

Computing the bucket in which a key resides is still fast--a matter of evaluating the hash function--but locating the key within that bucket (or simply determining its presence, which is necessary in all operations) requires traversing the linked list. In the best case (where all chains are of equal length), the time will be proportional to n/m, where n is the number of keys stored. In the worst case (where all n keys map to a single location), the average time to locate an element will be proportional to n/2. Usually, most chains' lengths will be close to the ratio n/m, and the average search cost is expected to hover around that value; if n grows in proportion to m, the expected time is still constant. Performance will degrade as the number of keys increases, however, so it is best not to use hashing when n is expected to greatly exceed m.
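A minimal chained table for integer keys, independent of the Chapter 12 routines, might look like this (an illustrative sketch; all names are invented, and allocation-failure handling is omitted):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define M 5

/* Each bucket holds the head of a chain of key-information pairs. */
typedef struct node {
    int          k;
    const char  *info;
    struct node *next;
} node;

static node *buckets[M];   /* NULL heads: all chains start empty */

/* New pairs go at the head of the chain, as described above. */
void chain_insert(int k, const char *info)
{
    int   b = k % M;
    node *e = malloc(sizeof *e);
    e->k    = k;
    e->info = info;
    e->next = buckets[b];
    buckets[b] = e;
}

/* Walk the chain for k's bucket; NULL means the key is absent. */
const char *chain_retrieve(int k)
{
    node *e;
    for (e = buckets[k % M]; e != NULL; e = e->next)
        if (e->k == k)
            return e->info;
    return NULL;
}
```

Note that 2 and 32 collide (both map to bucket 2) yet both remain retrievable, because the bucket holds a chain rather than a single pair.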
The hashing-based implementation of Boolean operations in Chapter 12 uses chains. Most of the code from the routines Insert, Delete, Clear (for Initialize), and Member (for Retrieve) can be used directly. Since the code is so similar, it will not be repeated here; instead, a sketch is given of the differences.

The main difference from the routines in Chapter 12 lies in the need to associate the information with its key in a bucket: each bucket has a field for information as well as for the key. This is easily illustrated by considering the new data types needed. As in Chapter 12, the specifics of information and key are application-dependent. The hash table is declared as follows:

typedef . . . key;
typedef . . . information;

typedef struct be_str {
    key            k;
    information    i;
    struct be_str *next_datum;
} bucket_element;

typedef struct {
    int              Number_Of_Buckets;
    bucket_element **buckets;
    int            (*hashing_function)();
    boolean        (*comparator)();
} hashTable;

Chapter 12 also discussed the need for a routine to create the table; that routine had as parameters the hash function and an element-comparison routine, which shows how this information may be associated with the table. The line in Insert:

New_Element->datum = e;

is replaced with:
New_Element->k = k;
New_Element->i = i;

The other routines differ analogously.

13.4.2 Open Addressing

Sometimes the maximum number of keys is known. This might happen, for instance, when a collection of data is expected to remain static for a long period of time (as is often true in information retrieval): the keys associated with the data can be determined at the time the data is stored. In such cases, the likelihood that the number of keys will exceed the number of buckets is quite low, and the flexibility offered by chained hashing is wasted. A technique tailored to the amount of storage expected to be used is preferable. This is the philosophy behind open addressing.

In open addressing, the hash table is implemented as an array of buckets, each being a structure holding a key, associated information, and an indication of whether the bucket is empty. Each bucket in the hash table can contain a single key-information pair. (That is, a bucket that is not empty is "full.") Information is stored in the table by computing the hash function value of its key. If the bucket indexed by that value is empty, the key and information are stored there. If the bucket is not empty, another bucket is (deterministically) selected; this process, called probing, repeats until an empty bucket is found.

Many strategies to select another bucket exist. The simplest is linear probing: when a collision is detected at bucket b, bucket b + 1 is "probed" to see if it is empty. If b + 1 = m, then bucket 0 is tried. In other words, buckets are sequentially searched until an empty bucket is found.

For example, consider Figure 13.4(a), which shows a 10-bucket hash table containing the powers of two between 1 and 16, where h(v) = v mod 10. Suppose the keys 32, 64, 128, 256, and 512 are now inserted. Key 32 collides with key 2, so it is inserted in bucket 3. Key 64 collides with key 4, so it is inserted in bucket 5. By the time key 512 is inserted, the table is as in Figure 13.4(b), with all buckets except 0 full. Inserting key 512 necessitates the sequence of probes shown in Figure 13.4(c). Worse, this sequence is duplicated each time 512 is retrieved. Retrieving all other elements will take little time, requiring at most one extra probe, but retrieving 512 will be expensive.

This illustrates a characteristic of linear probing known as clustering: the tendency of keys to cluster around a single location. Clustering greatly increases the average time to locate keys in a hash table. It has been shown (see Sedgewick [1990], for instance) that, for a table less than two-thirds full, linear probing uses an average of less than five probes, but for tables that are nearly full, the average number of probes becomes proportional to m.

Figure 13.4: Clustering in a hash table
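The clustering example can be reproduced with a few lines of C (an illustrative sketch, not the chapter's code; names are invented, and 0 is used as the empty-bucket marker since all keys here are nonzero):

```c
#include <assert.h>

#define M 10

static int table[M];   /* 0 marks an empty bucket */

/* Insert key v by linear probing; returns the number of buckets
   examined, so the cost of each insertion can be observed. */
int insert_linear(int v)
{
    int b = v % M, probes = 1;
    while (table[b] != 0) {    /* collision: try the next bucket */
        b = (b + 1) % M;       /* wrap from bucket m - 1 back to 0 */
        probes++;
    }
    table[b] = v;
    return probes;
}
```

Inserting 1, 2, 4, 8, 16 and then 32, 64, 128, 256 fills every bucket except 0; the final key, 512, then hashes to the occupied bucket 2 and must walk the whole cluster before landing in bucket 0.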
The problem is that hashing is based on the assumption that most keys won't map to the same address. Linear probing works against this assumption: it resolves collisions by clustering keys around the address to which they map, ruining the suitability of nearby buckets. A better scheme is to redistribute colliding keys randomly. This will not prevent collisions--as with linear probing, another probe will be tried if the new bucket is also full--but it will lessen the number of collisions due to clustering.

The redistribution can be done using some function h2(b) that, when applied to a bucket index b, yields another bucket index. (Note that linear probing does this: it is simply h2(b) = (b + 1) mod m.) What is desirable is a more random redistribution, the original hash function having failed. This introduces some constraints on h2. Ideally, successive applications of h2 should examine every bucket. This will happen if h2(b) = (b + i) mod m, where i is not a divisor of m. For example, h2(b) = b + m would be a poor choice: it would always try the same bucket. Similarly, h2(b) = b + m/i, where m is a multiple of i, would also be bad: it can only try m/i of the buckets in the table.

A commonly used scheme employs a formula called quadratic probing. This uses a sequence of probes of the form h + i, where i takes the values 1, 4, 9, . . . This is not guaranteed to probe every bucket, but the results are usually quite satisfactory.

As mentioned above, open addressing is best when the maximum number of keys to be stored does not exceed m. It is possible, however, to fill a hash table--that is, to place it in a state where no more information can be stored in it. There are strategies for coping when this situation arises unexpectedly. The hash table may be redefined to contain more buckets. This extension is straightforward, but it can degrade performance, since the hash function might have been tuned to the original bucket size.

The data structures for implementing open addressing are simpler than for chained hashing, because all that is needed is an array. A suitable definition for a hash table is:

typedef struct {
    key         k;
    information i;
    bool        empty;
} bucket;

typedef struct {
    int     number_of_buckets;
    bucket *buckets;
    int   (*hashing_function)();
    int   (*comparator)();
} hashTable;

Each element of the table is a structure, not a pointer, so using a null pointer to indicate whether a bucket is empty or full no longer works. Hence, a field empty has been added to each bucket. (If the key is itself a pointer type, then the key field can be used to store this information; the implementation here considers the more general case.) Except for probing, the algorithms are also simpler, since no linked list manipulation is needed. Figures 13.5-13.8 show how it is done. As before, the macro:

#define hash(h, k) ((*((h)->hashing_function))(k) % (h)->number_of_buckets)

is defined to abstract the computation of a bucket for a key.

void Initialize(h)
hashTable *h;
{
    register int i;

    status = OKAY;
    for ( i = 0; i < h->number_of_buckets; i++ )
        h->buckets[i].empty = TRUE;
}

Figure 13.5: Initializing an open-addressed hash table

These routines all behave similarly: they attempt to locate either an empty bucket (for Insert) or a bucket with a key of interest. This is done through some probing sequence. The basic operations of probing are initialization (setting up to begin the probe), determining if all spots have been probed, and determining the next bucket to probe. Figure 13.9 shows how quadratic probing could be implemented. The probe is considered to have failed when (m + 1)/2 buckets have been tried; in practice, this value has been shown to give good results.
void Insert(h, k, i)
hashTable   *h;
key          k;
information  i;
{
    register int b;

    b = hash(h, k);
    status = OKAY;
    Initialize_Probe(h, b);
    while ( ! h->buckets[b].empty && ! Probe_Exhausted(h) ) {
        if ( (*h->comparator)(k, h->buckets[b].k) ) {
            status = DUPLICATE_KEY;
            return;
        }
        b = probe(h);
    }
    if ( h->buckets[b].empty ) {
        h->buckets[b].i = i;
        h->buckets[b].k = k;
        h->buckets[b].empty = FALSE;
    }
    else
        status = TABLE_FULL;
}

Figure 13.6: Inserting into an open-addressed hash table

void Delete(h, k)
hashTable *h;
key        k;
{
    register int b;

    status = OKAY;
    b = hash(h, k);
    Initialize_Probe(h, b);
    while ( ! h->buckets[b].empty && ! Probe_Exhausted(h) ) {
        if ( (*h->comparator)(k, h->buckets[b].k) ) {
            h->buckets[b].empty = TRUE;
            return;
        }
        b = probe(h);
    }
    status = NO_SUCH_KEY;
}

Figure 13.7: Deleting from an open-addressed hash table
information Retrieve(h, k)
hashTable *h;
key        k;
{
    register int b;

    status = OKAY;
    b = hash(h, k);
    Initialize_Probe(h, b);
    while ( ! h->buckets[b].empty && ! Probe_Exhausted(h) ) {
        if ( (*h->comparator)(k, h->buckets[b].k) )
            return h->buckets[b].i;
        b = probe(h);
    }
    status = NO_SUCH_KEY;
    return h->buckets[0].i;    /* Return a dummy value. */
}

Figure 13.8: Retrieving from an open-addressed hash table

static int number_of_probes;
static int last_bucket;
void Initialize_Probe(h, starting_bucket)
hashTable *h;
int        starting_bucket;
{
    number_of_probes = 1;
    last_bucket = starting_bucket;
}

int probe(h)
hashTable *h;
{
    number_of_probes++;
    last_bucket = (last_bucket + number_of_probes*number_of_probes)
                      % h->number_of_buckets;
    return last_bucket;
}

bool Probe_Exhausted(h)
hashTable *h;
{
    return (number_of_probes >= (h->number_of_buckets+1)/2);
}

Figure 13.9: Quadratic probing implementation
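Putting the pieces together, a self-contained sketch of insertion with quadratic probing might look like this (an illustration, not the chapter's interface; the identifiers are invented, but the probe recurrence and the give-up rule of (m + 1)/2 probes mirror Figure 13.9):

```c
#include <assert.h>

#define M     11     /* a prime number of buckets */
#define EMPTY (-1)

static int table[M];

void init(void)
{
    int i;
    for (i = 0; i < M; i++) table[i] = EMPTY;
}

/* Insert k using the quadratic probe sequence; gives up after
   (M + 1)/2 probes. Returns the bucket used, or -1 on failure. */
int insert_quadratic(int k)
{
    int b = k % M, probes = 1;
    while (table[b] != EMPTY && probes < (M + 1) / 2) {
        probes++;
        b = (b + probes * probes) % M;   /* same recurrence as probe() */
    }
    if (table[b] != EMPTY)
        return -1;                       /* the table appears full */
    table[b] = k;
    return b;
}
```

Keys 5, 16, and 27 all hash to bucket 5, but the successive probes scatter them (to buckets 5, 9, and 7) instead of piling them into adjacent buckets as linear probing would.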
13.5 MINIMAL PERFECT HASH FUNCTIONS
Section 13.3 mentioned that the ideal hash function would avoid collisions by mapping all keys to distinct locations. This is termed a perfect hash function. A perfect hash function is ideal from a performance
standpoint in that the time to locate the bucket corresponding to a particular key is always the time needed to compute the hash function. This predictability improves the ability to precisely infer performance characteristics. Perfect hash functions are possible, but generally only when the set of keys to be hashed is known at the time the function is derived. Best of all under this condition is a minimal perfect hash function, a perfect hash function with the property that it hashes m keys to m buckets with no collisions. Not only is performance optimized, but no space is wasted in the hash table. In general, it is difficult to find a MPHF. Knuth (1973) observes that only one in 10 million functions is a perfect hash function for mapping the 31 most frequently used English words into 41 addresses. Minimal perfect hash functions are rarer still. This section presents an algorithm for finding minimal perfect hash functions for a given set of keys. The algorithm is not guaranteed to work, but is almost always successful. Before explaining the algorithm, it will be helpful to give some background on the topic.
13.5.1 Existence Proof
One might ask whether a minimal perfect hash function (hereafter abbreviated MPHF) h exists for a set of keys. Jaeschke (1981) proves that the answer is yes. Consider the problem of mapping a set of m positive integers, bounded above by N, without collisions into a hash table T with m buckets. The following algorithm defines a suitable MPHF:

Store the keys in an array k of length m.
Allocate an array A of length N, and initialize all values to ERROR.
for ( i = 0; i < m; i++ )
    A[k[i]] = i

The array A defines h: allowable keys map into addresses {0, ..., m - 1} and other keys map into ERROR. This defines a MPHF, but the array A is mostly empty; since usually m is far smaller than N, the hash function occupies too much space to be useful. In other words, efficient use of storage in hashing encompasses the representation of the hash function as well as optimal use of buckets. Both must be acceptably small if minimal perfect hashing is to be practical.
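Jaeschke's construction can be rendered directly in C (an illustrative sketch with invented names; note that the array A has N entries even though only m of them are meaningful):

```c
#include <assert.h>

#define N     100    /* keys are positive integers below N */
#define ERROR (-1)

static int A[N];

static const int sample_keys[4] = { 3, 17, 42, 99 };

/* Build the table-lookup MPHF: key k[i] hashes to address i. */
void build_mphf(const int *k, int m)
{
    int i;
    for (i = 0; i < N; i++) A[i] = ERROR;
    for (i = 0; i < m; i++) A[k[i]] = i;
}

int h(int key) { return A[key]; }
```

The function is minimal and perfect (4 keys map onto addresses 0..3 with no collisions), but it costs N = 100 array entries to describe, which is exactly the storage problem the text identifies.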
13.5.2 An Algorithm to Compute a MPHF: Concepts
The scarcity of suitable functions suggests that it is best to search function spaces for them using computer programs. There are several strategies for doing so. The simplest is to select a class of functions that is likely
to include a number of minimal perfect hash functions, and then search for a MPHF in that class by assigning different values to each of the parameters characterizing the class. Carter and Wegman (1979) introduced the idea of a class H of functions that is universal₂, that is, one in which no pair of distinct keys collides very often. By random selection from H, one can select candidate functions and expect that a hash function having a small number of collisions can be found quickly. This technique has been applied to dynamic hashing by Ramakrishna and Larson (1989). Sprugnoli (1978) proposes two classes of functions, one with two and the other with four parameters, each of which may yield a MPHF, but searching for usable parameter values is feasible only for very small key sets. Jaeschke (1981) suggests a reciprocal hashing scheme with three parameters, guaranteed to find a MPHF, but only practical when m ≤ 20. Chang (1986) proposes a method with only one parameter. Its value is likely to be very large, and a function is required that assigns a distinct prime to each key; however, he gives no algorithm for that function. A practical algorithm finding perfect hash functions for fairly large key sets is described by Cormack et al. (1985). They illustrate trade-offs between time and size of the hash function, but do not give tight bounds on total time to find PHFs or experimental details for very large key sets. The above-mentioned "search-only" methods may (if general enough, and if enough time is allotted) directly yield a perfect hash function, with the right assignment of parameters. However, analysis of the lower bound on the size of a suitable MPHF suggests that if parameter values are not to be virtually unbounded, then there must be a moderate number of parameters to assign. In the algorithms of Cichelli (1980) and of Cercone et al.
(1983) are two important concepts: using tables of values as the parameters, and using a mapping, ordering, and searching (MOS) approach (see Figure 13.10). While their tables seem too small to handle very large key sets, the MOS approach is an important contribution to the field of perfect hashing. In the MOS approach, construction of a MPHF is accomplished in three steps. First, the mapping step transforms the key set from an original to a new universe. Second, the ordering step places the keys in a sequence that determines the order in which hash values are assigned to keys. The ordering step may partition the order into subsequences of consecutive keys. A subsequence may be thought of as a level, with the keys of each level assigned their hash values at the same time. Third, the searching step assigns hash values to the keys of each level. If the Searching step encounters a level it is unable to accommodate, it backtracks, sometimes to an earlier level, assigns new hash values to the keys of that level, and tries again to assign hash values to later levels.
Figure 13.10: MOS method to find perfect hash functions
13.5.3 Sager's Method and Improvement
Sager (1984, 1985) formalizes and extends Cichelli's approach. Like Cichelli, he assumes that a key is given as a character string. In the mapping step, three auxiliary (hash) functions are defined on the original universe of keys U:
h0: U → {0, . . . , m - 1}
h1: U → {0, . . . , r - 1}
h2: U → {r, . . . , 2r - 1}
where r is a parameter (typically m/2) that determines the space to store the perfect hash function (i.e., |h| = 2r). The auxiliary functions compress each key k into a unique identifier (h0(k), h1(k), h2(k)), a triple of integers in a new universe of size mr². The class of functions searched is

h(k) = (h0(k) + g(h1(k)) + g(h2(k))) mod m     (1)

where g is a function whose values are selected during the search. Sager uses a graph that represents the constraints among keys. The mapping step goes from keys to triples to a special bipartite graph, the dependency graph, whose vertices are the h1(k) and h2(k) values and whose edges represent the words. The two vertex sets of the dependency graph are {0, . . . , r - 1} and {r, . . . , 2r - 1}. For each key k, there is an edge connecting h1(k) and h2(k), labeled by k. See Figure 13.11. Note that it is quite possible that some vertices will have no associated arcs (keys), and that some arcs may have the same pairs of vertices as their endpoints.
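Once the search has chosen the table g, evaluating a member of this class of functions is trivial. A minimal sketch, with all values (m, r, the g table, and the triples) invented for illustration:

```c
#include <assert.h>

#define M     4
#define TWO_R 6    /* r = 3, so g is stored as 2r = 6 entries */

/* A hypothetical g table, as equation (1)'s search might produce. */
static const int g[TWO_R] = { 2, 0, 1, 3, 0, 2 };

/* h(k) = (h0(k) + g(h1(k)) + g(h2(k))) mod m, given the key's triple.
   h1 lies in 0..r-1 and h2 in r..2r-1, so both index g directly. */
int hash_mphf(int h0, int h1, int h2)
{
    return (h0 + g[h1] + g[h2]) % M;
}
```

The point of the representation is visible here: the whole hash function is the fixed formula plus the 2r-entry table g, so its storage cost is 2r words regardless of how the search found it.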
Figure 13.11: Dependency graph

In the ordering step, Sager employs a heuristic called mincycle that is based on finding short cycles in the graph. Each iteration of the ordering step identifies a set of unselected edges in the dependency graph in as many small cycles as possible. The set of keys corresponding to that set of edges constitutes the next level in the ordering. There is no proof given that a minimal perfect hash function can be found, but mincycle is very successful on sets of a few hundred keys. Mincycle takes O(m⁴) time and O(m³) space, while the subsequent searching step usually takes only O(m) time. Sager chooses values for r that are proportional to m; a typical value is r = m/2. In the case of minimal perfect hashing (m = n), it requires 2r = n computer words of lg n bits each to represent g. Fox et al. (1989a)
have shown, using an argument due to Mehlhorn (1982), that a lower bound on the number of bits per key needed to represent a MPHF is approximately 1.4427. Sager's value is therefore somewhat higher than the optimal. To save space, the ratio 2r/n must be reduced as low as possible, certainly below 1. Early work to explore and improve Sager's technique led to an implementation, with some slight improvements and with extensive instrumentation added on, described by Datta (1988). Further investigation by Fox et al. (1989a) yielded a modified algorithm requiring O(m³) time. This algorithm has been used to find MPHFs for sets of over a thousand words. One-thousand-word key sets are good, but still impractical for many information retrieval applications. As described in Fox et al. (1989b), Heath subsequently devised an O(m log m) algorithm, which is practical for large sets of keys. It is based on three crucial observations about previous work:

1. Randomness must be exploited whenever possible. The functions suggested by Sager do not yield distinct triples in the mapping stage with large key sets. Randomness can help improve this property.

2. The vertex degree distribution in the dependency graph is highly skewed. This can be exploited to make the ordering step much more efficient. Previously, it required up to O(m³) time; this observation reduces it to O(m log m).

3. Assigning g values to a set of related words can be viewed as trying to fit a pattern into a partially filled disk, where it is important to enter large patterns while the disk is only partially full.

Since the mapping and searching steps are O(m), the algorithm is O(m log m) with the improved ordering step.
13.5.4 The Algorithm
This section presents the algorithm. It is described in terms of its three steps, plus the main program that fits these steps together. A complete implementation is too large to fit comfortably into this chapter, but it is included as an appendix.
The main program
The main program takes four parameters: the name of a file containing a list of m keys, m, a ratio for determining the size of the hash table, and the name of a file in which the output is to be written. It executes each of the three steps and, if they all succeed, creates a file containing a MPHF for the keys. Figure 13.12 outlines the main routine. The main program is responsible for allocating enough arcs and vertices to form the dependency graph for the algorithm. "arcs" is an array of the arcs, and "vertices" an array of the vertices. Each arc corresponds to exactly one key. The data structures associated with arcs are as follows:

typedef struct arc {
    int         h0, h12[2];
    struct arc *next_edge[2];
} arcType;

typedef struct {
    int      no_arcs;
    arcType *arcArray;
} arcsType;

main(argc, argv)
int   argc;
char *argv[];
{
    arcsType     arcs;
    verticesType vertices;
    int          seed;

    allocate_arcs( &arcs, atoi(argv[2]) );
    allocate_vertices( &vertices, atoi(argv[2]) * atof(argv[3]) );
    if ( mapping( arcs, vertices, &seed, argv[1] ) == NORM ) {
        ordering( arcs, vertices );
        if ( searching( arcs, vertices ) == NORM )
            write_gfun( vertices, seed, argv[4] );
    }
}

Figure 13.12: Main program for MPHF algorithm

The arcType structure stores the h0, h1, and h2 values for an arc, plus two singly linked lists that store the arcs incident to the vertices h1 and h2. The arcsType structure is an array of all arcs, with a record of how many exist.
The data structures for vertices have the form:

typedef struct {
    int g, prec, succ;
    struct arc *first_edge;
} vertexType;

typedef struct {
    int no_vertices, maxDegree, vsHead, vsTail, rlistHead;
    vertexType *vertexArray;
} verticesType;

In vertexType, the first_edge field is the header of a linked list of arcs incident to the vertex (the next_edge values of arcType continue the list). The prec and succ fields form a doubly linked vertex list whose purpose is explained in the ordering stage. The g field ultimately stores the g function, computed during the searching stage; to save space, however, it is used for different purposes during each of the three stages.

The dependency graph created by the mapping step has 2r vertices. (The value of r is the product of m and the ratio supplied as the third parameter to the main routine. It is therefore possible to exert some control over the size of the hash function: a large value of r increases the probability of finding an MPHF, at the expense of memory.) The variable vertices, of type verticesType, holds all vertices. The fields of vertices will be explained as they are used in the various steps.
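To make the linked representation concrete, here is a small self-contained sketch, simplified from the chapter's arcType and vertexType (only the fields needed for the incidence lists are kept), that threads two arcs onto one vertex's list and walks the list to count the degree:

```c
#include <stddef.h>

/* Simplified versions of the chapter's structures: each arc keeps a
   next_edge pointer per side (0 = h1 side, 1 = h2 side), and a vertex
   keeps the head of its incidence list in first_edge. */
typedef struct arc {
    int h12[2];               /* vertex labels on the h1 and h2 sides */
    struct arc *next_edge[2]; /* continuation of each vertex's list   */
} arcType;

typedef struct {
    struct arc *first_edge;   /* head of the incidence list           */
} vertexType;

/* Push arc a onto vertex v's list on the given side, the way
   construct_graph does: the new arc becomes the head. */
static void add_incidence(vertexType *v, arcType *a, int side)
{
    a->next_edge[side] = v->first_edge;
    v->first_edge = a;
}

/* Walk the list to compute the vertex's degree.  Each arc is followed
   through the side on which this vertex appears. */
static int degree(const vertexType *v, int vertex_label)
{
    int d = 0;
    const arcType *a = v->first_edge;
    while (a != NULL) {
        int side = (a->h12[0] == vertex_label) ? 0 : 1;
        d++;
        a = a->next_edge[side];
    }
    return d;
}
```

Threading two arcs whose h1 value is vertex 3 onto that vertex's list, for example, yields a degree of 2.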
The mapping step
The code for the mapping step is shown in Figure 13.13. The step is responsible for constructing the dependency graph from the keys. This is done as follows. Three random number tables, one for each of h0, h1, and h2, are initialized. The number of columns in a table determines the greatest possible key length. The number of rows is currently 128: one for each possible ASCII character. (This is not strictly necessary but helps exploit randomness.) Next, the routine map_to_triples maps each key k to a triple (h0, h1, h2) using the formulas:

    h0(k) = ( T0[k1,1] + T0[k2,2] + . . . + T0[kl,l] ) mod m            (2)

    h1(k) = ( T1[k1,1] + T1[k2,2] + . . . + T1[kl,l] ) mod r            (3)

    h2(k) = ( ( T2[k1,1] + T2[k2,2] + . . . + T2[kl,l] ) mod r ) + r    (4)

where l is the length of key k, ki is the ASCII value of its ith character, and T0, T1, T2 are the three random number tables. That is, each function hi is computed as the sum of the random numbers indexed by the ASCII values of the key's characters.

int mapping( key_file, arcs, vertices, seed )
char         *key_file;   /* in: name of file containing keys.        */
arcsType     *arcs;       /* out: arcs in bipartite graph.            */
verticesType *vertices;   /* out: vertices in bipartite graph.        */
int          *seed;       /* out: seed selected to initialize the     */
                          /*      random tables.                      */
{
    int mapping_tries = 0;
    randomTablesType randomTables;   /* Three random number tables.   */

    while ( mapping_tries++ < MAPPINGS ) {
        initialize_arcs( arcs );
        initialize_vertices( vertices );
        initialize_randomTable( randomTables, seed );
        map_to_triples( key_file, arcs, vertices->no_vertices/2,
                        randomTables );
        if ( construct_graph( arcs, vertices ) == NORM )
            return( NORM );
    }
    return( ABNORM );
}

Figure 13.13: The mapping routine

int construct_graph( arcs, vertices )
arcsType     *arcs;       /* in out: arcs.     */
verticesType *vertices;   /* in out: vertices. */
{
    int i;                /* Iterator over all arcs.           */
    int j;                /* j = 0 and 1 for h1 and h2 side,   */
                          /* respectively.                     */
    int vertex;
    int status = NORM;

    for ( j = 0; j < 2; j++ ) {
        for ( i = 0; i < arcs->no_arcs; i++ ) {
            vertex = arcs->arcArray[i].h12[j];

            /* Update vertex degree count and adjacency list. */
            vertices->vertexArray[vertex].g++;
            arcs->arcArray[i].next_edge[j] =
                vertices->vertexArray[vertex].first_edge;
            vertices->vertexArray[vertex].first_edge = &arcs->arcArray[i];

            /* Figure out the maximal degree of the graph. */
            if ( vertices->vertexArray[vertex].g > vertices->maxDegree )
                vertices->maxDegree = vertices->vertexArray[vertex].g;

            if ( (j == 0) && check_dup( &arcs->arcArray[i] ) == ABNORM ) {
                status = ABNORM;       /* Duplicate found. */
                break;
            }
        }
    }
    return( status );
}

Figure 13.14: The construct_graph routine
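Equations (2)-(4) can be exercised with a toy, self-contained version. The table contents and the sizes M and R below are hypothetical stand-ins (the chapter's tables are filled by the pmrandom generator); the point is only the indexing-and-summing pattern:

```c
/* Hypothetical sizes for the sketch: M keys, 2*R vertices. */
#define M 100
#define R 50

/* A stand-in for one of the chapter's 128-row random tables;
   here "table[c][j]" = c + j + table_no so results are easy to
   check by hand.  This is NOT random -- illustration only. */
static int table_entry(int table_no, int c, int j)
{
    return c + j + table_no;
}

/* Sum the table entries selected by each character's ASCII value
   and its position, as map_to_triples does for one table. */
static int h_sum(int table_no, const char *key)
{
    int j, sum = 0;
    for (j = 0; key[j] != '\0'; j++)
        sum += table_entry(table_no, (int)key[j], j);
    return sum;
}

/* Reductions as in equations (2)-(4): h1 lands in 0..R-1,
   h2 in R..2R-1, i.e. the two halves of the vertex labels. */
static int h0(const char *key) { return h_sum(0, key) % M; }
static int h1(const char *key) { return h_sum(1, key) % R; }
static int h2(const char *key) { return (h_sum(2, key) % R) + R; }
```

For the key "ab" (ASCII 97, 98) this yields the triple (96, 48, 50); note that h2 falls in the second half of the vertex labels, as the dependency graph requires.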
The triple for the ith key in the input file is saved in the ith entry of the arcs array. Because the triples are generated using random numbers, it is possible for two triples to be identical; there is never a guarantee that all triples will be unique, although duplicates are fairly rare in practice, and the probability of generating m distinct triples rapidly approaches 1 as m grows. If a duplicate occurs, a new set of random number tables must be generated and the mapping step repeated. To prevent infinite loops, the mapping step is never attempted more than a fixed number of times.

The mapping step then builds the dependency graph. Figure 13.14 shows the routine construct_graph(), which does this. The arcs have already been built (in map_to_triples()); what remains is to build the vertex array. Half of the vertices in the dependency graph correspond to the h1 values and are labeled 0, . . ., r - 1. The other half correspond to the h2 values and are labeled r, . . ., 2r - 1. There is one edge in the dependency graph for each key: a key k corresponds to an edge between the vertex labeled h1(k) and the vertex labeled h2(k). (There may be other edges between h1(k) and h2(k), but they are labeled with keys other than k.) Each arc is examined in turn; the vertices associated with it are the values stored in h12[0] and h12[1]. The degree counts of these vertices are incremented, and the incidence lists associated with each vertex are updated.

The ordering step

The goal of the ordering step is to partition the keys into a sequence of levels. The step produces an ordering of the vertices of the dependency graph (excluding those of degree 0, which do not correspond to any key). From this ordering, the partition is easily derived. If the vertex ordering is v1, . . ., vt, then the level of keys K(vi) corresponding to a vertex vi, 1 <= i <= t, is the set of edges incident to both vi and a vertex earlier in the ordering. More formally, if 0 <= vi < r, then

    K(vi) = { kj | h1(kj) = vi, h2(kj) = vs, s < i }          (5)

Similarly, if r <= vi < 2r, then

    K(vi) = { kj | h2(kj) = vi, h1(kj) = vs, s < i }          (6)

The rationale for this ordering comes from the observation that the vertex degree distribution is skewed toward vertices of low degree. For reasons discussed in the section on the searching step, it is desirable to have levels that are as small as possible, and to have any large levels early in the ordering.

The heuristic used to order the vertices is analogous to the algorithm by Prim (1957) for constructing a minimum spanning tree. At each iteration of Prim's algorithm, the arc added to the partially formed spanning tree is the one of lowest cost such that one of its vertices is in the tree and the other is not. In the ordering step, the arc selected is the one that has maximal degree at the vertex not adjacent to any selected arcs. If several arcs are equivalent, one is chosen randomly.

Since the dependency graph may consist of several disconnected component graphs, the selection process will not stop after the vertices in the first component graph are ordered. Rather, it will continue to process components until all vertices of nonzero degree have been put into the ordering. The vertex sequence is maintained in a list termed the "VS list," the head of which is the vsHead field of the vertices variable and whose elements are linked via the succ and prec fields of a vertexType.

Ordering the vertices efficiently requires quickly finding the next vertex to add to the VS list. There are two types of vertices to consider. The first are those that start a new component graph to be explored. To handle these, the ordering step first builds (in the initialize_rList() routine) a doubly linked list of all vertices in descending order of degree. This list, whose head is in the rlistHead field of the vertices variable, can be built in linear time because of the distribution of the degrees. Since the first vertex in it is most likely to be within the largest component of the graph, the routine almost always correctly orders the vertices in the largest component first, rather than in the smaller components (which would produce a degraded ordering). The starting vertices for other components are easily decided from the rList. Figure 13.15 shows the ordering step's implementation.

void ordering( arcs, vertices )
arcsType     *arcs;       /* in out: the arcs.     */
verticesType *vertices;   /* in out: the vertices. */
{
    int        degree;
    int        side;
    vertexType *vertex;
    arcType    *arc;

    initialize_rList( vertices );
    allocate_vheap( arcs->no_arcs, vertices->no_vertices );

    /* Initialize the VS list. */
    vertices->vsHead = vertices->vsTail = NP;

    while ( vertices->rlistHead != -1 ) { /* Process each component graph. */
        initialize_vheap();
        vertex = &vertices->vertexArray[vertices->rlistHead];
        do {
            vertex->g = 0;                /* Mark node "visited". */
            delete_from_rList( vertex, vertices );
            append_to_VS( vertex, vertices );
            side = vertex - vertices->vertexArray >= vertices->no_vertices/2;
            if ( vertex->first_edge != 0 ) {
                /* Add adjacent nodes that are not visited and */
                /* not in virtual heap to the virtual heap.    */
                arc = vertex->first_edge;
                while ( arc != 0 ) {
                    int adj_node = arc->h12[(side+1)%2];
                    if ( (degree = vertices->vertexArray[adj_node].g) > 0 ) {
                        add_to_vheap( &vertices->vertexArray[adj_node],
                                      degree );
                        vertices->vertexArray[adj_node].g *= -1;
                    }
                    arc = arc->next_edge[side];
                }
            }
        } while ( max_degree_vertex( &vertex ) == NORM );
    }
    free_vheap();
}

Figure 13.15: The ordering step

The second type of vertices to be added to the VS list are those not yet ordered but adjacent to vertices in the ordering. max_degree_vertex() removes from the heap a vertex of maximal degree with respect to the others in the heap. This becomes the next "visited" vertex: it is marked visited (the g field is used to do so), deleted from the rList, added to the VS list, and its adjacent vertices are added to the heap. This process repeats until the heap is empty, at which time the next component in the graph is selected. Each component of the graph is considered in a separate iteration of the main loop of ordering(): the heap is first emptied, and a vertex of maximal degree is extracted from the rList. It is marked visited, added to the VS list, and, as above, all vertices adjacent to it are added to the heap if they have not already been visited and are not already in the heap. When all components have been processed, the vertex sequence has been created.

The time to perform this step is bounded by the time to perform operations on the heap, normally O(log n). However, the skewed vertex degree distribution facilitates an optimization, and the expected time of the ordering step becomes O(n) as a result. The algorithm is as follows. A vertex of degree i, 1 <= i <= 5, is not stored on the heap. Instead, five stacks are used, stack i containing only vertices of degree i. Vertices can be pushed onto and popped from a stack in constant time. The heap therefore contains only vertices of degree greater than 5, a small fraction of the total number.
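The stack half of this optimization can be sketched in a few lines. This is a simplified illustration, not the chapter's vheap module (which also contains a real heap for degrees above 5); the capacity constant is arbitrary:

```c
/* Observation 2's optimization, stacks only: vertices of degree 1..5
   each go into a stack for that degree, so pushes and pops are O(1).
   Popping a maximal-degree vertex scans stacks 5 down to 1. */

#define MAX_SMALL_DEGREE 5
#define STACK_CAP 100    /* arbitrary capacity for the sketch */

static int stacks[MAX_SMALL_DEGREE][STACK_CAP];
static int tops[MAX_SMALL_DEGREE];   /* entries per stack */

static void push_vertex(int vertex, int degree)
{
    /* degree is assumed to be in 1..5 here; larger degrees would
       go to the heap in the full vheap module. */
    stacks[degree - 1][tops[degree - 1]++] = vertex;
}

/* Return a vertex of the largest degree still stacked, or -1. */
static int pop_max_vertex(void)
{
    int d;
    for (d = MAX_SMALL_DEGREE; d >= 1; d--)
        if (tops[d - 1] > 0)
            return stacks[d - 1][--tops[d - 1]];
    return -1;
}
```

Both operations touch at most five stack tops, so selection among the low-degree vertices costs constant time regardless of how many vertices are stored.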
This optimization detail is hidden by the vheap (virtual heap) module, which is not shown here but can be described quite simply. The add_to_vheap() routine takes both a vertex and its degree as parameters and can therefore determine whether to place the vertex on the heap or in a stack. The max_degree_vertex() routine obtains a vertex of maximum degree by first searching the heap and, if the heap is empty, searching stacks 5, 4, 3, 2, and 1, in that order. The implementations of these two routines are based on well-known algorithms for implementing stacks and heaps; see Sedgewick (1990).

The searching step

The searching step scans the VS list produced by the ordering step and tries to assign hash values to the keys, one level at a time. The levels are easily determined from the ordering in the VS list using equations (5) and (6). Note that the ordering step does not actually yield a data structure of physically distinct levels; a level is simply a concept that is useful in understanding the algorithm.

The hash function ultimately used has the form given in equation (1). Since h0, h1, and h2 have already been computed, what remains is to determine g. The algorithm to compute g is again based on the insight into the vertex degree distribution. The VS list contains vertices with higher degrees first; the partition into levels therefore contains larger groups of interrelated words early on. If these are processed first, the cases most likely to be troublesome are eliminated first. The implementation of the searching step is shown in Figure 13.16.

int searching( arcs, vertices )
arcsType     *arcs;
verticesType *vertices;
{
    int        i;                   /* Each vertex in the VS list.        */
    int        status = ABNORM;     /* Condition variable.                */
    int        searching_tries = 0; /* Running count of searching tries.  */
    char       *disk;               /* Simulated hash table.              */
    intSetType primes;              /* Table of primes for pattern shifts.*/
    intSetType slotSet;             /* Set of hash addresses.             */

    disk = (char*) owncalloc( arcs->no_arcs, sizeof(char) );
    primes.intSetRep = (int*) owncalloc( vertices->maxDegree, sizeof(int) );
    slotSet.intSetRep = (int*) owncalloc( vertices->maxDegree, sizeof(int) );
    initialize_primes( arcs->no_arcs, &primes );

    while ( (searching_tries++ < SEARCHINGS) && (status == ABNORM) ) {
        status = NORM;
        initialize_search( arcs, vertices, disk );
        i = vertices->vsHead;       /* Get the highest-level vertex. */
        while ( i != NP ) {
            /* Fit keys of level of vertex i onto the disk. */
            vertices->vertexArray[i].prec = VISIT;
            if ( fit_pattern( arcs, vertices, i, disk,
                              &primes, &slotSet ) == ABNORM ) {
                status = ABNORM;    /* Search failed at vertex i. Try */
                break;              /* a new pattern.                 */
            }
            else                    /* Search succeeded. Proceed      */
                i = vertices->vertexArray[i].succ;   /* to next node. */
        }
    }
    free( disk );
    free( (char *)primes.intSetRep );
    free( (char *)slotSet.intSetRep );
    return( status );
}

Figure 13.16: The searching step
An empty disk is allocated: the simplest way to ensure that the keys are correctly fit into a hash table is to actually build a hash table, and the disk variable in searching() is used for this purpose. The initialize_search() routine initializes it to an array of m slots (since the hash function is to be minimal as well as perfect), each of which is set to EMPTY. All vertices are marked as not visited (in this step, the prec field holds this information), and the g field of each vertex is set to a random value. Using a random value for g often contributes to a high probability of fitting all keys.

The searching() routine computes a table of prime numbers before computing hash addresses. Ideally, each vertex should have a different prime, but computing m primes is expensive. Twenty small primes have been shown, empirically, to give satisfactory results, so a fixed number are generated. The routine passes this table to fit_pattern(), which randomly chooses one value from the table to be used as the shift.

Because a searching step is not guaranteed to find a suitable set of values for g, the step is repeated up to a constant number of times. Each repetition works as follows. The next vertex i in the VS list is selected. All keys at the level of i must be fit onto the disk; this is the task of fit_pattern(), which behaves as follows. Given vertex i and disk, it examines each arc in K(vi) to see if the arc's key is at the current level (this is indicated by whether or not the vertex adjacent to i has already been visited). If so, it determines whether the current values of g for vertex i and the vertices adjacent to i make any keys hash to an already-filled spot. As the g function is computed for a key, the hash address h for the key is determined, and slot h in disk is marked FULL. If all keys fit, then fit_pattern() has succeeded for this level, marks each hash address for the examined keys as FULL in disk, and terminates.

The g value determines what might be considered a "pattern." Because all keys of a particular level are associated with some vertex vi, the hash function's value for all k in K(vi) is determined in part by the value of g for vi. Any change to g will shift the hash addresses of all keys associated with the level. This observation provides the motivation for the action taken when a collision is detected. If fit_pattern() detects a collision for some k in K(vi), it computes a new pattern--that is, a new value of g for vertex i--and all keys on the level are therefore shifted to new addresses. The formula used to compute the new value for g is:

    g  <--  ( g + s ) mod m

where s is a prime number. This means that all possible values for g can be tested in at most m tries.
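The "at most m tries" claim is just modular arithmetic: when the shift s shares no factor with m, repeated application of g <- (g + s) mod m visits every residue exactly once. A self-contained check (the bound of 64 on m is an assumption of this sketch, not of the algorithm):

```c
/* Count how many distinct g values the shift sequence visits before
   repeating.  With gcd(s, m) == 1 (e.g. s prime and not a divisor
   of m) the answer is m: every pattern position gets tried. */
static int count_distinct_g(int g, int s, int m)
{
    int seen[64] = {0};    /* m is assumed to be at most 64 here */
    int count = 0;
    while (!seen[g]) {     /* stop when a value repeats */
        seen[g] = 1;
        count++;
        g = (g + s) % m;
    }
    return count;
}
```

With m = 10 and s = 7 the sequence cycles through all 10 residues; with s = 5 (a divisor of 10) it would visit only 2, which is why the shifts are drawn from a table of primes.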
When the searching step is through, the main routine writes the MPHF to a file. All that is necessary is to write the size of the graph, the seed used to start random number generation, and the values of g for each vertex. Another program can then use the MPHF by:

1. Rereading the values of g.

2. Regenerating the three random number tables, which are needed to recompute h0, h1, and h2.

This is the approach used by the retrieval routine in the appendix. Recall that the mapping step included a call to the routine map_to_triples(), whose purpose was to compute the triples for all keys. This routine calls on compute_h012() to actually compute the triples for a given key; compute_h012() is simply an implementation of equations (2)-(4). Given a key, then, all that is needed to compute its hash address is the code:

arcType arc;

compute_h012( no_arcs, r, tables, key, &arc );
hash = abs( arc.h0 + mphf->gArray[arc.h12[0]]
                   + mphf->gArray[arc.h12[1]] ) % mphf->no_arcs;

where tables contains the three random number tables, mphf holds the g values, and r is the number of vertices on one side of the dependency graph.

Discussion

The algorithm presented in this section provides a practical means to compute minimal perfect hash functions for small to extremely large sets of keys. Indeed, the algorithm has been verified on key set sizes of over one million; previous approaches could not have computed an MPHF for a set that large in reasonable time. One of its more interesting applications has been to compute the keys for an optical disk of information; see Fox (1990).

The resulting hash function is much more complex than the approaches suggested in section 13.3. Somewhat more time is necessary to use the MPHF, both to initialize it and to compute a key's value. For very small data sets, the overhead may not be justified, but for large data sets the algorithm in this section will be, on the average, much quicker. In any case, it is a constant-time algorithm, and there are always advantages to being able to predict the time needed to locate a key.
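The hash computation used at retrieval time, equation (1), is plain integer arithmetic over the precomputed triple and the g table. A self-contained sketch (the gArray contents below are made up purely to exercise the arithmetic, not values a real search would produce):

```c
#include <stdlib.h>

/* Equation (1) as used by the retrieval code: the key's h0 value plus
   the g values selected by its h1 and h2 vertices, reduced modulo the
   number of keys. */
static int mphf_address(int h0, int h1, int h2,
                        const int *gArray, int no_arcs)
{
    return abs(h0 + gArray[h1] + gArray[h2]) % no_arcs;
}
```

For example, with g = {2, 7, 1, 9}, a triple (h0, h1, h2) = (5, 1, 3), and 10 keys, the address is (5 + 7 + 9) mod 10 = 1. The per-key cost is one pass over the key to form the triple plus this constant amount of arithmetic, which is why lookup time is predictable.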
The MPHF uses a large amount of space, which must be present both when the MPHF is computed and when it is regenerated. Ostensibly, its size is proportional to the number of vertices in the dependency graph, that is, O(r). Actually, most space is consumed by the three random number tables: each table has 128 rows and 150 columns, which requires over 1/4 megabyte of storage. This suggests that, except for very large key sets, the approach suggested in section 13.5.1 might be more practical! However, an examination of the equations that access the tables (see the mapping step section) shows that the actual number of columns used is no more than the length of the longest key. Moreover, while having 128 rows helps exploit randomness, an MPHF can be found with fewer. The implementation could therefore be rewritten to use much less space. This is left as an exercise to the reader.

REFERENCES

CARTER, J. L., and M. N. WEGMAN. 1979. "Universal Classes of Hash Functions." J. Computer and System Sciences, 18, 143-54.

CERCONE, N., M. KRAUSE, and J. BOATES. 1983. "Minimal and Almost Minimal Perfect Hash Function Search with Application to Natural Language Lexicon Design." Computers and Mathematics with Applications, 9, 215-31.

CHANG, C. C. 1986. "Letter Oriented Reciprocal Hashing Scheme." Information Sciences, 38, 243-55.

CICHELLI, R. J. 1980. "Minimal Perfect Hash Functions Made Simple." Communications of the ACM, 23, 17-19.

CORMACK, G. V., R. N. S. HORSPOOL, and M. KAISERSWERTH. 1985. "Practical Perfect Hashing." The Computer Journal, 28, 54-58.

DATTA, S. 1988. "Implementation of a Perfect Hash Function Scheme." Blacksburg, Va.: Technical Report TR-89-9 (Master's Report), Department of Computer Science, Virginia Polytechnic Institute and State University.

FOX, E. A. 1990. Virginia Disc One. CD-ROM published by Virginia Polytechnic Institute and State University. Ruckersville, Va.: Nimbus Records.

FOX, E. A., Q. F. CHEN, and L. S. HEATH. 1989a. "A More Cost Effective Algorithm for Finding Perfect Hash Functions." Paper presented at the Seventeenth Annual ACM Computer Science Conference, Louisville, KY.

FOX, E. A., L. S. HEATH, and Q. F. CHEN. 1989b. "An O(n log n) Algorithm for Finding Minimal Perfect Hash Functions." Blacksburg, Va.: TR 89-10, Department of Computer Science, Virginia Polytechnic Institute and State University.

JAESCHKE, G. 1981. "Reciprocal Hashing--a Method for Generating Minimal Perfect Hash Functions." Communications of the ACM, 24, 829-33.

KNUTH, D. E. 1973. The Art of Computer Programming, Vol. 3: Sorting and Searching. Reading, Mass.: Addison-Wesley.

MEHLHORN, K. 1982. "On the Program Size of Perfect and Universal Hash Functions." Paper presented at the 23rd IEEE Symposium on Foundations of Computer Science.

PRIM, R. C. 1957. "Shortest Connection Networks and Some Generalizations." Bell System Technical Journal, 36.

RAMAKRISHNA, M. V., and P. LARSON. 1989. "File Organization Using Composite Perfect Hashing." ACM Transactions on Database Systems, 14, 231-63.

SAGER, T. J. 1984. "A New Method for Generating Minimal Perfect Hashing Functions." Rolla, Mo.: Technical Report CSc-84-15, Department of Computer Science, University of Missouri-Rolla.

SAGER, T. J. 1985. "A Polynomial Time Generator for Minimal Perfect Hash Functions." Communications of the ACM, 28, 523-32.

SEDGEWICK, R. 1990. Algorithms in C. Reading, Mass.: Addison-Wesley.

SPRUGNOLI, R. 1978. "Perfect Hashing Functions: a Single Probe Retrieving Method for Static Sets." Communications of the ACM, 20, 841-50.

APPENDIX: MPHF IMPLEMENTATION

What follows is a complete implementation of the minimal perfect hashing function algorithm. It consists of nineteen files of source code, plus a makefile (for the Unix utility make) containing compilation instructions. The beginning of each file (except the makefile, which comes first) is marked by a comment consisting of a line of asterisks, with the file's name embedded in it.

#
# Makefile for the minimal perfect hash algorithm.
#
# Directives:
#   phf            Make phf, a program to generate a MPHF.
#   regen_mphf.a   Make an object code library capable of regenerating
#                  an MPHF from the specification file generated by
#                  phf.
#   regen_driver   Make regen_driver, a program to test the code in
#                  regen_mphf.a.
#   all (default)  Make the three above items.
#   regression     Execute a regression test.  The phf program should
#                  terminate indicating success.  The regen_driver
#                  program silently checks its results; no news is
#                  good news.
#   lint, lint_phf, lint_regen
#                  Various flavors of consistency checking.
#

COMMON_OBJS= compute_hfns.o pmrandom.o randomTables.o support.o
COMMON_SRCS= compute_hfns.c pmrandom.c randomTables.c support.c
COMMON_HDRS= const.h types.h compute_hfns.h pmrandom.h \
             randomTables.h support.h

MPHF_OBJS=   main.o mapping.o ordering.o searching.o vheap.o
MPHF_SRCS=   main.c mapping.c ordering.c searching.c vheap.c
MPHF_HDRS=   main.h searching.h vheap.h

REGEN_OBJS=  regen_mphf.o
REGEN_SRCS=  regen_mphf.c
REGEN_HDRS=  regen_mphf.h

RD_OBJS=     regen_driver.o
RD_SRCS=     regen_driver.c

PHFLIB=      regen_mphf.a

CFLAGS=      -O
LDFLAGS=
LIBS=        -lm

all: phf regen_driver

phf: $(PHFLIB) $(MPHF_OBJS)
	$(CC) -o phf $(LDFLAGS) $(MPHF_OBJS) $(PHFLIB) -lm

$(PHFLIB): $(REGEN_OBJS) $(COMMON_OBJS)
	ar r $(PHFLIB) $?
	ranlib $(PHFLIB)

regen_driver: $(RD_OBJS) $(PHFLIB)
	$(CC) $(LDFLAGS) -o regen_driver $(RD_OBJS) $(PHFLIB) $(LIBS)

regression: phf regen_driver
	./phf keywords `wc -l < keywords` 0.8 /tmp/hashing-output
	./regen_driver /tmp/hashing-output keywords > /tmp/hashed-words
	rm /tmp/hashing-output /tmp/hashed-words

lint: lint_phf lint_regen

lint_phf:
	lint $(MPHF_SRCS) $(COMMON_SRCS) $(LIBS)

lint_regen:
	lint $(RD_SRCS) $(REGEN_SRCS) $(COMMON_SRCS) $(LIBS)

compute_hfns.o: const.h types.h randomTables.h compute_hfns.h support.h
main.o:         const.h types.h main.h support.h
mapping.o:      const.h types.h pmrandom.h randomTables.h support.h
ordering.o:     const.h types.h support.h vheap.h
pmrandom.o:     pmrandom.h
randomTables.o: const.h types.h pmrandom.h randomTables.h support.h
regen_mphf.o:   const.h types.h pmrandom.h randomTables.h regen_mphf.h \
                support.h
searching.o:    const.h types.h pmrandom.h searching.h support.h
support.o:      const.h support.h
vheap.o:        const.h types.h support.h vheap.h
Notes: **/ #define MAX_INT ((unsigned)(~0)) >> 1 #define NP -1 #define NORM 0 /* Maximum integer. arcType *arc ). /* Null pointer for array-based linked lists. Define globally-useful constant values. */ None. Edited and tested by S. Chen and E. March 1991.htm (36 of 103)7/3/2004 4:21:15 PM . Fox.. Edited and tested by S. April 1991. None. March 1991. int r. Written and tested by Q. Chen and E. Wartik. Notes: **/ #ifdef __STDC__ extern void compute_h012( int n.ooks_Algorithms_Collection2ed/books/book5/chap13. */ /* Normal return. Provenance: Written and tested by Q. /* Abnormal return. char *key. #else extern void #endif /****************************** Purpose: Provenance: const. randomTablesType tables.Information Retrieval: CHAPTER 13: HASHING ALGORITHMS associated with a key. April 1991.h ******************************* compute_h012().. Fox. */ */ #define ABNORM -1 file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. Wartik.
htm (37 of 103)7/3/2004 4:21:15 PM . Fox.ooks_Algorithms_Collection2ed/books/book5/chap13. Edited by S. */ #define NOTVISIT */ #define VISIT */ #define EMPTY */ #define FULL */ 0 /* Number of primes. '1' /* Indication of a filled slot in the disk. **/ #ifdef __STDC__ file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. Provenance: Written and tested by Q. '0' /* Indication of an empty slot in the disk. April 1991. pp. /***************************** Purpose: pmrandom. 1192-1201. It is taken from Park and Miller's paper.. Notes: The implementation is better than the random number generator from the C library. Wartik. "Random Number Generators: Good Ones are Hard to Find.h ***************************** External declarations for random-number generator package used by this program. March 1991. */ */ */ #define SEARCHINGS 10 #define MAX_KEY_LENG COLUMNS #define PRIMES 20 stage. /* Maximum length of a key.. used in searching /* Indication of an un-visited node.Information Retrieval: CHAPTER 13: HASHING ALGORITHMS #define MAPPINGS 4 /* Total number of mapping runs /* Total number of searching runs. 1 /* Indication of a visited node." in CACM 31 (1988). Chen and E.
*/ extern int sequence. Written and tested by Q. pmrandom(). */ /* Rows of the random table (suitable for char). Wartik. /* Columns of the random table. Chen and E. /* Set the seed to a specified pmrandom(). April 1991. /* Get next random number in the getseed(). #define DEFAULT_SEED 23 randomTables.h **************************** /************************** Purpose: Provenance: External definitions for the three random number tables. */ #else extern void extern int extern int #endif setseed(int).ooks_Algorithms_Collection2ed/books/book5/chap13. */ extern int seed. getseed(). Edited and tested by S..htm (38 of 103)7/3/2004 4:21:15 PM . **/ #define NO_TABLES 3 #define ROWS 128 */ #define COLUMNS 150 /* Number of random number tables. March 1991. Fox. /* random number table */ #ifdef __STDC__ file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. /* Get the current value of the setseed()..Information Retrieval: CHAPTER 13: HASHING ALGORITHMS extern void value. */ typedef int randomTablesType[NO_TABLES] [ROWS] [COLUMNS].
Provenance: Written and tested by Q. Fox. /**************************** Purpose: regen_mphf. Edited and tested by S. /* Number of vertices used to compute MPHF. Chen and E. seed. /* The random number tables. */ no_arcs. int initialize_randomTable(). /* The array to hold g values. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. tables. Notes: **/ typedef struct { int */ int */ int tables. /* Number of keys (arcs) in the key set. Wartik.htm (39 of 103)7/3/2004 4:21:15 PM . None. April 1991. March 1991. #else extern void #endif initialize_randomTable( randomTablesType tables.h ***************************** External declarations for regenerating and using an already-computed minimal perfect hashing function.Information Retrieval: CHAPTER 13: HASHING ALGORITHMS extern void *seed ). no_vertices.ooks_Algorithms_Collection2ed/books/book5/chap13.. #ifdef __STDC__ *gArray. /* The seed used for the random number randomTablesType */ int */ } mphfType..
.htm (40 of 103)7/3/2004 4:21:15 PM . retrieve ().h ****************************** External interface for support routines. int tbl_seed. write_gfun(arcsType *arcs. int size). release_mphf ( mphfType *mphf ). Notes: **/ #ifdef __STDC__ extern char extern char extern void *owncalloc(int n.Information Retrieval: CHAPTER 13: HASHING ALGORITHMS extern int extern void extern int #else extern int extern void extern int #endif regen_mphf ( mphfType *mphf. char *key ).. retrieve ( mphfType *mphf. Fox. char *spec_file ). *ownrealloc(char *area. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. int new_size). release_mphf (). March 1991. char *spec_file). None. Written and tested by Q. verticesType *vertices. verticesType *vertices). Edited and tested by S. Chen and E. extern int #else verify_mphf(arcsType *arcs. /***************************** Purpose: Provenance: support. regen_mphf ().ooks_Algorithms_Collection2ed/books/book5/chap13. Wartik. April 1991.
/****************************** types.h *******************************

   Purpose:      Define globally-useful data types.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        None.
**/

#include "const.h"

typedef struct arc {               /* arc data structure                     */
    int h0;                        /* h0 value                               */
    int h12[2];                    /* h1 and h2 values                       */
    struct arc *next_edge[2];      /* pointer to arc sharing same h1 or h2   */
} arcType;

typedef struct {                   /* vertex data structure                  */
    struct arc *first_edge;        /* pointer to the first adjacent edge     */
    int g;                         /* g value                                */
    int prec;                      /* backward pointer of the vertex-list    */
    int succ;                      /* forward pointer of the vertex-list     */
} vertexType;

typedef struct {                   /* arcs data structure                    */
    int no_arcs;                   /* number of arcs in the graph            */
    arcType *arcArray;             /* arc array                              */
} arcsType;

typedef struct {                   /* vertices data structure                */
    int no_vertices;               /* number of vertices in the graph        */
    int maxDegree;                 /* max degree of the graph                */
    int vsHead;                    /* VS list head                           */
    int vsTail;                    /* VS list tail                           */
    int rlistHead;                 /* remaining vertex list head             */
    vertexType *vertexArray;       /* vertex array                           */
} verticesType;

typedef struct {                   /* integer set data structure             */
    int count;                     /* number of elements in the set          */
    int *intSetRep;                /* set representation                     */
} intSetType;

/****************************** vheap.h *******************************

   Purpose:      Define a "virtual heap" module.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        This isn't intended as a general-purpose stack/heap
                 implementation.  It's tailored toward stacks and heaps
                 of vertices and their degrees, using a representation
                 suitable for accessing them (in this case, an integer
                 index into the vertices->vertexArray array identifies
                 the vertex).
**/

#ifdef __STDC__
extern void allocate_vheap( int no_arcs, int no_vertices );
extern void initialize_vheap( void );
extern void add_to_vheap ( vertexType *vertex, int degree );
extern int  max_degree_vertex ( vertexType **vertex );
extern void free_vheap( void );
#else
extern void allocate_vheap();
extern void initialize_vheap();
extern void add_to_vheap ();
extern int  max_degree_vertex ();
extern void free_vheap();
#endif
/***************************** compute_hfns.c *****************************

   Purpose:      Computation of the three h functions associated with
                 a key.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        None.
**/

#include <stdio.h>
#include <string.h>
#include "types.h"
#include "randomTables.h"

/*************************************************************************
        compute_h012( int, int, randomTablesType, char*, arcType* )

   Return:    void

   Purpose:   Compute the triple for a key.  On return, the h0 and h12
              fields of "arc" have the triple's values.
**/

void compute_h012( n, r, tables, key, arc )
    int n;                      /* in: number of arcs.                      */
    int r;                      /* in: size of h1 or h2 side of the graph.  */
    randomTablesType tables;    /* in: pointer to the random tables.        */
    char *key;                  /* in: key string.                          */
    arcType *arc;               /* out: the key's arc entry.                */
{
    int i;                      /* Iterator over each table.                */
    int j;                      /* Iterator over each character in "key".   */
    int length;                 /* The length of "key".                     */
    int sum[NO_TABLES];         /* Running sum of h0, h1 and h2 values.     */

    length = strlen(key);
    sum[0] = sum[1] = sum[2] = 0;
    for ( i = 0; i < NO_TABLES; i++ )       /* Sum over all the characters */
        for ( j = 0; j < length; j++ )      /* in the key.                 */
            sum[i] += tables[i][(key[j]%ROWS)][j];

    /* Assign mappings for each of h0, h1, and h2 according  */
    /* to the sums computed.                                 */
    arc->h0 = abs( sum[0] ) % n;
    arc->h12[0] = abs( sum[1] ) % r;
    arc->h12[1] = abs( sum[2] ) % r + r;
}
/******************************** main.c *********************************

   Purpose:      Main routine, driving the MPHF creation.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        When compiled, the resulting program is used as follows:

                      phf I L R O

                 where:
                   I  Name of the file to be used as input.  It should
                      contain one or more newline-terminated strings.
                   L  The number of lines in I.
                   R  A real number, giving a ratio between L and the
                      size of the hashing function generated.  1.0 is
                      usually a viable value.  In general, L*R should
                      be an integer.
                   O  Name of a file to be used as output.  It will
                      contain the MPHF if one is found.
**/

#include <stdio.h>
#include <math.h>
#include <string.h>
#include "types.h"
#include "support.h"

#ifdef __STDC__
extern void ordering( arcsType *arcs, verticesType *vertices );
extern void allocate_arcs( arcsType* arcs, int n );
extern void allocate_vertices( verticesType* vertices, int n );
extern void free_arcs( arcsType* arcs );
extern void free_vertices( verticesType* vertices );
extern void exit();
#else
extern void ordering();
extern void allocate_arcs();
extern void allocate_vertices();
extern void free_arcs();
extern void free_vertices();
extern void exit();
#endif

/*************************************************************************
        main( argc, argv )

   Returns:   int -- zero on success, non-zero on failure.

   Purpose:   Take the inputs and call three routines to carry out the
              mapping, ordering and searching tasks.  If they all
              succeed, write the MPHF to the spec file.
**/

main( argc, argv )
    int argc;        /* arg1: key file.  arg2: key set size. */
    char *argv[];    /* arg3: ratio.     arg4: spec file     */
{
    int status;                      /* Return status variable.           */
    int seed;                        /* Seed used to initialize the three */
                                     /* random tables.                    */
    int lines_in_keyword_file;
    double ratio;
    arcsType arcs;                   /* These variables hold all the arcs */
    verticesType vertices;           /* and vertices generated.           */
    char *key_file_name;
    char *specification_file_name;

    if ( argc != 5 ) {
        fprintf(stderr, "Usage: %s keywords kw-lines ratio output-file\n",
                argv[0]);
        exit(1);
    }
    if ( (lines_in_keyword_file = atoi(argv[2])) <= 0 ) {
        fputs("The 2nd parameter must be a positive integer.\n", stderr);
        exit(1);
    }
    else if ( (ratio = atof(argv[3])) <= 0.0 ) {
        fputs("The 3rd parameter must be a positive floating-point value.\n",
              stderr);
        exit(1);
    }
    key_file_name = argv[1];
    specification_file_name = argv[4];

    allocate_arcs ( &arcs, lines_in_keyword_file );
    allocate_vertices( &vertices, (int)(lines_in_keyword_file * ratio) );

    if ( (status = mapping( key_file_name, &arcs, &vertices, &seed )) == NORM ) {
        ordering( &arcs, &vertices );
        if ( (status = searching( &arcs, &vertices )) == NORM &&
             (status = verify_mphf( &arcs, &vertices )) == NORM )
            write_gfun ( &arcs, &vertices, specification_file_name );
    }
    fprintf(stderr, "MPHF creation %s.\n",
            (status == NORM ? "succeeded" : "failed"));
    free_arcs( &arcs );
    free_vertices( &vertices );
    return(status);
}

/*************************************************************************
        allocate_arcs( arcsType*, int )

   Returns:   void

   Purpose:   Given an expected number of arcs, allocate space for an
              arc data structure containing that many arcs, and place
              the number of arcs in the "no_arcs" field of the arc
              data structure.
**/

void allocate_arcs( arcs, n )
    arcsType *arcs;    /* out: Receives allocated storage. */
    int n;             /* in: Expected number of arcs.     */
{
    arcs->no_arcs = n;
    arcs->arcArray = (arcType*) owncalloc( sizeof(arcType), n );
}

/*************************************************************************
        allocate_vertices( verticesType*, int )

   Returns:   void

   Purpose:   Given an expected number of vertices, allocate space for
              a vertex data structure containing that many vertices,
              and place the number of vertices in the "no_vertices"
              field of the vertex data structure.
**/

void allocate_vertices( vertices, n )
    verticesType *vertices;    /* out: Receives allocated storage. */
    int n;                     /* in: Expected number of vertices. */
{
    if (n % 2 != 0)
        n++;
    vertices->no_vertices = n;
    vertices->vertexArray = (vertexType*) owncalloc( sizeof(vertexType), n );
}

/*************************************************************************
        free_arcs( arcsType* )

   Purpose:   Deallocate space for an arc data structure.
**/

void free_arcs( arcs )
    arcsType *arcs;    /* in out: Space to de-allocate. */
{
    free( (char *)arcs->arcArray );
}

/*************************************************************************
        free_vertices( verticesType* )

   Purpose:   Deallocate space for a vertex data structure.
**/

void free_vertices( vertices )
    verticesType *vertices;    /* in out: Space to de-allocate. */
{
    free( (char *)vertices->vertexArray );
}
/****************************** mapping.c *******************************

   Purpose:      Implement the mapping stage of the MPHF algorithm.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        To save space, the "g" field of the vertex structure
                 is used here to hold a vertex's degree.
**/

#include <stdio.h>
#include <string.h>
#include "types.h"
#include "pmrandom.h"
#include "randomTables.h"
#include "compute_hfns.h"

#ifdef __STDC__
extern void initialize_arcs( arcsType *arcs );
extern void initialize_vertices( verticesType *vertices );
extern void map_to_triples( char *key_file, arcsType *arcs, int r,
                            randomTablesType tables );
extern int  construct_graph( arcsType *arcs, verticesType *vertices );
extern int  check_dup( arcType *firstArc );
extern void exit( int status );
#else
extern void initialize_arcs();
extern void initialize_vertices();
extern void map_to_triples();
extern int  construct_graph();
extern int  check_dup();
extern void exit();
#endif

/*************************************************************************
        mapping( char*, arcsType*, verticesType*, int* )

   Return:    int -- NORM if a mapping can be found, ABNORM if not.

   Purpose:   Perform the mapping stage: Map all keys to triples and
              construct the bipartite graph.  This involves:
              -- Generating the h0, h1, and h2 functions.
              -- Allocating the arcs and vertices structures.
              -- Building the lists of edges, ordered by degree.
**/

int mapping( key_file, arcs, vertices, seed )
    char *key_file;            /* in: name of file containing keys.     */
    arcsType *arcs;            /* out: arcs in bipartite graph.         */
    verticesType *vertices;    /* out: vertices in bipartite graph.     */
    int *seed;                 /* out: seed selected to initialize the  */
                               /*      random tables.                   */
{
    int mapping_tries = 0;
    randomTablesType randomTables;    /* Three random number tables.    */

    while ( mapping_tries++ < MAPPINGS ) {
        initialize_arcs( arcs );
        initialize_vertices( vertices );
        initialize_randomTable( randomTables, seed );
        map_to_triples( key_file, arcs, vertices->no_vertices/2, randomTables );
        if ( construct_graph(arcs, vertices) == NORM )
            return(NORM);
        fputs((mapping_tries < MAPPINGS ? "Trying again.\n" : "Giving up.\n"),
              stderr);
    }
    return(ABNORM);
}

/*************************************************************************
        map_to_triples( char*, arcsType*, int, randomTablesType )

   Return:    void

   Purpose:   Compute triples of (h0, h1, h2) for all keys and store
              them in the arc data structure.
**/

void map_to_triples( key_file, arcs, r, tables )
    char *key_file;            /* in: key file name            */
    arcsType *arcs;            /* out: the arcs data structure */
    int r;                     /* in: size of h1 or h2 side    */
    randomTablesType tables;   /* in: random number tables     */
{
    FILE *fp;                          /* Input file pointer. */
    int i = 0;                         /* Iterator over arcs. */
    char string[MAX_KEY_LENG];         /* Key string holder.  */

    if ( (fp = fopen(key_file, "r")) == NULL ) {
        fprintf(stderr, "Can't read \"%s\".\n", key_file);
        exit(1);
    }
    while ( fgets( string, MAX_KEY_LENG, fp ) != NULL && i < arcs->no_arcs ) {
        string[strlen(string)-1] = '\0';    /* Exclude the '\n'. */
        compute_h012 ( arcs->no_arcs, r, tables, string, &arcs->arcArray[i++] );
    }
    if ( i != arcs->no_arcs ) {
        fprintf(stderr, "File \"%s\" contains %d keys, not %d.\n",
                key_file, i, arcs->no_arcs);
        fputs("Re-execute with correct value.\n", stderr);
        exit(1);
    }
    else if ( ! feof(fp) ) {
        fprintf(stderr, "File \"%s\" contains more than %d keys.\n",
                key_file, arcs->no_arcs);
        fputs("Re-execute with correct value.\n", stderr);
        exit(1);
    }
    fclose(fp);
}

/*************************************************************************
        construct_graph( arcsType*, verticesType* )

   Return:    int -- NORM if a graph can be built without duplicate
              arcs, ABNORM if it can't.

   Purpose:   Construct the bipartite graph out of the triples.  On
              successful return,
              -- Each vertex's degree has been determined, and placed
                 in its "g" field.
              -- The "first_edge" field of vertices is a linked list
                 of adjacent edges.
              -- The maximal degree of the graph has been determined.
**/

int construct_graph( arcs, vertices )
    arcsType *arcs;            /* in out: arcs.     */
    verticesType *vertices;    /* in out: vertices. */
{
    int i;        /* Iterator over all arcs.                      */
    int j;        /* j = 0 and 1 for h1 and h2 side, respectively */
    int vertex;
    int status = NORM;

    for ( i = 0; i < arcs->no_arcs; i++ ) {
        for ( j = 0; j < 2; j++ ) {
            vertex = arcs->arcArray[i].h12[j];
            vertices->vertexArray[vertex].g++;        /* Update vertex degree */
            arcs->arcArray[i].next_edge[j] =          /* count and vertex     */
                vertices->vertexArray[vertex].first_edge;  /* adjacency list. */
            vertices->vertexArray[vertex].first_edge = &arcs->arcArray[i];
            if ( (j == 0) && check_dup( &arcs->arcArray[i] ) == ABNORM ) {
                fputs("Duplicate found.\n", stderr);
                status = ABNORM;
                break;
            }
            /* Figure out the maximal degree of the graph. */
            if ( vertices->vertexArray[vertex].g > vertices->maxDegree )
                vertices->maxDegree = vertices->vertexArray[vertex].g;
        }
    }
    return(status);
}

/*************************************************************************
        check_dup( arcType* )

   Return:    int -- NORM if no duplicate triple exists, ABNORM if one
              does.

   Purpose:   Test if some arc on the arc list has an identical triple
              to the first arc on the list.
**/

int check_dup( firstArc )
    arcType *firstArc;    /* in: arc at the head of a list. */
{
    arcType *arc = firstArc->next_edge[0];

    while ( arc != 0 ) {
        if ( ( firstArc->h0 == arc->h0 ) &&
             ( firstArc->h12[1] == arc->h12[1] ) )
            return(ABNORM);             /* Duplication found. */
        arc = arc->next_edge[0];
    }
    return(NORM);                       /* No duplication.    */
}

/*************************************************************************
        initialize_arcs( arcsType* )

   Return:    void

   Purpose:   Make the edge pointers of each arc nil.
**/

void initialize_arcs( arcs )
    arcsType *arcs;    /* out: arcs structure. */
{
    int i;

    for ( i = 0; i < arcs->no_arcs; i++ ) {
        arcs->arcArray[i].next_edge[0] = 0;
        arcs->arcArray[i].next_edge[1] = 0;
    }
}

/*************************************************************************
        initialize_vertices( verticesType* )

   Return:    void

   Purpose:   For each vertex, set the degree to 0 and make the edge
              list empty.
**/

void initialize_vertices( vertices )
    verticesType *vertices;    /* out: vertex structure. */
{
    int i;

    for ( i = 0; i < vertices->no_vertices; i++ ) {
        vertices->vertexArray[i].g = 0;
        vertices->vertexArray[i].first_edge = 0;
    }
    vertices->maxDegree = 0;
}

/*********************************** ordering.c *************************

   Purpose:      Implement the ordering stage of the MPHF algorithm.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        None.
**/
#include <stdio.h>
#include "types.h"
#include "vheap.h"
#include "support.h"

#ifdef __STDC__
extern void append_to_VS( vertexType *vertex, verticesType *vertices );
extern void delete_from_rList( vertexType *vertex, verticesType *vertices );
extern void initialize_rList( verticesType *vertices );
#else
extern void append_to_VS();
extern void delete_from_rList();
extern void initialize_rList();
#endif

/*************************************************************************
        ordering( arcs, vertices )

   Return:    void

   Purpose:   Generate an ordering of the vertices.

   Notes:     The ordering of the vertices is a linked list, the head
              of which is in vertices->vsHead.  The "next element"
              pointer for each node is in the "succ" field of each
              vertex component.  Note that the "succ" field has two
              purposes in this step.  One is that just mentioned.  The
              other is to be part of the rList used in this step.
**/

void ordering( arcs, vertices )
    arcsType *arcs;            /* in out: the arcs data structure.     */
    verticesType *vertices;    /* in out: the vertices data structure. */
{
    int side;                  /* Indicates side of graph.             */
    int degree;
    vertexType *vertex;
    arcType *arc;

    allocate_vheap( arcs->no_arcs, vertices->no_vertices );
    initialize_rList( vertices );
    vertices->vsHead = vertices->vsTail = NP;  /* Initialize the VS list. */

    while ( vertices->rlistHead != NP ) {      /* Process each component   */
                                               /* of the graph.            */
        vertex = &vertices->vertexArray[vertices->rlistHead];
        initialize_vheap();
        do {
            vertex->g = 0;                     /* Mark node "visited".     */
            delete_from_rList( vertex, vertices );
            append_to_VS( vertex, vertices );
            side = vertex - vertices->vertexArray >= vertices->no_vertices/2;
            if ( vertex->first_edge != 0 ) {
                /* Add adjacent nodes that are not visited and */
                /* not in virtual heap to the virtual heap.    */
                arc = vertex->first_edge;
                while ( arc != 0 ) {
                    int adj_node;              /* Node adjacent to vertex. */

                    adj_node = arc->h12[(side+1)%2];
                    degree = vertices->vertexArray[adj_node].g;
                    if ( degree > 0 ) {        /* One such node is found.  */
                        vertices->vertexArray[adj_node].g *= -1;
                        add_to_vheap( &vertices->vertexArray[adj_node], degree );
                    }
                    arc = arc->next_edge[side];
                }
            }
        } while ( max_degree_vertex( &vertex ) == NORM );
    }
    free_vheap();
}

/*************************************************************************
        delete_from_rList( vertex, vertices )

   Return:    void

   Purpose:   Delete the vertex pointed at by "vertex" from the rList
              stored in the vertices data structure.
**/

void delete_from_rList( vertex, vertices )
    vertexType *vertex;        /* in: vertex to delete.         */
    verticesType *vertices;    /* out: vertices data structure. */
{
    if ( vertex->prec != NP )
        vertices->vertexArray[vertex->prec].succ = vertex->succ;
    else
        vertices->rlistHead = vertex->succ;
    if ( vertex->succ != NP )
        vertices->vertexArray[vertex->succ].prec = vertex->prec;
}

/*************************************************************************
        append_to_VS( vertex, vertices )

   Return:    void

   Purpose:   Append the vertex to the vertex ordering VS.
**/

void append_to_VS( vertex, vertices )
    vertexType *vertex;        /* in: the vertex to be added.       */
    verticesType *vertices;    /* out: the vertices data structure. */
{
    int newTail = vertex - vertices->vertexArray;

    if ( vertices->vsHead == NP )
        vertices->vsHead = newTail;
    else
        vertices->vertexArray[vertices->vsTail].succ = newTail;
    vertices->vsTail = newTail;
    vertex->succ = vertex->prec = NP;
}

/*************************************************************************
        initialize_rList( vertices )

   Return:    void

   Purpose:   Set up an rList from the vertices.  An rList is a
              doubly-linked list of vertices in descending order of
              degree.

   Notes:     prec and succ are used to store the list.
**/

void initialize_rList( vertices )
    verticesType *vertices;    /* in out: vertices to be ordered. */
{
    int i, j, previous;
    intSetType heads, tails;   /* Two sets of pointers.  Element i of "heads"  */
                               /* points at the head of a list about degree i, */
                               /* 0<=i<=maxDegree.  The elements of "tails"    */
                               /* are the corresponding tails.                 */

    heads.count = vertices->maxDegree + 1;
    heads.intSetRep = (int*)owncalloc( heads.count, sizeof(int) );
    tails.count = vertices->maxDegree + 1;
    tails.intSetRep = (int*)owncalloc( tails.count, sizeof(int) );
    for ( i = 0; i < heads.count; i++ )
        heads.intSetRep[i] = NP;
    for ( i = 0; i < tails.count; i++ )
        tails.intSetRep[i] = NP;

    /* Construct lists for vertices being of */
    /* degree 0, 1, ..., maxDegree.          */
    for ( i = 0; i < vertices->no_vertices; i++ ) {
        previous = heads.intSetRep[vertices->vertexArray[i].g];
        vertices->vertexArray[i].succ = previous;
        vertices->vertexArray[i].prec = NP;
        if ( previous != NP )
            vertices->vertexArray[previous].prec = i;
        else
            tails.intSetRep[vertices->vertexArray[i].g] = i;
        heads.intSetRep[vertices->vertexArray[i].g] = i;
    }

    /* Construct the rList by linking lists for vertices being of */
    /* degree 0, 1, ..., maxDegree.                               */
    for ( i = heads.count - 1; i > 1; i-- )
        if ( heads.intSetRep[i] != NP ) {
            for ( j = i - 1; j >= 1; j-- )
                if ( tails.intSetRep[j] != NP )
                    break;
            if ( j >= 1 ) {
                vertices->vertexArray[tails.intSetRep[i]].succ = heads.intSetRep[j];
                vertices->vertexArray[heads.intSetRep[j]].prec = tails.intSetRep[i];
            }
        }
    vertices->rlistHead = heads.intSetRep[vertices->maxDegree];

    free( (char *)heads.intSetRep );
    free( (char *)tails.intSetRep );
}

/********************************* pmrandom.c ***************************

   Purpose:      Implement a random-number generator package for this
                 program.  Uses a formula suggested by Park and Miller.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited by S. Wartik, April 1991.

   Notes:        It is assumed that the C data type "int" can store
                 32-bit quantities.
**/

#include "pmrandom.h"

static int seed = DEFAULT_SEED;    /* The seed of the random number */
                                   /* generator.                    */

/*************************************************************************
        setseed(int)

   Returns:   void

   Purpose:   Set the seed for the random number generator.

   Notes:     The value of "seed" must be within [1, 2147483646].
**/
void setseed( new_seed )
    int new_seed;
{
    if ( (new_seed < 1) || (new_seed > 2147483646) )
        new_seed = DEFAULT_SEED;
    seed = new_seed;
}

/*************************************************************************
        pmrandom()

   Returns:   int

   Purpose:   Return the next random number in the sequence.

   Plan:      Uses the formula:  f() = ( 16807 * seed ) mod 2147483647.

   Notes:     None.
**/

int pmrandom()
{
    int low, high, test;
    int tmp = seed;

    high = seed / 127773;                /* 127773 = 2147483647 div 16807 */
    low  = seed % 127773;
    test = 16807 * low - 2836 * high;    /* 2836 = 2147483647 mod 16807   */
    setseed( ( test > 0 ) ? test : test + 2147483647 );
    return(tmp);
}
/*************************************************************************
        getseed()

   Returns:   int

   Purpose:   Get the current value of the seed.

   Notes:     None.
**/

int getseed()
{
    return (seed);
}

/************************** randomTables.c ******************************

   Purpose:      Routines for handling the random number tables.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        None.
**/

#include "types.h"
#include "randomTables.h"
#include "pmrandom.h"
/*************************************************************************
        initialize_randomTable( randomTablesType, int* )

   Return:    void

   Purpose:   Initialize the three random number tables and return the
              seed used.
**/

void initialize_randomTable( tables, seed )
    randomTablesType tables;    /* out: Tables of random numbers.       */
    int *seed;                  /* out: seed used to initialize tables. */
{
    int i, j, k;                /* Iterators over the tables.           */

    setseed(*seed);
    *seed = getseed();
    /* Initialize the tables. */
    for (i = 0; i < NO_TABLES; i++)
        for (j = 0; j < ROWS; j++)
            for (k = 0; k < COLUMNS; k++)
                tables[i][j][k] = pmrandom();
}

/***************************** regen_driver.c ***************************

   Purpose:      A program to test regenerating and using a precomputed
                 hashing function.

   Provenance:   Written and tested by Q. Chen and E. Fox, April 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        The program is used as follows:

                     regen_driver mphf-file keyword-file

                 The result is a set of lines, written to stdout,
                 indicating the bucket of each keyword in the keyword
                 file.
**/

#include <stdio.h>
#include <math.h>
#include <string.h>
#include "types.h"
#include "randomTables.h"
#include "regen_mphf.h"

#ifdef __STDC__
extern void retrieveAll ( mphfType *mphf, char *key_file );
extern void exit( int status );
#else
extern void retrieveAll ();
extern void exit();
#endif
/*************************************************************************
        main( int, char** )

   Return:    Nothing.

   Purpose:   See the header for this file.
**/

main( argc, argv )
    int argc;
    char *argv[];    /* arg1: mphf file.  arg2: key file */
{
    mphfType mphf;

    if ( argc != 3 ) {
        fprintf(stderr, "Usage: %s mphf-file key-file\n", argv[0]);
        exit(1);
    }
    if ( regen_mphf ( &mphf, argv[1] ) == NORM )
        retrieveAll ( &mphf, argv[2] );
    else {
        fprintf(stderr, "Can't regenerate hashing function from \"%s\".\n",
                argv[1]);
        exit(1);
    }
    release_mphf ( &mphf );
    exit(0);
}

/*************************************************************************
        retrieveAll( mphfType*, char* )

   Return:    void

   Purpose:   Given a file of keys and a structure describing a MPHF
              previously computed for those keys, print each key's
              location on the standard output stream.
**/

void retrieveAll( mphf, key_file )
    mphfType *mphf;    /* in: mphf specification. */
    char *key_file;    /* in: the key file.       */
{
    FILE *fp;                      /* Handle for specification file.     */
    char string[MAX_KEY_LENG];     /* Key string.                        */
    int hash;                      /* Computed hash value.               */
    int max_bucket_length;         /* The maximum number of chars        */
                                   /* needed to represent a bucket       */
                                   /* index as a string.                 */

    if ( (fp = fopen(key_file, "r")) == 0 ) {
        fprintf (stderr, "Can't read file \"%s\".\n", key_file);
        exit(1);
    }
    max_bucket_length = (int)log10((double)mphf->no_arcs) + 1;
    while ( fgets( string, MAX_KEY_LENG, fp ) != 0 ) {
        string[strlen(string)-1] = '\0';
        hash = retrieve( mphf, string );
        printf("Bucket %*d: %s\n", max_bucket_length, hash, string);
    }
    fclose(fp);
}

/****************************** regen_mphf.c ****************************

   Purpose:      Routines to regenerate and use a previously-computed
                 minimal perfect hashing function.

   Provenance:   Written and tested by Q. Chen and E. Fox, March 1991.
                 Edited and tested by S. Wartik, April 1991.

   Notes:        None.
**/

#include <stdio.h>
#include "types.h"
#include "randomTables.h"
#include "compute_hfns.h"
#include "regen_mphf.h"
/*************************************************************************
regen_mphf( mphfType*, char* )

Return:    int -- NORM if the MPHF could be reconstructed,
           ABNORM if it couldn't.

Purpose:   Regenerate a MPHF from a specification file.  What is
           regenerated is the table of random numbers.  The retrieve()
           procedure can use these numbers to re-create the h0, h1 and
           h2 values, and from that, the hash value.

Notes:     If the specification file doesn't seem to correspond to the
           expected format, ABNORM is returned.  However, there is no
           way to tell what caused the error.
**/
int regen_mphf( mphf, spec_file_name )
mphfType *mphf;              /* out: the regenerated MPHF structure. */
char *spec_file_name;        /* in: MPHF specification file.         */
{
    int  i;                  /* Iterator through vertices.           */
    FILE *spec_file;

    if ( (spec_file = fopen(spec_file_name, "r")) == NULL )
        return ABNORM;

    if ( fscanf(spec_file, "%d\n%d\n%d\n", &mphf->no_arcs,
                &mphf->no_vertices, &mphf->seed) != 3 ) {
        fclose(spec_file);            /* File is improperly formatted. */
        return ABNORM;
    }

    mphf->gArray = (int*) owncalloc( mphf->no_vertices, sizeof(int) );
    for ( i = 0; i < mphf->no_vertices; i++ )
        if ( fscanf(spec_file, "%d\n", &mphf->gArray[i]) != 1 ) {
            fclose(spec_file);        /* File is improperly formatted. */
            return ABNORM;
        }

    if ( ! feof(spec_file) ) {
        fclose(spec_file);            /* File is improperly formatted. */
        return ABNORM;
    }

    initialize_randomTable( mphf->tables, &mphf->seed );
    fclose(spec_file);
    return NORM;
}

/*************************************************************************
release_mphf( mphfType* )

Return:    void

Purpose:   Release the dynamically-allocated storage associated with
           an MPHF.
**/
void release_mphf( mphf )
mphfType *mphf;     /* in out: pointer to the MPHF structure.         */
{
    free( (char *)mphf->gArray );
}

/*************************************************************************
retrieve( mphfType*, char* )

Return:    int -- a value in the range 0..mphf->no_arcs-1.

Purpose:   Given an MPHF and a key, return the key's hash value.
**/
int retrieve( mphf, key )
mphfType *mphf;     /* in: the mphf specification.                    */
char *key;          /* in: the key, terminated by a null character.   */
{
    int hash;       /* The computed hash value.                       */
    arcType arc;    /* Storage used to hold the h0, h1 and h2 values. */

    compute_h012( mphf->no_arcs, (mphf->no_vertices) / 2,
                  mphf->tables, key, &arc );
    hash = abs( arc.h0 + mphf->gArray[arc.h12[0]] +
                mphf->gArray[arc.h12[1]] ) % mphf->no_arcs;
    return hash;
}

/*************************** searching.c ***********************************

Purpose:    Implement the searching stage of the MPHF algorithm.

Provenance: Written and tested by Q. Chen and E. Fox, March 1991.
            Edited and tested by S. Wartik, April 1991.

Notes:      The other two stages must have been performed already.
**/

#include <stdio.h>

#include "types.h"
#include "support.h"
#include "pmrandom.h"

#ifdef __STDC__
extern int  fit_pattern( arcsType* arcs, verticesType* vertices, int i,
                         char* disk, intSetType* primes,
                         intSetType* slotSet );
extern void initialize_search( arcsType* arcs, verticesType* vertices,
                               char* disk );
extern void initialize_primes( int n, intSetType* primes );
#else
extern int  fit_pattern();
extern void initialize_search();
extern void initialize_primes();
#endif

/*************************************************************************
searching( arcsType*, verticesType* )

Return:    int -- NORM on success, ABNORM on failure.

Purpose:   Search a MPHF for the key set.

Notes:     The "prec" field is used as the "vertex visited" marker.
           The slotSet variable actually is only used in fit_pattern().
           However, since storage for it must be dynamically allocated,
           and since this routine calls fit_pattern() repeatedly, it's
           declared here, where storage can be allocated just once.
**/
int searching( arcs, vertices )
arcsType *arcs;
verticesType *vertices;
{
    int i;                    /* Each vertex in the VS list.       */
    int searching_tries = 0;  /* Running count of searching tries. */
    int status = ABNORM;      /* Condition variable.               */
    char *disk;               /* Simulated hash table.               */
    intSetType slotSet;       /* Set of hash addresses.              */
    intSetType primes;        /* Table of primes for pattern shifts. */

    disk = (char*) owncalloc( arcs->no_arcs, sizeof(char) );
    slotSet.intSetRep = (int*) owncalloc( vertices->maxDegree,
                                          sizeof(int) );
    initialize_primes( arcs->no_arcs, &primes );

    while ( (searching_tries++ < SEARCHINGS) && (status == ABNORM) ) {
        status = NORM;
        initialize_search( arcs, vertices, disk );
        i = vertices->vsHead;          /* Get the highest-level vertex. */
        vertices->vertexArray[i].prec = VISIT;
        while ( i != NP ) {
            /* Fit keys of level of vertex i onto the disk. */
            if ( fit_pattern(arcs, vertices, i, disk, &primes,
                             &slotSet) == ABNORM ) {
                status = ABNORM;       /* Search failed at vertex i. Try */
                break;                 /* a new pattern.                 */
            }
            else  /* Search succeeded. Proceed to next node. */
                i = vertices->vertexArray[i].succ;
        }
    }
    free( disk );
    free( (char *)slotSet.intSetRep );
    free( (char *)primes.intSetRep );
    return(status);
}

/*************************************************************************
fit_pattern( arcsType*, verticesType*, int, char*, intSetType*,
             intSetType* )

Return:    int -- NORM if a fit is found, ABNORM if not.

Purpose:   Compute a pattern for a level and fit it onto the hash
           table.  If a pattern is found, then the g values for
           vertices on that level are set appropriately, and the slots
           on the disk for the vertices are filled.
**/
int fit_pattern( arcs, vertices, i, disk, primes, slotSet )
arcsType *arcs;           /* in: The arcs in the graph.                  */
verticesType *vertices;   /* in out: The vertices in the graph.          */
int i;                    /* in: Vertex's location in vertex-selected    */
                          /*     list.                                   */
char *disk;               /* in out: The hash table (disk).              */
intSetType *primes;       /* in: Prime number table.                     */
intSetType *slotSet;      /* Set of slots taken by keys in this pattern. */
{
    arcType *arc;          /* Current arc.                      */
    int side;              /* Side indicator (0 or 1).          */
    int hashAddress;       /* Hash address being tried.         */
    int shift;             /* Shift value for the pattern.      */
    int no_fits = 0;       /* Running count of attempts to fit. */
    int fitOK = ABNORM;    /* Fit condition variable.           */

    side = (i >= vertices->no_vertices/2);
    while ( (no_fits++ < arcs->no_arcs) && (fitOK == ABNORM) ) {
        fitOK = NORM;
        shift = primes->intSetRep[pmrandom() % primes->count];
        slotSet->count = 0;            /* Initialize slot set to empty. */
        arc = vertices->vertexArray[i].first_edge;
        while ( arc != 0 ) {   /* Iterate over all arcs in this level. */
            /* If the key for arc is at this level, */
            if ( vertices->vertexArray[arc->h12[(side+1)%2]].prec
                 == VISIT ) {
                /* get its hash address. */
                hashAddress = abs( arc->h0 +
                    vertices->vertexArray[arc->h12[0]].g +
                    vertices->vertexArray[arc->h12[1]].g ) % arcs->no_arcs;
                /* See if this key can be put at hashAddress. */
                if ( disk[hashAddress] != EMPTY ) {
                    int k;   /* Collision. Clear marked slots in disk. */

                    for ( k = 0; k < slotSet->count; k++ )
                        disk[slotSet->intSetRep[k]] = EMPTY;
                    fitOK = ABNORM;
                    /* Try a new shift. */
                    vertices->vertexArray[i].g =
                        ( vertices->vertexArray[i].g + shift )
                        % arcs->no_arcs;
                    break;
                }
                else {       /* Success. Remember the address,   */
                             /* and mark the table.              */
                    slotSet->intSetRep[slotSet->count++] = hashAddress;
                    disk[hashAddress] = FULL;
                }
            }  /* end of if */
            arc = arc->next_edge[side];     /* Hash next arc. */
        }  /* end of inner while */
    }  /* end of outer while */
    return(fitOK);
}
/*************************************************************************
initialize_search( arcsType*, verticesType*, char* )

Return:    void

Purpose:   Prepare for the search stage: Put random values in all the
           g fields, mark all vertices un-visited, and empty the disk.
**/
void initialize_search( arcs, vertices, disk )
arcsType *arcs;              /* in: arcs.            */
verticesType *vertices;      /* out: vertices.       */
char *disk;                  /* out: the hash table. */
{
    int i;

    setseed( pmrandom() );                      /* Set the seed. */
    for ( i = 0; i < vertices->no_vertices; i++ ) {
        vertices->vertexArray[i].g = pmrandom() % arcs->no_arcs;
        vertices->vertexArray[i].prec = NOTVISIT;
    }
    /* Reset the hash table. */
    for ( i = 0; i < arcs->no_arcs; disk[i++] = EMPTY );
}
/*************************************************************************
initialize_primes( int, intSetType* )

Return:    void

Purpose:   Set up the prime number table.
**/
void initialize_primes( n, primes )
int n;                     /* in: the size of the hash table.  */
intSetType *primes;        /* out: the prime number table.     */
{
    int i;
    int testingNumber = 2;   /* Testing number for possible primes. */

    primes->intSetRep = (int*) owncalloc( PRIMES, sizeof(int) );
    primes->intSetRep[0] = 1;   /* 1 is added to the table, although */
    primes->count = 1;          /* it is not a prime.                */
    /* Get first PRIMES-1 prime numbers. */
    while ( (testingNumber++ < n) && (primes->count < PRIMES) ) {
        if ( n % testingNumber != 0 ) {
            for ( i = testingNumber - 1; i > 0; i-- )
                if ( testingNumber % i == 0 )
                    break;
            if ( i == 1 )
                primes->intSetRep[primes->count++] = testingNumber;
        }  /* end of if */
    }  /* end of while */
}
/*************************** support.c ***********************************

Purpose:    Provide some useful support routines:
            -- Storage allocators that exit on error (since this isn't
               a subroutine library, there's no need for fancy
               error-handling).
            -- A routine to write the MPHF to a file.
            -- A routine to verify the correctness of a MPHF.

Provenance: Written and tested by Q. Chen and E. Fox, March 1991.
            Edited and tested by S. Wartik, April 1991.

Notes:      None.
**/

#include <stdio.h>

#include "types.h"

#ifdef __STDC__
extern char *malloc( unsigned int size );
extern char *realloc( char *area, unsigned int size );
extern void exit();
#else
extern char *malloc();
extern char *realloc();
extern void exit();
#endif

/*************************************************************************
owncalloc( n, size )

Return:    char * -- Pointer to a chunk of memory.

Purpose:   Allocate a chunk of memory of 'n' elements each of size
           'size'.  Return the pointer to the chunk.  Abort if no
           space is available.
**/
char *owncalloc( n, size )
int n;           /* in: number of elements.   */
int size;        /* in: size of each element. */
{
    char *temp;

    if ( (temp = malloc( (unsigned int)(n*size) )) == 0 ) {
        fputs("Panic: cannot allocate memory.\n", stderr);
        exit(1);
    }
    return(temp);
}
/*************************************************************************
ownrealloc( area, new_size )

Return:    char * -- Pointer to a chunk of memory.

Purpose:   Re-allocate a chunk of memory pointed to by area --
           make it new_size bytes.  Abort if no space is available.
**/
char *ownrealloc( area, new_size )
char *area;          /* in: area to re-allocate. */
int new_size;        /* in: new size.            */
{
    char *temp;

    if ( (temp = realloc( area, (unsigned)new_size )) == 0 ) {
        fputs("Panic: cannot reallocate memory.\n", stderr);
        exit(1);
    }
    return(temp);
}

/*************************************************************************
write_gfun( arcs, vertices, tbl_seed, spec_file )

Return:    void

Purpose:   Write the MPHF specification to a file.
**/
void write_gfun( arcs, vertices, tbl_seed, spec_file )
arcsType *arcs;              /* in: the arcs.                         */
verticesType *vertices;      /* in: the vertices.                     */
int tbl_seed;                /* in: seed used to set up random number */
                             /*     tables.                           */
char *spec_file;             /* in: name of the specification file.   */
{
    int  i;                  /* Iterator through vertices.     */
    FILE *fp;                /* Handle for specification file. */

    if ( (fp = fopen(spec_file, "w")) == NULL ) {
        fprintf(stderr,
                "Can't create hashing specification file \"%s\".\n",
                spec_file);
        exit(1);
    }
    fprintf(fp, "%d\n%d\n%d\n", arcs->no_arcs, vertices->no_vertices,
            tbl_seed);
    for ( i = 0; i < vertices->no_vertices; i++ )
        fprintf(fp, "%d\n", vertices->vertexArray[i].g);
    fclose(fp);
}
/*************************************************************************
verify_mphf( arcs, vertices )

Return:    int -- NORM if MPHF is correct, ABNORM if not.

Purpose:   Verify the computed MPHF is indeed minimal and perfect.
**/
int verify_mphf( arcs, vertices )
arcsType *arcs;              /* in: the arcs.     */
verticesType *vertices;      /* in: the vertices. */
{
    int  i;
    int  hash;               /* Hash value of a key. */
    int  status = NORM;
    char *disk;              /* Hash table.          */

    disk = owncalloc( arcs->no_arcs, sizeof(char) );
    for ( i = 0; i < arcs->no_arcs; disk[i++] = EMPTY );
    for ( i = 0; i < arcs->no_arcs; i++ ) {
        hash = abs( arcs->arcArray[i].h0 +
                    vertices->vertexArray[arcs->arcArray[i].h12[0]].g +
                    vertices->vertexArray[arcs->arcArray[i].h12[1]].g
                  ) % arcs->no_arcs;
        if ( hash < 0 ) {
            fprintf(stderr, "Panic: negative hash value.\n");
            status = ABNORM;
            break;
        }
        if ( disk[hash] == FULL ) {
            fprintf(stderr, "Panic: hash entry collided at");
            fprintf(stderr, " position %d by the %dth word!\n", hash, i);
            status = ABNORM;
            break;
        }
        else
            disk[hash] = FULL;
    }
    free( (char *)disk );
    return(status);
}

/******************************* vheap.c ********************************

Purpose:    Implement a "virtual heap": a combination of stacks and
            a heap.

Provenance: Written and tested by Q. Chen and E. Fox, March 1991.
            Edited and tested by S. Wartik, April 1991.

Notes:      The point of the combination is that a stack is a more
            efficient data structure.  Vertices of low degree
            (specifically, those <= NO_STACKS) are stored in stacks,
            since they are more common.  Vertices of high degree are
            stored in the heap.
**/
#include <math.h>
#include <stdio.h>

#include "types.h"
#include "support.h"
#include "vheap.h"

#define NO_STACKS 6     /* The number of stacks in use.           */
#define DEF_SIZE 10     /* The default size of a heap or a stack. */

typedef struct {                 /* Stack data structure.      */
    int stackTop;                /* Stack top.                 */
    int stackSize;               /* Allocated stack area size. */
    vertexType **stackRep;       /* Stack area.                */
} stackType;

typedef struct {                 /* Heap cell data structure.      */
    int degree;                  /* Key field, containing vertex's */
                                 /* degree.                        */
    vertexType *vertex;          /* Info field, holding vertex's   */
                                 /* address.                       */
} heapCell;

typedef struct {                 /* Heap data structure. */
    int heapTop;                 /* Heap top.            */
    int heapSize;                /* Allocated heap area size. */
    heapCell *heapRep;           /* Heap area.                */
} heapType;

static stackType stacks[NO_STACKS];  /* The stacks of the virtual heap. */
static heapType  heap;               /* The heap portion.               */

#ifdef __STDC__
extern void push( stackType *stack, vertexType *vertex );
extern int  pop( stackType *stack, vertexType **vertex );
extern void enter_heap( int degree, vertexType *vertex );
extern int  remove_from_heap( vertexType **vertex );
#else
extern void push();
extern int  pop();
extern void enter_heap();
extern int  remove_from_heap();
#endif

/*************************************************************************
add_to_vheap( vertex, degree )

Return:    void

Purpose:   Add a vertex of a specified degree to the virtual heap.
**/
void add_to_vheap( vertex, degree )
vertexType *vertex;      /* in: a vertex to be added. */
int degree;              /* in: the vertex's degree.  */
{
    if ( degree > NO_STACKS )
        enter_heap( degree, vertex );
    else
        push( &stacks[degree-1], vertex );
}

/*************************************************************************
max_degree_vertex( vertex )

Return:    int -- NORM if a vertex could be found, ABNORM if the
           virtual heap (stacks and heap) is empty.

Purpose:   Find the unvisited vertex with maximal degree from the
           virtual heap.  Place it in "vertex".

Plan:      First check the heap; remove_from_heap() automatically
           removes a vertex of maximal degree.  If the heap is empty,
           try the stacks, one at a time.
**/
int max_degree_vertex( vertex )
vertexType **vertex;     /* out: the vertex found. */
{
    int i;

    if ( remove_from_heap( vertex ) == NORM )      /* heap empty?   */
        return(NORM);
    for ( i = NO_STACKS - 1; i >= 0; i-- )         /* stacks empty? */
        if ( pop( &stacks[i], vertex ) == NORM )
            return(NORM);
    /* No node at all.  The component has been processed. */
    return(ABNORM);
}

/*************************************************************************
push( stack, vertex )

Return:    void

Purpose:   Push a vertex pointer onto a stack.
**/
static void push( stack, vertex )
stackType *stack;        /* in out: the stack. */
vertexType *vertex;      /* in: the vertex.    */
{
    stack->stackTop++;
    /* Expand stack if it doesn't have enough space. */
    if ( stack->stackTop >= stack->stackSize ) {
        fprintf(stderr, "Warning: stack overflow. Re-allocating.\n");
        stack->stackSize *= 2;
        stack->stackRep = (vertexType**)ownrealloc(
                              (char *)stack->stackRep,
                              sizeof(vertexType*) * stack->stackSize );
    }
    stack->stackRep[stack->stackTop] = vertex;
}

/*************************************************************************
pop( stack, vertex )

Return:    int -- Index of a vertex.  Return -1 if the stack was
           empty, 0 if it wasn't.

Purpose:   Pop up a vertex pointer from the stack.
**/
static int pop( stack, vertex )
stackType *stack;
vertexType **vertex;
{
    if ( stack->stackTop == -1 )       /* stack empty */
        return(-1);
    *vertex = stack->stackRep[stack->stackTop--];
    return(0);                         /* stack not empty */
}

/*************************************************************************
enter_heap( degree, vertex )

Return:    void

Purpose:   Insert a vertex pointer and its degree into the heap.
**/
static void enter_heap( degree, vertex )
int degree;              /* in: the degree of the node. */
vertexType *vertex;      /* in: the vertex pointer.     */
{
    int k = heap.heapTop++;

    if ( k >= heap.heapSize ) {
        heap.heapSize = 2 * heap.heapSize;
        heap.heapRep = (heapCell*)ownrealloc( (char *)heap.heapRep,
                           sizeof(heapCell) * heap.heapSize );
    }
    while ( heap.heapRep[k/2].degree <= degree ) {
        heap.heapRep[k].degree = heap.heapRep[k/2].degree;
        heap.heapRep[k].vertex = heap.heapRep[k/2].vertex;
        k /= 2;
    }
    heap.heapRep[k].degree = degree;
    heap.heapRep[k].vertex = vertex;
}

/*************************************************************************
remove_from_heap( vertex )

Return:    int -- -1 if the heap is empty when the routine is called,
           0 if it isn't.

Purpose:   Remove a vertex of maximal degree from the heap, and
           return it.
**/
static int remove_from_heap( vertex )
vertexType **vertex;     /* out: the vertex selected. */
{
    int k, j;            /* Iterators through the heap.            */
    heapCell tempCell;   /* Heap element currently being examined. */

    if ( heap.heapTop == 1 )
        return(-1);
    *vertex = heap.heapRep[1].vertex;
    heap.heapTop--;
    tempCell.degree = heap.heapRep[heap.heapTop].degree;
    tempCell.vertex = heap.heapRep[heap.heapTop].vertex;
    k = 1;
    while ( k <= heap.heapTop / 2 ) {      /* Go down the heap. */
        j = 2 * k;
        if ( (j < heap.heapTop) &&
             (heap.heapRep[j].degree < heap.heapRep[j+1].degree) )
            j++;
        if ( tempCell.degree > heap.heapRep[j].degree )
            break;
        heap.heapRep[k].degree = heap.heapRep[j].degree;
        heap.heapRep[k].vertex = heap.heapRep[j].vertex;
        k = j;
    }  /* end of while */
    heap.heapRep[k].degree = tempCell.degree;
    heap.heapRep[k].vertex = tempCell.vertex;
    return(0);
}
/*************************************************************************
initialize_vheap()

Return:    void

Purpose:   Set the heap and stacks to their empty states.
**/
void initialize_vheap()
{
    int i;

    for ( i = 0; i < NO_STACKS; stacks[i++].stackTop = -1 );
    heap.heapTop = 1;
    heap.heapRep[0].degree = MAX_INT;
    for ( i = 1; i < heap.heapSize; i++ ) {
        heap.heapRep[i].degree = 0;
        heap.heapRep[i].vertex = 0;
    }
}

/*************************************************************************
free_vheap()

Return:    void

Purpose:   Deallocate space for stacks and heap.
**/
void free_vheap()
{
    int i;

    for ( i = 0; i < NO_STACKS; free((char *)stacks[i++].stackRep) );
    free( (char *)heap.heapRep );
}

/*************************************************************************
allocate_vheap( no_arcs, no_vertices )

Return:    void

Purpose:   Estimate and allocate space for the heap and the stacks.
**/
void allocate_vheap( no_arcs, no_vertices )
int no_arcs;         /* in: number of arcs.     */
int no_vertices;     /* in: number of vertices. */
{
    int i;               /* iteration variable.     */
    double lambda;       /* lambda = E / ( V / 2 )  */
    double Pr0;          /* Pr0 = Pr(X = 0)         */
    double Pri;          /* Pri = Pr(X = i)         */
    double sum;          /* partial sum of degrees. */

    lambda = (double)(2*no_arcs) / (double)no_vertices;
    Pr0 = Pri = exp(-lambda);       /* Compute Pr(X = 0). */
    sum = 0;
    for ( i = 1; i <= NO_STACKS; i++ ) {
        /* Compute the expected number */
        /* of nodes of degree i.       */
        Pri *= lambda/(double)(i);
        stacks[i-1].stackSize = (int)(2 * no_vertices * Pri);
        sum += stacks[i-1].stackSize;
    }
    heap.heapSize = no_vertices - sum - (int)(2 * no_vertices * Pr0);
    if ( heap.heapSize <= 0 )
        heap.heapSize = DEF_SIZE;
    heap.heapRep = (heapCell*) owncalloc( heap.heapSize,
                                          sizeof(heapCell) );
    for ( i = 0; i < NO_STACKS; i++ ) {
        /* Allocate stack space. */
        if ( stacks[i].stackSize <= 0 )
            stacks[i].stackSize = DEF_SIZE;
        stacks[i].stackRep = (vertexType**)
            owncalloc( stacks[i].stackSize, sizeof(vertexType*) );
    }
}
CHAPTER 14: RANKING ALGORITHMS

Donna Harman
National Institute of Standards and Technology

Abstract

This chapter presents both a summary of past research done in the development of ranking algorithms and detailed instructions on implementing a ranking type of retrieval system. This type of retrieval system takes as input a natural language query without Boolean syntax and produces a list of records that "answer" the query, with the records ranked in order of likely relevance. Ranking retrieval systems are particularly appropriate for end-users.

14.1 INTRODUCTION

Boolean systems were first developed and marketed over 30 years ago at a time when computing power was minimal compared with today. Because of this, these systems require the user to provide sufficient syntactical restrictions in their query to limit the number of documents retrieved, and those retrieved documents are not ranked in order of any relationship to the user's query. Although the Boolean systems offer very powerful on-line search capabilities to librarians and other trained intermediaries, they tend to provide very poor service to end-users, particularly those who use the system on an infrequent basis (Cleverdon 1983). These end-users are likely to be familiar with the terminology of the data set they are searching, but lack the training and practice necessary to get consistently good results from a Boolean system because of the complex query syntax required by these systems.

The ranking approach to retrieval seems to be more oriented toward these end-users. This approach allows the user to input a simple query such as a sentence or a phrase (no Boolean connectors) and retrieve a list of documents ranked in order of likely relevance. The main reason the natural language/ranking approach is more effective for end-users is that all the terms in the query are used for retrieval, with the results being ranked based on co-occurrence of query terms, as modified by statistical term-weighting (to be explained later in the chapter). This method eliminates the often-wrong Boolean syntax used by end-users, and provides some results even if a query term is incorrect, that is, it is misspelled, it is not the term used in the data, and so on. The ranking methodology also works well for the complex queries that may be difficult for end-users to express in Boolean logic. For example, "human factors and/or system performance in medical databases" is difficult for end-users to express in Boolean logic because it contains many high- or medium-frequency words without any clear necessary Boolean syntax. The ranking method would do well with this query.

file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo..Books_Algorithms_Collection2ed/books/book5/chap14.htm (1 of 28)7/3/2004 4:21:22 PM
This chapter describes the implementation of a ranking system and is organized in the following manner. Section 14.2 shows a conceptual illustration of how ranking is done. Section 14.3 presents various theoretical models used in ranking and reviews past experiments using these models. Section 14.4 describes results from several experiments directly comparing various ranking schemes. Section 14.5 summarizes the results from sections 14.3 and 14.4, presenting a series of recommended ranking schemes for various situations. Section 14.6 describes the implementation of a basic ranking retrieval system, with section 14.7 showing possible variations to this scheme based on retrieval environments. Section 14.8 discusses some topics closely related to ranking and provides some suggestions for further reading in these areas. Section 14.9 summarizes the chapter.

The terms "record" and "document" are used interchangeably throughout this chapter, as are the terms "data set," "database," and "collection," depending on the retrieval environment being discussed.

14.2 HOW RANKING IS DONE

Assume that a given textual data set uses i unique terms. A document can then be represented by a vector (t1, t2, t3, . . ., tn), where ti has a value of 1 if term i is present, and 0 if term i is absent in the document, and where n corresponds to the number of unique terms in the data set. A query can be represented in the same manner.

The top section of Figure 14.1 shows this representation for a data set with seven unique terms. The second section shows a natural language query and its translation into a conceptual vector, with 1's in vector positions of words included in the query, and 0's to indicate a lack of those words. In the example, the first "1" indicates the presence of the word "factor," the second "1" indicates the presence of the word "information," the first "0" indicates the absence of the word "help," and so on. The third section of Figure 14.1 shows a similar conceptual representation of three documents in this data set. To determine which document best matches the query, a simple dot product of the query vector and each document vector is made (left side of the fourth section) and the results used to rank the documents.

It is possible to perform the same operation using weighted vectors, as shown in the right side of the bottom section of Figure 14.1. In the example, each term is weighted by the total number of times it appears in the record; for example, "information" appears three times in document 1, and "factors" appears twice in document 1. These term-weights could reflect different measures, such as the scarcity of a term in the data set (i.e., "human" probably occurs less frequently than "systems" in a computer science data set), the frequency of a term in the given document (as shown in the example), or some user-specified term-weight. This term-weighting usually provides substantial improvement in the ranking.
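The dot product ranking described above is easy to sketch in code. The fragment below is a minimal illustration, not code from this book; the function name and the vectors are invented for the example and do not correspond to the exact contents of Figure 14.1.

```c
#include <assert.h>

#define NUM_TERMS 7   /* unique terms in a toy data set */

/* Rank score: the dot product of a query vector and a document
   vector.  With 0/1 vectors this counts matching terms; with
   frequency-weighted vectors it sums the weights of the matches. */
int dot_product( int query[], int doc[], int n )
{
    int i, score = 0;

    for ( i = 0; i < n; i++ )
        score += query[i] * doc[i];
    return score;
}

/* Hypothetical vectors for illustration only. */
int query[NUM_TERMS]  = { 1, 1, 0, 1, 0, 0, 0 };  /* binary query     */
int doc1[NUM_TERMS]   = { 1, 1, 0, 0, 1, 0, 0 };  /* binary documents */
int doc2[NUM_TERMS]   = { 0, 1, 1, 0, 0, 0, 0 };
int doc1_w[NUM_TERMS] = { 2, 3, 0, 0, 1, 0, 0 };  /* term frequencies */
```

Ranking then amounts to computing this score for every document and sorting the documents by score; the weighted variation simply substitutes the frequency vector for the binary one.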
1: A simple illustration of statistical ranking 14. Results are presented in a roughly chronological order to provide some sense of the development of knowledge about ranking through these experiments. Although it is not necessary to understand the theoretical models involved in ranking in detail in order to implement a ranking retrieval system. but the principle ones are the vector space model and the probabilistic model. based on the cosine correlation used to measure the cosine of the angle between vectors. All the experimental results presented in the models are based on using standard test collections and using standard recall and precision measures for evaluation.Information Retrieval: CHAPTER 14: RANKING ALGORITHMS Figure 14. Ranking models can be divided into two types: those that rank the query against individual documents and those that rank the query against entire sets of related documents. where tdij = the ith term in the vector for document j file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. it is helpful to know about them as they have provided a framework for the vast majority of retrieval experiments that contributed to the development of the ranking techniques used today. those that rank individual documents against the query. The first type of ranking model. where n corresponds to the number of unique terms in the data set. covers several theoretical models.3 RANKING MODELS AND EXPERIMENTS WITH THESE MODELS In 1957 Luhn published a paper proposing a statistical approach to searching literary information. and documents can be ranked based on that similarity.3. including some small-scale experiments in termweighting..htm (3 of 28)7/3/2004 4:21:22 PM . 14. A vector matching operation.Books_Algorithms_Collection2ed/books/book5/chap14.. see Belkin and Croft [1987])." Maron and Kuhns (1960) went much further by suggesting how to actually weight terms. 
The information retrieval research community has continued to develop many models for the ranking technique over the last 30 years (for an overview. can then be used to compute the similarity between a document and a query.1 The Vector Space Model The sample document and query vectors described in section 14. the higher would be the probability of their representing similar information.2 can be envisioned as an ndimensional vector space. He suggested that "the more two representations agreed in given elements and their distribution.
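As a concrete illustration of the cosine correlation above, the following sketch computes the similarity of a query vector to a set of document vectors and ranks the documents. This code is not from the book; the function names and vectors are invented for illustration.

```python
import math

def cosine_similarity(query, doc):
    # Cosine of the angle between two equal-length term vectors:
    # the inner product divided by the product of the vector lengths.
    dot = sum(q * d for q, d in zip(query, doc))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(d * d for d in doc))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # Rank documents by descending cosine similarity to the query vector.
    scores = [(cosine_similarity(query, d), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)
```

The same function works for both binary vectors (as in the top of Figure 14.1) and weighted vectors, since the cosine's length normalization automatically compensates for document length.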
p. the results varied somewhat depending on the test collection used). 1988).3. These ranking experiments started in 1964 at Harvard University. and experiments with suffixing. These experiments showed that within-document frequency weighting improved performance over no term-weighting (in varying amounts depending on the test collection used). 1983.Books_Algorithms_Collection2ed/books/book5/chap14. 1971. 1973.2 Probabilistic Models Although a model of probabilistic indexing was proposed and tested by Maron and Kuhns (1960). This file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. Only those experiments dealing directly with term-weighting and ranking will be discussed here. they show performance improvements using enhanced query term-weighting measures for queries with term frequencies greater than one. and form a large part of the research in information retrieval. 143) using the SMART system tested an overlap similarity function against the cosine correlation measure and tried simple term-weighting using the frequency of terms within the documents. the major probabilistic model in use today was developed by Robertson and Sparck Jones (1976). the inverted document frequency (IDF. Of particular interest in these experiments were the term-weighting schemes relying on term importance within an entire collection rather than only within a given document. and phrases. they showed that use of the cosine correlation with frequency termweighting provided better performance than the overlap similarity because of the automatic inclusion of document length normalization by the cosine similarity function (again. Salton 1971. normalized by the cosine measure. clustering. A recent paper by Salton and Buckley (1988) summarizes 20 years of SMART experiments in automatic term-weighting by trying 287 distinct combinations of term-weighting assignments. synonyms. These weighting schemes are further discussed in section 14. 
in particular the SMART system experiments under Salton and his associates (1968.Information Retrieval: CHAPTER 14: RANKING ALGORITHMS tqik = the ith term in the vector for query k n = the number of unique terms in the data set This model has been used as the basis for many ranking retrieval experiments.htm (4 of 28)7/3/2004 4:21:22 PM . Early experiments (Salton and Lesk 1968. Besides confirming that the best document termweighting is provided by a product of the within-document term frequency and the IDF. Salton and Yang were able to show significant performance improvement using a term-weighting scheme that combined the within-document frequency weighting with a new term-weighting scheme. The SMART experiments cover many areas of information retrieval such as relevance feedback... Clearly more weight should be given to query terms matching document terms that are rare within a collection. 1981. with or without cosine normalization. moved to Cornell University in 1968.5. A second major set of experiments was done by Salton and Yang (1973) to further develop the termweighting schemes. on six standard collections. Further. Sparck Jones 1972) that is based on the Zipf distribution of a term within the entire collection (see page 373 for the definition of IDF). 14.
This model is based on the premise that terms that appear in previously retrieved relevant documents for a given query should be given a higher weight than if they had not appeared in those relevant documents. In particular, they presented the following table showing the distribution of term t in relevant and nonrelevant documents for query q,

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t

They then use this table to derive four formulas (F1 through F4) that reflect the relative distribution of terms in the relevant and nonrelevant documents, and propose that these formulas be used for term-weighting (the logs are related to actual use of the formulas in term-weighting). Robertson and Sparck Jones also formally derive these formulas, and show that theoretical preference is for F4.

Formula F1 had been used by Barkla (1969) for relevance feedback in an SDI service and by Miller (1971) in devising a probabilistic search strategy for Medlars. Formula F4 (minus the log) is the term precision weighting measure proposed by Yu and Salton (1976).

Robertson and Sparck Jones used these four formulas in a series of experiments with the manually indexed Cranfield collection. They did experiments using all the relevance judgments to weight the terms to see what the optimal performance would be, and also used relevance judgments from half the collection to weight the terms for retrieval from the second half of the collection. In both cases, formula F4 was superior (closely followed by F3), although, as would be expected, there was a large drop in performance between the optimal performance and the "predictive" performance. The experimental verification of the theoretical superiority of F4 provided additional weight to the importance of this new model.

The use of this theory as a predictive device was further investigated by Sparck Jones (1979a), who used a slightly modified version of F1 and F4 and again got much better results for F4 than for F1, even on a large test collection. It was also used by Sparck Jones (1975) in devising optimal performance yardsticks for test collections. Sparck Jones (1979b) tried using this measure (F4 only) in a manner that would mimic a typical on-line session using relevance feedback and found that adding the relevance weighting from only the first couple of relevant documents retrieved by a ranking system still produced performance improvements.
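The F4 equations themselves were lost from this copy of the text, but the theoretically preferred weight can be sketched from the table's definitions. The rendering below, including the 0.5 smoothing commonly added to avoid zeros for small samples, is a standard reading of the Robertson/Sparck Jones relevance weight, not code from the book (the smoothing constant in particular is a later convention).

```python
import math

def f4_weight(N, R, n, r, smooth=0.5):
    # Relevance weight F4: the log odds of term t occurring in relevant
    # versus nonrelevant documents, using the table's four quantities:
    #   N = documents in collection, R = relevant documents for the query,
    #   n = documents having term t, r = relevant documents having term t.
    # smooth=0.5 guards against division by zero (an assumption, see above).
    num = (r + smooth) * (N - n - R + r + smooth)
    den = (n - r + smooth) * (R - r + smooth)
    return math.log(num / den)
```

A term concentrated in the relevant documents receives a large positive weight; a term distributed evenly across relevant and nonrelevant documents receives a weight near zero.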
Work up to this point using probabilistic indexing required the use of at least a few relevant documents, making this model more closely related to relevance feedback than to term-weighting schemes of other models. In 1979 Croft and Harper published a paper detailing a series of experiments using probabilistic indexing without any relevance information. Starting with a probabilistic restatement of F4, they assume that all query terms have equal probability of occurring in relevant documents and derive a term-weighting formula that combines a weight based on the number of matching terms and on a term-weighting similar to the IDF measure,

where

Q = the number of matching terms between document j and query k
C = a constant for tuning the similarity function
ni = the number of documents having term i in the data set
N = the number of documents in the data set

Experimental results showed that this term-weighting produced somewhat better results than the use of the IDF measure alone. Being able to provide different values to C allows this weighting measure to be tailored to various collections. Setting C to 1 ranks the documents by IDF weighting within number of matches, a method that was suitable for the manually indexed Cranfield collection used in this study (because it can be assumed that each matching query term was very significant). C was set much lower in tests with the UKCIS2 collection (Harper 1980) because the terms were assumed to be less accurate and the documents were very short (consisting of titles only).

Croft (1983) expanded his combination weighting scheme to incorporate within-document frequency weights, again using a tuning factor K on these weights to allow tailoring to particular collections,

where

Q = the number of matching terms between document j and query k
IDFi = the IDF weight for term i in the entire collection (see page 373 for the definition of IDF)
freqij = the frequency of term i in document j
K = a constant for adjusting the relative importance of the two weighting schemes
maxfreqj = the maximum frequency of any term in document j

The results show significant improvement over both the IDF weighting alone and the combination weighting, with the scaling factor K playing a large part in tuning the weighting to different collections.
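Since the published equations are missing from this copy, the following is one plausible reading of Croft's combined scheme, built only from the symbol definitions above: each matching query term contributes the tuning constant C plus an IDF-like weight, scaled by a K-smoothed normalized within-document frequency. This is a hedged sketch, not Croft's exact formula; the function names are invented.

```python
import math

def nfreq(freq, maxfreq, K):
    # Normalized within-document frequency with tuning factor K:
    # K near 0 lets raw frequency dominate (automatically indexed, long
    # documents); K near 1 approaches binary weighting (short documents).
    return K + (1 - K) * freq / maxfreq

def combination_match(matching, N, C, K):
    # 'matching' holds one (freqij, maxfreqj, ni) triple per query term
    # that matches document j; N is the collection size. Each match adds
    # a constant C plus an IDF-like log weight, scaled by nfreq.
    score = 0.0
    for freq, maxfreq, n_i in matching:
        score += nfreq(freq, maxfreq, K) * (C + math.log(N / n_i))
    return score
```

Note how the structure matches the text: documents matching more query terms accumulate more C contributions, and rare terms (small ni) contribute larger log weights.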
The best value for K proved to be 0.3 for the automatically indexed Cranfield collection, and 0.5 for the NPL collection, confirming that within-document term frequency plays a much smaller role in the NPL collection with its short documents having few repeating terms.

14.3.3 Other Models for Ranking Individual Documents

Several other models have been used in developing term-weighting measures. The inverse document frequency measure heavily used in implementing both the vector space model and the probabilistic model was derived by Sparck Jones (1972) from observing the Zipf distribution curve for collection vocabulary. Others have tried more complex term distributions, most notably the 2-Poisson model proposed by Bookstein and Swanson (1974) and implemented and tested by Harter (1975) and Raghavan et al. (1983). This distribution model proved much less successful because of the difficulty in estimating the many parameters needed for implementation. Models based on fuzzy set theory have been proposed (for a summary, see Bookstein [1985]) but have not received enough experimental implementations to be used in practice (except when combined with Boolean queries such as in the P-Norm discussed in Chapter 15). The theory of rough sets has been applied to information retrieval (Srinivasan 1989) but similarly has not been developed far enough to be used in practice.

14.3.4 Set-Oriented Ranking Models

The most well known of the set-oriented models are the clustering models, where a query is ranked against a hierarchically grouped set of related documents. This model is the subject of Chapter 16 and will not be further discussed here.

14.4 OTHER EXPERIMENTS INVOLVING RANKING

14.4.1 Direct Comparison of Similarity Measures and Term-Weighting Schemes

There have been several studies examining the various factors involved in ranking that have not been based on any particular model but have instead used some method of comparing directly various similarity measures and term-weighting schemes.

Sparck Jones (1973) explored different types of term frequency weightings involving term frequency within a document, term postings within a document (a binary measure), term frequency within a collection, and term postings within a collection, along with normalizing these measures for document length. She used four collections, with indexing generally taken from manually extracted keywords instead of using full-text indexing. Her results showed that using the term frequency (or postings) within a collection always improved performance, but that using term frequency (or postings) within a document improved performance only for some collections. The various term-weighting schemes were not combined in this experiment.

McGill et al. (1979) examined the literature from different fields to select 67 similarity measures and 39 term-weighting schemes. They used these to rank results from Boolean retrievals using both controlled (manually indexed) and uncontrolled (full-text) indexing. For both controlled and uncontrolled vocabulary they found a significant difference in the performance of similarity measures, with a group of about 15 different similarity measures all performing significantly better than the rest. This group included both the cosine correlation and the inner product function used in the probabilistic models. The term-weighting results were more mixed, with no significant difference found when using controlled vocabulary (i.e., term-weighting made no difference) and an overall significant difference found for uncontrolled vocabulary. There was, however, a lack of significant difference between pairs of term-weighting measures for uncontrolled vocabulary, which could indicate that the difference between linear combinations of term-weighting schemes is significant but that individual pairs of term-weighting schemes are not significantly different.

A different approach was taken by Harman (1986). She selected four term-weighting factors proven important in past research and tried different combinations in order to arrive at an "optimum" term-weighting scheme. Full-text indexing was used on various standard test collections, with full-text indexing also done on the queries. The four factors investigated were: the number of matches between a document and a query, the distribution of a term within a document collection, the frequency of a term
within a document, and the length of the document. Two different measures for the distribution of a term within a document collection were used: the IDF measure by Sparck Jones and a revised implementation of the "noise" measure (Dennis 1964; Salton and McGill 1983). Note that the use of noise here refers to how much a term can be considered useful for retrieval versus being simply a "noisy" term; the noise measure examines the concentration of terms within documents rather than just the number of postings or occurrences. She found that when using the single measures alone, the distribution of the term within the collection improved performance almost twice as much for the Cranfield collection as using only within-document frequency. The noise measure consistently slightly outperformed the IDF (however with no significant difference). Combining the within-document frequency with either the IDF or noise measure, and normalizing for document length, improved results more than twice as much as using the IDF or noise alone in the Cranfield collection. Other collections showed less improvement, but the same relative merit of the term-weighting schemes was found.

14.4.2 Ranking Based on Document Structure

Some ranking experiments have relied more on document or intradocument structure than on the term-weighting described earlier. Bernstein and Williamson (1984) built a ranking retrieval system for a highly structured knowledge base, the Hepatitis Knowledge Base. Their ranking algorithms used not only weights based on term importance both within an entire collection and within a given document, but also on the structural position of the term, such as within summary paragraphs versus within text paragraphs. A very elaborate weighting scheme was devised for this experiment, tailored to the particular structure of the knowledge base.

A very different approach based on complex intradocument structure was used in the experiments involving latent semantic indexing (Lochbaum and Streeter 1989). The indexing and retrieval were based on the singular value decomposition (related to factor analysis) of a term-document matrix from the entire document collection. This was combined with weighting using both a function of term frequency within a document (the root mean square normalization) and a function of term frequency within the entire collection (the noise or entropy measure, or alternatively the IDF measure). The results were generally superior to those using term-weighting alone, although further development is necessary before this approach is fast enough for use in large retrieval systems.

14.4.3 Ranking Techniques Used in Operational Systems

Several operational retrieval systems have implemented ranking algorithms as central to their search mechanism. The SIRE system, as implemented at Syracuse University (Noreault et al. 1977), built a hybrid system using Boolean searching and a vector-model-based ranking scheme, weighting by the use of raw term frequency within documents (for more on the hybrid aspects of this system, see section 14.7.3). A commercial outgrowth of this system, marketed as Personal Librarian, uses ranking based on different factors, including the IDF and the frequency of a term within a document. This system assigns higher ranks to documents matching greater numbers of query terms than would normally be done in the ranking schemes discussed experimentally. In SIBRIS, an operational information retrieval system (Wade et al. 1989), document and query structures are also used to influence the ranking, increasing term-weights for terms in titles of documents and decreasing term-weights for terms added to a query from a thesaurus.

The CITE system, designed as an interface to MEDLINE (Doszkocs 1982), ranked documents based solely on the IDF weighting, as no within-document frequencies were available from the MEDLINE files. For details on the search system associated with CITE, see section 14.6.2. The OKAPI project (Walker and Jones 1987) worked with on-line catalogs and also used the IDF measure alone. Although other small-scale operational systems using ranking exist, often their ranking algorithms are not clear from publications, and so these are not listed here.

14.5 A GUIDE TO SELECTING RANKING TECHNIQUES

In looking at results from all the experiments, some trends clearly emerge.

1. The use of term-weighting based on the distribution of a term within a collection always improves performance (or at minimum does not hurt performance). The IDF measure has been commonly used, either in its form as originally used or in a form somewhat normalized,

where

N = the number of documents in the collection
ni = the total number of occurrences of term i in the collection
maxn = the maximum frequency of any term in the collection

A possible alternative is the noise or entropy measure tried in several experiments,

where

N = the number of documents in the collection
maxnoise = the highest noise of any term in the collection
Freqik = the frequency of term i in document k
TFreqi = the total frequency of term i in the collection
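The noise measure itself can be sketched from these definitions under its standard entropy form (Dennis 1964; Salton and McGill 1983); the chapter's exact formula and its normalization by maxnoise may differ in detail, and the function name is invented.

```python
import math

def noise(term_freqs_per_doc):
    # Entropy-style noise of a term. The argument lists Freqik, the term's
    # frequency in each document k where it occurs; their sum is TFreqi.
    # A term spread thinly over many documents is "noisy" (high value);
    # a term concentrated in few documents is a good discriminator (low value).
    total = sum(term_freqs_per_doc)
    return sum((f / total) * math.log2(total / f) for f in term_freqs_per_doc)
```

For example, a term occurring four times in one document scores 0, while a term occurring once in each of four documents scores 2, reflecting its poorer discrimination value.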
2. The combination of the within-document frequency with the IDF weight often provides even more improvement. There are several reasons why this improvement is inconsistent across collections. First, the effects of within-document frequency may need to be tailored to collections, such as was done by Croft (1983) in using a sliding importance factor K, and by Salton and Buckley (1988) in providing different combination schemes for term-weighting. This tailoring seems to be particularly critical for manually indexed or controlled vocabulary data, where use of within-document frequencies may even hurt performance. A second reason for the inconsistent improvements found for within-document frequencies is the fact that some collections have very short documents (such as titles only) and therefore within-document frequencies play no role in these collections. Finally, it is very important to normalize the within-document frequency in some manner, both to moderate the effect of high-frequency terms in a document (i.e., a term appearing 20 times is not 20 times as important as one appearing only once) and to compensate for document length. This normalization has taken various forms in different experiments, but the lack of proper normalization techniques in some experiments has likely hidden possible improvements. Either of the following normalized within-document frequency measures can be safely used,

where

freqij = the frequency of term i in document j
maxfreqj = the maximum frequency of any term in document j
lengthj = the number of unique terms in document j

3. Assuming within-document term frequencies are to be used, several methods can be used for combining these with the IDF measure. The combination recommended for most situations by Salton and Buckley is given below (a complete set of weighting schemes is presented in their 1988 paper),

where

wij = freqij x IDFi
freqiq = the frequency of term i in query q
maxfreqq = the maximum frequency of any term in query q
IDFi = the IDF of term i in the entire collection
freqij = the frequency of term i in document j

Salton and Buckley suggest reducing the query weighting wiq to only the within-query frequency (freqiq) for long queries containing multiple occurrences of terms, and using only binary weighting of documents (wij = 1 or 0) for collections with short documents or collections using controlled vocabulary.

4. Many combinations of term-weighting can be done using the inner product. Whereas there is more flexibility available here than in the cosine measure, the need for providing normalization of within-document frequencies is more critical. Two possible combinations are given below that calculate the matching strength of a query to document j, with symbol definitions the same as those previously given. In the first, C should be set to low values (near 0) for automatically indexed collections and to higher values such as 1 for manually indexed collections. In the second, K should be set to low values (0.3 was used by Croft) for collections with long (35 or more terms) documents, and to higher values (0.5 or higher) for collections with short documents, reducing the role of within-document frequency. One alternative ranking using the inner product (but without adjustable constants) is given below.

5. It can be very useful to add additional weight for document structure, such as higher weightings for terms appearing in the title or abstract versus those appearing only in the text. This additional weighting needs to be considered with respect to the particular data set being used for searching. User weighting can also be considered as additional weighting, although this type of weighting has generally proven unsatisfactory in the past.
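The combination recommended in trend 3, tf x IDF document weights with cosine length normalization and augmented-frequency query weights, can be sketched as follows. This is one common rendering of Salton and Buckley's recommendation with invented function names; consult their 1988 paper for the exact scheme.

```python
import math

def doc_weights(freqs, idfs):
    # Document term-weights wij = freqij x IDFi, cosine-normalized so that
    # long documents are not favored simply for containing more terms.
    w = [f * i for f, i in zip(freqs, idfs)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm else w

def query_weights(freqs, idfs):
    # Query term-weights: augmented within-query frequency times IDF.
    # For long queries with repeated terms, Salton and Buckley suggest
    # dropping back to the raw frequency freqiq alone.
    maxf = max(freqs)
    return [(0.5 + 0.5 * f / maxf) * i for f, i in zip(freqs, idfs)]

def inner_product(qw, dw):
    # Matching strength of the query to document j.
    return sum(q * d for q, d in zip(qw, dw))
```

Because the document side is already length-normalized, the inner product here behaves like the cosine measure while leaving the query side free to use a different weighting.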
The use of relevance weighting after some initial retrieval is very effective; relevance weighting is discussed further in Chapter 11 on relevance feedback.

14.6 DATA STRUCTURES AND ALGORITHMS FOR RANKING

This section will describe a simple but complete implementation of the ranking part of a retrieval system. The implementation will be described as two interlocking pieces: the indexing of the text and the using (searching) of that index to return a ranked list of record identification numbers (ids). It is assumed that a natural language query is passed to the search process in some manner, and that the list of ranked record id numbers that is returned by the search process is used as input to some routine which maps these ids onto data locations and displays a list of titles or short data descriptors for user selection. The description of the search process does not include the interface issues or the actual data retrieval issues. Modifications of this implementation that enhance its efficiency or are necessary for other retrieval environments are given in section 14.7, with cross-references made to these enhancements throughout this section.

14.6.1 The Creation of an Inverted File

The inverted file described here is a modification to the inverted files described in Chapter 3 on that subject. It would be feasible to use structures other than simple inverted files, such as the more complex structures mentioned in that chapter, as long as the elements needed for ranking are provided. Although it is possible to build a ranking retrieval system without some type of index (either by storing and searching all terms in a document or by using signature files in ranking such as described in section 14.7.5), the use of these indices improves efficiency by several orders of magnitude and provides the necessary speed for searching. The index shown is a straightforward inverted file, created once per major update (thus only once for a static data set). The penalty paid for this efficiency is the need to update the index as the data set changes. Except for data sets with critical hourly updates (such as stock quotes), this is generally not a problem. An enhancement to the indexing program to allow easier updating is given in section 14.7.

The use of a ranking system instead of a Boolean retrieval system has several important implications for supporting inverted file structures.

1. The use of ranking means that there is little need for the adjacency operations or field restrictions
necessary in Boolean systems. This implies that the file to be searched should be as short as possible, creating a much smaller index than for Boolean systems (on the order of 10% to 15% of the text size); only the record id has to be stored as the location for each word. If it is determined that the ranking system must also handle adjacency or field restrictions, then either the index must record the additional location information (field location, word position within record, and so on) as described for Boolean inverted files, or an alternative method (see section 14.7.4) can be used that does not increase storage but increases response time when using these particular operations. The inverted file presented here will assume that only record location is necessary, and the modifications necessary to handle this are given in section 14.7.

2. The use of ranking means that strategies needed in Boolean systems to increase precision are not only unnecessary but should be discarded in favor of strategies that increase recall at the expense of precision. In the area of parsing, this may mean relaxing the rules about hyphenation to create indexing both in hyphenated and nonhyphenated form. In the area of stoplists, it may mean a less restrictive stoplist. For example, in a data set about computers, the ultra-high frequency term "computer" may be in a stoplist for Boolean systems but would not need to be considered a common word for ranking systems. In the area of stemming, a ranking system seems to work better by automatically expanding the query using stemming (Frakes 1984; Harman and Candela 1990) rather than by forcing the user to ask for expansion by wild-cards. A more appropriate stemming strategy for ranking therefore is to use stemming in creation of the inverted file, and an enhancement of this stemming option would be to allow the user to specify a "don't stem" character.

Figure 14.2: Inverted file with frequency information

Although an inverted file with frequency information (Figure 14.2) could be used directly by the search routine, it is usually processed into an improved final format. This format is based on the search methods and the weighting methods used. The above illustration is a conceptual form of the necessary files; the actual form depends on the details of the search routine and on the hardware being used. A common search technique (and the one discussed here) is to use a binary search routine on the file to locate the query words. Therefore, the single file shown containing the terms, record ids, and frequencies is usually split into two pieces for searching: the dictionary containing the term, along with statistics about that term such as number of postings and IDF, and then a pointer to the location of the postings file for that term; and the postings file, which contains the record ids and the weights for all occurrences of the term. In this manner the dictionary used in the binary search has only one "line" per unique term.

Work using large data sets (Harman and Candela 1990) showed that for a file of 2,653 records, there were 5,123 unique terms with an average of 14 postings/term and a maximum of over 2,000 postings for a term. A larger data set of 38,304 records had dictionaries on the order of 250,000 lines (250,000 unique terms, including some numerals) and an average of 88 postings per record. From these statistics it is clear that efficient storage structures for both the binary search and the reading of the postings are critical. Ideally, both files could be read into memory when a data set is opened; somewhat less ideally, only the dictionary could be stored in memory, with disk access for the postings file. Usually, however, both parts of the index must be processed from disk, providing a heavy overhead per posting. This was the option taken by Harman and Candela (1990) in searching on 806 megabytes of data. Recent work on the effective use of inverted files suggests better ways of storing and searching these files (Burkowski 1990; Cutting and Pedersen 1990). More details of the storage and use of these files are given in the description of the search process.

Figure 14.3: A dictionary and postings file

The dictionary and postings file shown (Figure 14.3) store a term-weight of simply the raw frequency of a term in a record. There are four major options for storing weights in the postings file, each having advantages and disadvantages.

1. Store the raw frequency. If this is the actual weight stored, then all the calculations of term-weights must be done in the search routine itself. This produces the slowest search (likely much too slow for large data sets), but is the most flexible system in that term-weighting algorithms can be changed without changing the index.

2. Store a normalized frequency. Any of the normalized frequencies shown in section 14.5 can be used to translate the raw frequency to a normalized frequency, assuming appropriate record statistics have been stored during parsing. Using Harman's normalized frequency as an example, the raw frequency for each term from the final table of the inversion process would be transformed into a log function and then divided by the log of the length of the corresponding record (the lengths of the records were collected and saved in the parsing step), and this normalized frequency would be inserted in the postings file in place of the raw frequency shown. This operation would be done during the creation of the final dictionary and postings file. The same procedure could be done for Croft's normalized frequency or any other normalized frequency used in an inner product similarity function. This option would improve response time considerably over option 1, although option 3 may be somewhat faster (depending on search hardware). The advantage of this term-weighting option is that updating (assuming only the addition of new records and not modification of old ones) would not require the postings to be changed. This option is not suitable, however, for use with the cosine similarity function using Salton's method, as the normalization for length includes the IDF factor.

3. Store the completely weighted term. Any of the combination weighting schemes shown in section 14.5 are suitable, including those using the cosine similarity function. This option allows a simple addition of each weight during the search process, rather than first multiplying by the IDF of the term, and provides very fast response time. The disadvantage of this option is that updating requires changing all postings because the IDF is an integral part of the posting (and the IDF measure changes as any additions are made to the data set).
Additionally. then the full term-weight must be calculated. If the query term is not common.) If the stem is found in the dictionary.7. see Knuth [1973]. If option 3 was used for weighting. The search time for this method is heavily dependent on the number of retrieved records and becomes prohibitive when used on large data sets.ooks_Algorithms_Collection2ed/books/book5/chap14. If option 1 was used for weighting. If option 2 was used for weighting. This makes the searching process relatively independent of the number of retrieved records--only the sort for the final set of ranks is affected by the number of records being sorted. along with the corresponding IDF and the number of postings. and finally sort those records. then the postings records do not have to store weights. In this method. see Doszkocs [1982]). All processing would be done in the search routines. This was the method chosen for the basic search process (see Figure 14. with each term then checked against the stoplist for removal of common terms. A block of storage containing an "accumulator" for every unique record id is reserved.4..htm (16 of 28)7/3/2004 4:21:22 PM .Information Retrieval: CHAPTER 14: RANKING ALGORITHMS additions are made to the data set). then the weight stored in the postings is the normalized frequency of the stem in that record. then this total is immediately available and only a simple addition is needed. a block of storage was used as a hash table to accumulate the total record weights by hashing on the record id into unique "accumulator" addresses (for more details. If no within-record weighting is used. 14.4: Flowchart of search engine The next step is to use the address of the postings and the number of postings to read the record ids. This process can be made much less dependent on the number of records retrieved by using a method developed by Doszkocs for CITE (Doszkocs 1982). The query is parsed using the same parser that was used for the index creation. 
the address of the postings list for that stem is returned.2 Searching the Inverted File One way of using an inverted file to produce statistically ranked output is to first retrieve all records containing the search terms. Loading the necessary record statistics. usually on the order of 300 Kbytes for large data sets. relevance feedback reweighting is difficult using this option.. Figure 14.4). (For algorithms to do efficient binary searches. with the total term-weight for each record id being added to the contents of its unique accumulator. 4.6. as the weight stored in the posting is the raw frequency of the stem in that record. then use the weighting information for each term in those records to compute the total weight for each of those retrieved records. into memory before searching is essential to maintain any reasonable response time for file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. it is then passed through the stemming routine and a binary search for that stem is executed against the dictionary. and for an alternative to binary searching see section 14. and this needs to be multiplied by the IDF of that stem before the addition. such as record length.
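The accumulator-based search just described can be sketched in a few lines. The tiny inverted file, its weights, and the function names below are invented for illustration; the sketch follows option 2, where the stored weight is a normalized frequency that is multiplied by the stem's IDF at search time.

```python
from collections import defaultdict

# Toy inverted file: stem -> (idf, postings); postings = [(record_id, weight), ...].
# The stored weight is a normalized frequency (option 2), so the search
# multiplies by the IDF of the stem before adding into the accumulator.
INVERTED_FILE = {
    "rank":   (1.2, [(1, 0.50), (3, 0.25)]),
    "search": (0.4, [(1, 0.40), (2, 0.60), (3, 0.10)]),
}

def search(query_stems, inverted_file):
    accumulators = defaultdict(float)          # record id -> total weight
    for stem in query_stems:
        entry = inverted_file.get(stem)
        if entry is None:                      # stem not in the dictionary
            continue
        idf, postings = entry
        for record_id, stored_weight in postings:
            accumulators[record_id] += stored_weight * idf
    # Only accumulators with nonzero weights are sorted for the final ranked list.
    return sorted(accumulators.items(), key=lambda rw: rw[1], reverse=True)

ranked = search(["rank", "search"], INVERTED_FILE)
```

Because every posting only adds into its record's accumulator, the work grows with the number of postings touched, not with the number of records finally retrieved, which is the point of the hash-table method above.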
As each query term is processed, its postings cause further additions to the accumulators. When all the query terms have been handled, accumulators with nonzero weights are sorted to produce the final ranked record list.

Several time savings are possible in this basic search process. First, the I/O needs to be minimized. As some terms have thousands of postings for large data sets, doing a separate read for each posting can be very time-consuming. A read of one byte essentially takes the same time as a read of many bytes (a buffer full) and this factor can be utilized by doing a single read for all the postings of a given term, and then separating the buffer into record ids and weights, with the postings pointer in the dictionary being used to control the location of the read operation, and the number of postings (also stored in the dictionary) being used to control the length of the read (and the separation of the buffer). This requires a sequential storage of the postings in the index. Harman and Candela (1990) found that almost every user query had at least one term that had postings in half the data set, and usually at least three quarters of the data set was involved in most queries. As many unique postings are involved in most queries, the total time savings may be considerable.

A second time savings can be gained at the expense of some memory space. Whereas the storage for the "accumulators" can be hashed to avoid having to hold one storage area for each data set record, this is definitely not necessary for smaller data sets. Some time is saved by direct access to memory rather than through hashing. The time saved may be considerably less, however, and may not be useful except for extremely large data sets such as those used in CITE (which need even more modification, see section 14.7.2).

This same logic could be applied to the binary search of the dictionary, which takes about 14 reads per search for the larger data sets. A final time savings on I/O could be done by loading the dictionary into memory when opening a data set.

A final major bottleneck can be the sort step of the "accumulators" for large data sets. Even a fast sort of thousands of records is very time consuming. An enhancement can be made to reduce the number of records sorted (see section 14.7.5).

14.7 MODIFICATIONS AND ENHANCEMENTS TO THE BASIC INDEXING AND SEARCH PROCESSES

There are many possible modifications and enhancements to the basic indexing and search processes, some of which are necessary because of special retrieval environments (those involving large and very large data sets are discussed), and some of which are techniques for enhancing response time or improving ease of updating. Each of the following topics deals with a specific set of changes that need to be made in the basic indexing and/or search routines to allow the particular enhancement being discussed. The level of detail is somewhat less than in section 14.6, either because less detail is available or because the implementation of the technique is complex and details are left out in the interest of space.
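The single-read optimization can be illustrated with a small sketch. The binary posting layout (a 32-bit record id plus a 32-bit float weight) and all names below are assumptions for illustration, not the format used by Harman and Candela; the point is that the dictionary's offset and postings count let one read fetch everything, after which the buffer is separated into record ids and weights.

```python
import io
import struct

# Hypothetical on-disk posting: record id (uint32) followed by weight (float32).
POSTING = struct.Struct("<If")

def write_postings(f, postings):
    """Append postings sequentially; return (offset, count) for the dictionary."""
    offset = f.tell()
    for rec_id, weight in postings:
        f.write(POSTING.pack(rec_id, weight))
    return offset, len(postings)

def read_postings(f, offset, count):
    """One seek and one read for all postings of a term, then split the buffer."""
    f.seek(offset)
    buffer = f.read(POSTING.size * count)
    return [POSTING.unpack_from(buffer, i * POSTING.size) for i in range(count)]

f = io.BytesIO()                      # stands in for the postings file on disk
offset, count = write_postings(f, [(7, 0.5), (9, 1.25)])
postings = read_postings(f, offset, count)
```

The sequential layout is what makes this possible: if the postings for a term were scattered, each would need its own seek and read.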
It should be noted that, unlike section 14.6, some of the implementations discussed here should be used with caution, as they are usually more experimental and may have unknown problems or side effects.

14.7.1 Handling Both Stemmed and Unstemmed Query Terms

It was observed by Frakes (1984) and confirmed by Harman and Candela (1990) that if query terms were automatically stemmed in a ranking system, users generally got better results. In some cases, however, a stem is produced that leads to improper results, causing query failure, as the original record terms are not stored in the inverted file; only their stems are used. The basic indexing and search processes described in section 14.6 suggest no manner of coping with this problem. Clearly two separate inverted files could be created and stored, one for stems and one for the unstemmed terms. Query terms would normally use the stemmed version, but query terms marked with a "don't stem" character would be routed to the unstemmed version. Whereas this would solve the problem for smaller data sets, it creates a storage problem for the large data sets. The following technique was developed for the prototype retrieval system described in Harman and Candela (1990) to handle this problem, but it is not thought to be an optimal method.

A hybrid inverted file was devised to merge these files. An example of the merged inverted file is shown in Figure 14.5. This hybrid dictionary is in alphabetic stem order, with the terms sorted within the stem, and contains the stem, the term, a bit to indicate if the term is stemmed or not stemmed, the number of postings and IDF of the stem, the number of postings and IDF of the term, and the offset of the postings for this stem/term combination. Note that the merged dictionary takes one line per unstemmed term, saving no space in the dictionary part, making it considerably larger than the stemmed dictionary, and resulting in longer binary searches for most terms (which will be stemmed). The hybrid postings list saves the storage necessary for one copy of the record id by merging the stemmed and unstemmed weight (creating a postings element of 3 positions for stemmed terms), but saving considerable storage over that needed to store two versions of the postings. Terms that have no stem for a given data set only have the basic 2-element postings record.

Figure 14.5: Merged dictionary and postings file

As can be expected, the search process needs major modifications to handle these hybrid inverted files. Each query term that is stemmed must now map to multiple dictionary entries, and postings lists must be handled more carefully as some terms have three elements in their postings list and some have only two. This storage savings is at the expense of some additional search time and therefore may not be the optimal solution. Clearly, for data sets that are relatively small it is best to use the two separate inverted files because the storage savings are not large enough to justify the additional complexity in indexing and searching.
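The merged layout can be sketched as follows. The dictionary entries, the 3-element postings, and every name here are invented for illustration of the structure described above: one record id shared by a stemmed and an unstemmed weight, with a stemmed query term mapping to every dictionary entry that shares its stem.

```python
# Toy merged inverted file (section 14.7.1 style). Postings carry three
# elements for stemmed terms: (record_id, stemmed_weight, unstemmed_weight).
MERGED_POSTINGS = {
    "ranked":  [(1, 0.7, 0.3), (2, 0.5, 0.5)],
    "ranking": [(1, 0.7, 0.4)],
}
# Dictionary in alphabetic stem order, terms sorted within the stem.
DICTIONARY = [("rank", "ranked"), ("rank", "ranking")]

def lookup(query_term, stem, dont_stem=False):
    if dont_stem:
        # A "don't stem" query term is routed to the unstemmed weights
        # of this exact term only.
        return [(rid, w_term) for rid, _, w_term in MERGED_POSTINGS.get(query_term, [])]
    # A stemmed query term maps to every dictionary entry sharing the stem;
    # the stemmed weight is the same for each record regardless of variant.
    result = {}
    for entry_stem, term in DICTIONARY:
        if entry_stem == stem:
            for rid, w_stem, _ in MERGED_POSTINGS[term]:
                result[rid] = w_stem
    return sorted(result.items())

exact = lookup("ranked", "rank", dont_stem=True)   # unstemmed weights only
merged = lookup("ranking", "rank")                 # all terms with stem "rank"
```

The extra search cost shows up in the loop over dictionary entries per stem, which is the "additional search time" the text trades against storing two full inverted files.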
Possibly the use of two separate dictionaries, both mapping to the same hybrid postings file, would improve search time without the loss of storage efficiency, but this has not been tried.

14.7.2 Searching in Very Large Data Sets

In 1982, MEDLINE had approximately 600,000 records, with records being added at a rate of approximately 21,000 per month (Doszkocs 1982). The use of the fixed block of storage to accumulate record weights that is described in the basic search process (section 14.6) becomes impossible for this huge data set. Doszkocs solved the problem in his experimental front-end to MEDLINE (the CITE system) by segmenting the inverted file into 8K segments, each holding about 48,000 on-line records. The basic search process is therefore unchanged except that instead of each record of the data set having a unique accumulator, the accumulators hold only a subset of the records and each subset is processed as if it were the entire data set, with each set of results shown to the user. The subsetting or segmenting is done in reverse chronological order. Because users are often most concerned with recent records, they seldom request to search many segments. For more details see Doszkocs (1982).

14.7.3 A Boolean System with Ranking

There are many ways to combine Boolean searches and ranking. Very elaborate schemes have been devised that combine Boolean with ranking, and references are made to these in section 14.8. However, none of these schemes involve extensions to the basic search process in section 14.6. A simple extension of the basic search process in section 14.6 can be made that allows noncomplex Boolean statements to be handled (see section 14.8.3). This extension, however, limits the Boolean capability and increases response time when using Boolean operators.

The SIRE system (Noreault, Koll, and McGill 1977) incorporates a full Boolean capability with a variation of the basic search process. The system accepts queries that are either Boolean logic strings (similar to many commercial on-line systems) or natural language queries (processed as Boolean queries with implicit OR connectors between all query terms). A Boolean query is processed in two steps. After stemming, each term in the query is checked against the inverted file (this could be done by using the binary search described in section 14.6). The record ids and raw frequencies for the term being processed are combined with those of the previous set of terms according to the appropriate Boolean logic, and this is continued until the entire Boolean query is processed. Note that this combining of sets for complex Boolean queries can be a complicated operation. The user may request ranked output. If ranked output is wanted, a running sum containing the numerator of the cosine similarity is updated by adding the new record frequencies, the denominator of the cosine is computed from previously stored document lengths and the query length, and the records are sorted based on their similarity to the query.
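The two-step SIRE-style processing can be sketched as follows. The set-combination logic and the cosine normalization follow the description above, but the postings, document lengths, and all names are invented for illustration (and a real system would also handle NOT and nested expressions).

```python
import math

def boolean_merge(op, left, right):
    """Combine two postings maps (record id -> accumulated raw-frequency sum)
    according to the Boolean operator; the sum is the cosine numerator."""
    if op == "and":
        keys = left.keys() & right.keys()
    elif op == "or":
        keys = left.keys() | right.keys()
    else:
        raise ValueError(op)
    return {k: left.get(k, 0) + right.get(k, 0) for k in keys}

def rank(merged, doc_lengths, query_length):
    """Divide each surviving record's running sum by the stored document
    length times the query length, then sort by similarity."""
    scores = {rid: s / (doc_lengths[rid] * query_length) for rid, s in merged.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

a = {1: 2, 2: 5}          # postings of term A (raw frequencies)
b = {1: 3, 3: 4}          # postings of term B
merged = boolean_merge("and", a, b)
ranked = rank(merged, doc_lengths={1: 2.0}, query_length=math.sqrt(2.0))
```

Only the records selected by the Boolean logic ever reach the similarity computation and the sort, which is why this variation can be faster for Boolean queries than ranking the whole accumulator block.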
Whereas the cosine similarity is used here with raw frequency term-weighting only (at least in the experiment described in Noreault, Koll and McGill [1977]), any of the term-weighting functions described in section 14.5 could be used. There are no modifications to the basic inverted file needed unless adjacency, field restrictions, and other such types of Boolean operations are desired. The major modification to the basic search process is to correctly merge postings from the query terms based on the Boolean logic in the query before ranking is done. As the final computations of the similarity measure and the sorting of the ranks are done only for those records that are selected by the Boolean logic, this enhancement probably has a faster response time for Boolean queries, and no increase in response time for natural language queries compared to the basic search process described in section 14.6.

14.7.4 Hashing into the Dictionary and Other Enhancements for Ease of Updating

The basic inverted file creation and search process described in section 14.6 assumes a fairly static data set or a willingness to do frequent updates to the entire inverted file. For smaller data sets, or for environments where ease of update and flexibility are more important than query response time, the inverted file could have a structure more conducive to updating. This was done in Croft's experimental retrieval system (Croft and Ruggles 1984). Their inverted file consists of the dictionary containing the terms and pointers to the postings file, but the dictionary is not alphabetically sorted. Instead it is a bucketed (10 slots/bucket) hash table that is accessed by hashing the query terms to find matching entries. Not only is this likely to be a faster access method than the binary search, but it also creates an extendable dictionary, with no reordering for updates. This necessity for ease of update also changes the postings structure, which becomes a series of linked variable length lists capable of infinite update expansion. The term-weighting is done in the search process using the raw frequencies stored in the postings lists. This system therefore is much more flexible and much easier to update than the basic inverted file and search process described in section 14.6. Although the hash access method is likely faster than a binary search, the processing of the linked postings records and the search-time term-weighting will hurt response time considerably. This is not a major factor for small data sets and for some retrieval environments, especially those involved in research into new retrieval mechanisms.

Note that the binary search described in the basic search process could be replaced with the hashing method to further decrease response time for searching using the basic search process. This would require a different organization of the final inverted index file that contains the dictionary, but would not affect the postings lists (which would be sequentially stored for search time improvements).

14.7.5 Pruning

A major time bottleneck in the basic search process is the sort of the accumulators for large data sets. Various methods have been developed for dealing with this problem. Buckley and Lewit (1985) presented an elaborate "stopping condition" for reducing the number of accumulators to be sorted without significantly affecting performance.
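The sort bottleneck can also be reduced with a partial top-k selection, which keeps only the k best accumulators instead of sorting all of them. This uses the Python standard library's `heapq.nlargest` as a stand-in; it is not one of the specific methods cited in this section, just a compact illustration of the idea of avoiding a full sort.

```python
import heapq
import random

# Invented accumulator contents: record id -> accumulated weight.
random.seed(1)
accumulators = {rid: random.random() for rid in range(10_000)}

# Heap-based selection of the 10 best records, without sorting all 10,000.
top10 = heapq.nlargest(10, accumulators.items(), key=lambda kv: kv[1])

# Equivalent (but more expensive) full sort, for comparison.
full_sort = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
```

For n accumulators and k results this is O(n log k) rather than O(n log n), which matters exactly in the large-data-set case the text describes.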
Perry and Willett (1983) and Lucarella (1983) also described methods of reducing the number of cells involved in this final sort. Harman and Candela (1990) experimented with various pruning algorithms using this method, looking for an algorithm that not only improved response time, but did not significantly hurt retrieval results. The following method serves only as an illustration of a very simple pruning procedure, with an example of the time savings that can be expected using a pruning technique on a large data set.

This method is based on the fact that most records for queries are retrieved based on matching only query terms of high data set frequency; as they are seldom, if ever, useful, these records serve only to increase sort time. In this method, high-frequency (low IDF) terms are allowed to only increment the weights of already selected record ids, that is, not select a new record. Their changed search algorithm with pruning is as follows:

1. Sort all query terms (stems) by decreasing IDF value.

2. Do a binary search for the first term (i.e., the highest IDF) and get the address of the postings list for that term.

3. Read the entire postings file for that term into a buffer and add the term weights for each record id into the contents of the unique accumulator for the record id.

4. Check the IDF of the next query term.

5. If the IDF is greater than or equal to one third the maximum IDF of any term in the data set, then repeat steps 2, 3, and 4.

6. Otherwise repeat steps 2, 3, and 4, but do not add weights to zero weight accumulators.

7. Sort the accumulators with nonzero weights to produce the final ranked record list.

If a query has only high-frequency terms (several user queries had this problem), then pruning cannot be done (or a fancier algorithm needs to be created). A check needs to be made after step 1 for this. Note that records containing only high-frequency terms will not have any weight added to their accumulator and therefore are not sorted. These records can be retrieved in the normal manner, but pruned before addition to the retrieved record list (and therefore not sorted).

Table 14.1 shows some timing results of this pruning algorithm. The test queries are those brought in by users during testing of a prototype ranking retrieval system.

Table 14.1: Response Time

                                               Size of Data Set
                                       1.6 Meg   50 Meg   268 Meg   806 Meg
---------------------------------------------------------------------------
Number of queries                         13       38        17        17
Average number of terms per query        3.5      3.6       4.1       4.2
Average number of records retrieved      797     2843      5869     22654
Average response time per query
  (no pruning)                          0.38     1.58       3.1      14.5
Average response time per query
  (pruning)                             0.28      0.6       1.1       2.6
---------------------------------------------------------------------------

The response time for the 806 megabyte data set assumes parallel processing of the three parts of the data set, and would be longer if the data set could not be processed in parallel. As can be seen, the response times are greatly affected by pruning. The other pruning techniques mentioned earlier should result in the same magnitude of time savings, making pruning techniques an important issue for ranking retrieval systems needing fast response times.

14.8 TOPICS RELATED TO RANKING

14.8.1 Ranking and Relevance Feedback

Ranking retrieval systems and relevance feedback have been closely connected throughout the past 25 years of research. Relevance feedback was one of the first features to be added to the basic SMART system (Salton 1971), and is the foundation for the probabilistic indexing model (Robertson and Sparck Jones 1976).
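The simple pruning procedure of section 14.7.5 can be sketched as follows. The one-third-of-maximum-IDF threshold and the check for all-high-frequency queries follow the description there; the toy inverted file and all names are invented for illustration.

```python
from collections import defaultdict

def pruned_search(query_stems, inverted_file, max_idf):
    """Process query stems in decreasing IDF order; once a stem's IDF falls
    below one third of the data set's maximum IDF, its postings may only add
    weight to accumulators that already have nonzero weight."""
    terms = sorted((inverted_file[s] for s in query_stems if s in inverted_file),
                   key=lambda e: e[0], reverse=True)       # step 1: decreasing IDF
    if not terms or terms[0][0] < max_idf / 3:
        # Check after step 1: a query of only high-frequency terms cannot be pruned.
        raise ValueError("only high-frequency terms: pruning cannot be done")
    accumulators = defaultdict(float)
    for idf, postings in terms:
        prune = idf < max_idf / 3
        for rec_id, weight in postings:
            if prune and rec_id not in accumulators:
                continue                # low-IDF terms do not select new records
            accumulators[rec_id] += weight * idf
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

INVERTED = {"rare":   (3.0, [(1, 1.0), (2, 1.0)]),
            "common": (0.5, [(2, 1.0), (3, 1.0)])}
ranked = pruned_search(["common", "rare"], INVERTED, max_idf=3.0)
```

In the toy run, record 3 matches only the common term and so never gets an accumulator, which is exactly the class of record the pruning is meant to keep out of the final sort.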
Whereas ranking can be done without the use of relevance feedback, retrieval will be further improved by the addition of this query modification technique. For further details, see Chapter 11.

14.8.2 Ranking and Clustering

Ranking retrieval systems have also been closely associated with clustering. Early efforts to improve the efficiency of ranking systems for use in large data sets proposed the use of clustering techniques to avoid dealing with ranking the entire collection (Salton 1971). It was also suggested that clustering could improve the performance of retrieval by pregrouping like documents (Jardine and van Rijsbergen 1971). For further details on clustering and its use in ranking systems, see Chapter 16.

14.8.3 Ranking and Boolean Systems

Because of the predominance of Boolean retrieval systems, several attempts have been made to integrate the ranking model and the Boolean model (for a summary, see Bookstein [1985]). The only methodology for this that has received widespread testing using the standard collections is the P-Norm method allowing the use of soft Boolean operators. This method is well described in Salton and Voorhees (1985) and in Chapter 15.

14.8.4 Use of Ranking in Two-level Search Schemes

The basic ranking search methodology described in the chapter is so fast that it is effective to use in situations requiring simple restrictions on natural language queries. Examples of these types of restrictions would be requirements involving Boolean operators, proximity operators, specific authors, special publication dates, or the use of phrases instead of simple terms. These situations can be accommodated by the basic ranking search system using a two-level search. The input query is processed similarly to a natural language query, except that the system notes the presence of special syntax denoting phrase limits or other field or proximity limitations. Using the following examples

clustering using "nearest neighbor" techniques
efficient clustering techniques [Author Willett]

the queries would be parsed into single terms and the documents ranked as if there were no special syntax. An efficient file structure is used to record which query term appears in which given retrieved document. The list of ranked documents is returned as before, but only documents passing the added restriction are given to the user. This usually requires a second pass over the actual document, that is, each document marked as containing "nearest" and "neighbor" is passed through a fast string search algorithm looking for the phrase "nearest neighbor," or all documents containing "Willett" have their author field checked for "Willett." Although this seems a tedious method of handling phrases or field restrictions, it can be done in parallel with user browsing operations so that users are often unaware that a second processing step is occurring. This method was used in the prototype built by Harman and Candela (1990) and provided a very effective way of handling phrases and other limitations without increasing indexing overhead.
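The two-level scheme can be sketched in miniature: documents are ranked as if the query had no special syntax, and a second pass over each candidate applies the restriction, here a phrase check via a simple substring search. The documents and the first-level ranking below are invented for illustration.

```python
# Toy document store: record id -> full text.
DOCS = {
    1: "efficient nearest neighbor clustering techniques",
    2: "the neighbor was nearest to the cluster",   # both terms, but not the phrase
}

def second_pass(ranked_ids, phrase, docs):
    """Keep ranked order; return only documents passing the added restriction."""
    return [rid for rid in ranked_ids if phrase in docs[rid]]

first_level = [2, 1]          # ranked on the single terms "nearest", "neighbor"
result = second_pass(first_level, "nearest neighbor", DOCS)
```

A production system would use a proper string search algorithm (see Chapter 10) rather than a substring test, and would check author or date fields the same way, but the two-level structure is the same.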
pp. J. BERNSTEIN. Croft and Savino (1988) provide a ranking technique that combines the IDF measure with an estimated normalized within-document frequency. Signature files have also been used in SIBRIS." in Annual Review of Information Science and Technology." J. which is based on a two-stage search using signature files for a first cut and then ranking retrieved documents by term-weighting. and detailed the actual implementation of a basic ranking retrieval system. and R. 35(4). and W. "Retrieval Techniques. M. see Chapter 4 on that subject). an operational information retrieval system (Wade et al. 1985. ed. Bedford. Williams." Paper presented at the Second International Cranfield Conference on Mechanized Information Storage and Retrieval Systems. England. BARKLA. B.. 14. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. E. Cranfield.Information Retrieval: CHAPTER 14: RANKING ALGORITHMS restrictions. 1989). New York: Elsevier Science Publishers. M. N.. This method was used in the prototype built by Harman and Candela (1990) and provided a very effective way of handling phrases and other limitations without increasing indexing overhead. J. using simple modifications of the standard signature file technique (see the chapter on signature files). "Testing of a Natural Language Retrieval System for a Full Text Knowledge Base. New York: Knowledge Industry Publications. Extensions to this basic system have been shown that modify the basic system to efficiently handle different retrieval environments." in Williams. "Probability and Fuzzy-Set Applications to Information Retrieval. it can be done in parallel with user browsing operations so that users are often unaware that a second processing step is occurring. M.). ed.9 SUMMARY This chapter has presented a survey of statistical ranking models and experiments.8. 
That study also suggests that the ability of a ranking system to use the smaller inverted files discussed in this chapter makes storage and efficiency of ranking techniques competitive with that of signature files. 1987. Annual Review of Information Science and Technology.htm (24 of 28)7/3/2004 4:21:22 PM . REFERENCES BELKIN. Inc.. (Ed. American Society for Information Science. 117-51. K. M. 109-45. Williams. pp. CROFT. 14. 235-47. 1969. 1984. WILLIAMSON. BOOKSTEIN. L.ooks_Algorithms_Collection2ed/books/book5/chap14.5 Ranking and Signature Files It is possible to provide ranking using signature files (for details on signature files. A. "Construction of Weighted Term Profiles by Measuring Frequency and Specificity in Relevant Items.
BOOKSTEIN, A., and D. KRAFT. 1977. "Operations Research Applied to Document Indexing and Retrieval Decisions." J. Association for Computing Machinery, 24(3), 418-27.

BOOKSTEIN, A., and D. SWANSON. 1974. "Probabilistic Models for Automatic Indexing." J. American Society for Information Science, 25, 312-19.

BUCKLEY, C., and A. LEWIT. 1985. "Optimization of Inverted Vector Searches." Paper presented at the Eighth International Conference on Research and Development in Information Retrieval, Montreal, Canada, Association for Computing Machinery.

BURKOWSKI, F. 1990. "Surrogate Subsets: A Free Space Management Strategy for the Index of a Text Retrieval System." Paper presented at ACM Conference on Research and Development in Information Retrieval, Brussels, Belgium.

CLEVERDON, C. 1984. "Optimizing Convenient Online Access to Bibliographic Databases." Information Services and Use, 4(1/2), 37-47.

COOPER, W. S., and M. E. MARON. 1978. "Foundations of Probabilistic and Utility-Theoretic Indexing." J. Association for Computing Machinery, 25(1), 67-80.

CROFT, W. B. 1983. "Experiments with Representation in a Document Retrieval System." Information Technology: Research and Development, 2(1), 1-21.

CROFT, W. B., and D. HARPER. 1979. "Using Probabilistic Models of Document Retrieval Without Relevance Information." J. Documentation, 35(4), 285-95.

CROFT, W. B., and L. RUGGLES. 1984. "The Implementation of a Document Retrieval System," in Research and Development in Information Retrieval, eds. G. Salton and H. Schneider. Berlin: Springer-Verlag, pp. 28-37.

CROFT, W. B., and P. SAVINO. 1988. "Implementing Ranking Strategies Using Text Signatures." ACM Transactions on Office Information Systems, 6(1), 42-62.

CUTTING, D., and J. PEDERSEN. 1990. "Optimizations for Dynamic Inverted Index Maintenance." Paper presented at ACM Conference on Research and Development in Information Retrieval, Brussels, Belgium.

DENNIS, S. F. 1964. "The Construction of a Thesaurus Automatically from a Sample of Text." Paper presented at the Statistical Association Methods for Mechanized Documentation (National Bureau of Standards Miscellaneous Publication 269).
DOSZKOCS, T. E. 1982. "From Research to Application: The CITE Natural Language Information Retrieval System," in Research and Development in Information Retrieval, eds. G. Salton and H. Schneider. Berlin: Springer-Verlag, pp. 251-62.

FRAKES, W. B. 1984. "Term Conflation for Information Retrieval." Paper presented at the Third Joint BCS and ACM Symposium on Research and Development in Information Retrieval, Cambridge, England.

HARMAN, D. 1986. "An Experimental Study of Factors Important in Document Ranking." Paper presented at ACM Conference on Research and Development in Information Retrieval, Pisa, Italy, Association for Computing Machinery.

HARMAN, D., and G. CANDELA. 1990. "Retrieving Records from a Gigabyte of Text on a Minicomputer Using Statistical Ranking." J. American Society for Information Science, in press.

HARPER, D. 1980. Relevance Feedback in Document Retrieval Systems: An Evaluation of Probabilistic Strategies. Doctoral dissertation, Jesus College, Cambridge, England.

HARTER, S. 1975. "A Probabilistic Approach to Automatic Keyword Indexing." J. American Society for Information Science, 26(5), 280-89.

JARDINE, N., and C. J. VAN RIJSBERGEN. 1971. "The Use of Hierarchic Clustering in Information Retrieval." Information Storage and Retrieval, 7(5), 217-40.

KNUTH, D. E. 1973. The Art of Computer Programming. Reading, Mass.: Addison-Wesley.

LOCHBAUM, K., and L. STREETER. 1989. "Comparing and Combining the Effectiveness of Latent Semantic Indexing and the Ordinary Vector Space Model for Information Retrieval." Information Processing and Management, 25(6), 665-76.

LUCARELLA, D. 1983. "A Document Retrieval System Based on Nearest Neighbor Searching." J. of Information Science, 6, 25-33.

LUHN, H. P. 1957. "A Statistical Approach to Mechanized Encoding and Searching of Literary Information." IBM J. Research and Development, 1(4), 309-17.

MARON, M. E., and J. KUHNS. 1960. "On Relevance, Probabilistic Indexing and Information Retrieval." J. Association for Computing Machinery, 7(3), 216-44.

MCGILL, M., M. KOLL, and T. NOREAULT. 1979. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. Report from the School of Information Studies, Syracuse University, Syracuse, New York.
MILLER, W. L. 1971. "A Probabilistic Search Strategy for Medlars." J. Documentation, 27(4), 254-66.

NOREAULT, T., M. KOLL, and M. MCGILL. 1977. "Automatic Ranked Output from Boolean Searches in SIRE." J. American Society for Information Science, 28(6), 333-39.

PERRY, S. A., and P. WILLETT. 1983. "A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems." J. Information Science, 6, 59-66.

RAGHAVAN, V., SHI, and C. YU. 1983. "Evaluation of the 2-Poisson Model as a Basis for Using Term Frequency Data in Searching." Paper presented at the Sixth International Conference on Research and Development in Information Retrieval, Bethesda, Maryland, Association for Computing Machinery.

ROBERTSON, S. E., and K. SPARCK JONES. 1976. "Relevance Weighting of Search Terms." J. American Society for Information Science, 27(3), 129-46.

SALTON, G. 1971. The SMART Retrieval System -- Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice Hall.

SALTON, G., and C. BUCKLEY. 1988. "Term-Weighting Approaches in Automatic Text Retrieval." Information Processing and Management, 24(5), 513-23.

SALTON, G., and M. LESK. 1968. "Computer Evaluation of Indexing and Text Processing." J. Association for Computing Machinery, 15(1), 8-36.

SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SALTON, G., and H. WU. 1981. "The Measurement of Term Importance in Automatic Indexing." J. American Society for Information Science, 32(3), 175-86.

SALTON, G., and C. S. YANG. 1973. "On the Specification of Term Values in Automatic Indexing." J. Documentation, 29(4), 351-72.

SPARCK JONES, K. 1972. "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." J. Documentation, 28(1), 11-20.

SPARCK JONES, K. 1973. "Index Term Weighting." Information Storage and Retrieval, 9(11), 619-33.

SPARCK JONES, K. 1975. "A Performance Yardstick for Test Collections." J. Documentation, 31(4), 266-72.
and G. M. British Library Research Paper 24.htm (28 of 28)7/3/2004 4:21:22 PM . Go to Chapter 15 Back to Table of Contents file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD. 347-61. VAN RIJSBERGEN. Documentation. P. 1976. WALKER. JONES. 25(4)." Information Processing and Management. 30-48...An Effective Automatic Indexing Method. 23(1). 35(1). 294-317. K. Association for Computing Machinery. C.. 1979b. K." J. 1976. 1979a." J. 1989. YU. 15(3). Documentation.Information Retrieval: CHAPTER 14: RANKING ALGORITHMS SPARCK JONES. Improving Subject Retrieval in Online Catalogues. "Experiments in Relevance Weighting of Search Terms. Information Science. "File Organization in Library Automation and Information Retrieval. K. 133-44. "Precision Weighting -. WILLETT. C. P. "Search Term Relevance Weighting Given Little Relevance Information. J. 15." Information Processing and Management. SALTON.ooks_Algorithms_Collection2ed/books/book5/chap14. J. BAWDEN. WADE.." J. and D. S. 1981. SPARCK JONES. 1987.. S. 1989. "SIBRIS: the Sandwich Interactive Browsing and Ranking Information System." J. and R. 249-60. Information Retrieval Experiment. SPARCK JONES. 76-88. "Intelligent Information Retrieval Using Rough Set Approximations. 32(4). T. London: Butterworths. SRINIVASAN.
CHAPTER 15: EXTENDED BOOLEAN MODELS

E. Fox, S. Betrabet, M. Koushik
Department of Computer Science, Virginia Tech (Virginia Polytechnic Institute & State University), Blacksburg, VA 24061-0106

W. Lee
Systems Analyst, Acumenics Research and Technology, Suite 580, 9990 Lee Highway, Fairfax, VA 22030

Abstract

The classical interpretation of the Boolean operators in an information retrieval system is in general too strict. A standard Boolean query rarely comes close to retrieving all and only those documents which are relevant to a query. Many models have been proposed with the aim of softening the interpretation of the Boolean operators in order to improve the precision and recall of the search results. This chapter discusses three such models: the Mixed Min and Max (MMM), the Paice, and the P-norm models. The MMM and Paice models are essentially variations of the classical fuzzy-set model, while the P-norm scheme is a distance-based approach. Our experimental results indicate that each of these models provides better performance than the classical Boolean model in terms of retrieval effectiveness.

15.1 INTRODUCTION

In the standard Boolean retrieval model, each document is associated with a set of keywords or index terms, and each query is in the form of a Boolean expression. The Boolean expression consists of a set of index terms connected by the Boolean operators and, or, and not. The documents retrieved for a given query are those that contain index terms in the combination specified by the query. As can be seen from the discussion in Chapter 10, the Boolean model is popular in operational environments because it is easy to implement and is very efficient in terms of the time required to process a query. Boolean retrieval systems are also capable of giving high performance in terms of recall and precision if the query is well formulated. However, the model has the following limitations, as discussed by a number of authors: Bookstein (1985), Cooper (1988), Cater & Kraft (1987), Paice (1984), Salton, Fox & Wu (1983), and Wong et al. (1988).

The Boolean model gives counterintuitive results for certain types of queries. Consider, for example, a query of the form A and B and C and D and E. A document indexed by all but one of the above terms will not be retrieved in response to this query. Intuitively, it appears that the user would be interested in such a document, and that it should be retrieved. Similarly, for a query of the form A or B or C or D or E, a document indexed by any of these terms is considered just as important as a document indexed by some or all of them. This limitation of the Boolean model is due to its strict interpretation of the Boolean operators. Our experimental studies have shown that such a strict interpretation is not compatible with the user's interpretation; results of retrieval based on a strict interpretation have lower precision than results from a P-norm interpretation, as shown by Salton, Fox & Wu (1983).

The Boolean model has no provision for assigning importance factors or weights to query terms, to indicate that the presence or absence of a particular query term is more important than that of another term. Yet searchers often can rate or rank index terms in queries based on how indicative they are of desired content. It would be useful to allow weights to be assigned to (some) query terms.

During the indexing process for the Boolean model, it is necessary to decide whether a particular document is either relevant or nonrelevant with respect to a given index term. Hence, there is no provision for capturing the uncertainty that is present in making indexing decisions.

In addition, the standard Boolean model has no provision for ranking documents. Ranking the documents in the order of decreasing relevance allows the user to see the most relevant document first; the user would be able to sequentially scan the documents, and stop at a certain point if he/she finds that many of the documents are no longer relevant to the query.

Many retrieval models have been proposed as alternatives to the Boolean model. Also, systems combining features of both the Boolean and the vector models have been built to allow for ranking the result of a Boolean query, for example the SIRE system (Noreault, Koll & McGill 1977). However, to avoid the problems listed above, we need to soften the Boolean operators in the Boolean model. We can do this by making the and query behave a bit like the or query and the or query behave somewhat like the and query. This chapter discusses three such models:

MMM
Paice
P-norm

These models avoid the strict interpretation of the Boolean operators. They attempt to take into account the uncertainty involved in the indexing process, and attempt to provide a ranking of the retrieved documents in order of decreasing relevance to the query. Experimental results have shown that these models can give an improvement of more than 100 percent in precision, at fixed recall levels, over the standard Boolean model, as reported in Lee (1988). In addition, the P-norm model has the ability to consider weighted query terms. We begin with a description of each model, and then discuss the data structures and algorithms that implement them. A few examples of the similarity computation in each model are also provided. We conclude with a short discussion of the relative merits of these models.

15.2 EXTENDED BOOLEAN MODELS

In the three models discussed below, a document has a weight associated with each index term. This document
weight is a measure of the degree to which the document is characterized by that term. Assigning weights to index terms adds information during the indexing process, and accounts for the uncertainties that are present in choosing them. Without loss of generality, we assume that document weights for all index terms lie in the range [0, 1]. This is less restrictive than in the standard Boolean model, which limits the values to the extremes of the range, namely 0 and 1. To retrieve documents relevant to a given query, we need to calculate the query-document similarity for documents in the collection. The query-document similarity is an attempt to predict the relevance of a document to the query. In the following subsections, we consider each model and its method for calculating similarity.

15.2.1 The MMM Model

The Mixed Min and Max (MMM) model developed by Fox and Sharat (1986) is based on the concept of fuzzy sets proposed by Zadeh (1965). In fuzzy-set theory, an element has a varying degree of membership, say dA, to a given set A, instead of the traditional membership choice (is an element/is not an element). In the MMM model, each index term has a fuzzy set associated with it. The document weight of a document with respect to an index term A is considered to be the degree of membership of the document in the fuzzy set associated with A. The degrees of membership for union and intersection are defined as follows in fuzzy-set theory:

dA∪B = max(dA, dB)
dA∩B = min(dA, dB)

According to fuzzy-set theory, documents that should be retrieved for a query of the form A or B should be in the fuzzy set associated with the union of the two sets A and B. Similarly, the documents that should be retrieved for a query of the form A and B should be in the fuzzy set associated with the intersection of the two sets. Hence, it is possible to define the similarity of a document to the or query to be max(dA, dB), and the similarity of the document to the and query to be min(dA, dB). The MMM model attempts to soften the Boolean operators by considering the query-document similarity to be a linear combination of the min and max document weights. Thus, given a document D with index-term weights dA1, dA2, . . ., dAn for terms A1, A2, . . ., An, and the queries Qor = (A1 or A2 or . . . or An) and Qand = (A1 and A2 and . . . and An), the query-document similarity in the MMM model is computed as follows:

SIM(Qor, D) = Cor1 * max(dA1, dA2, . . ., dAn) + Cor2 * min(dA1, dA2, . . ., dAn)
SIM(Qand, D) = Cand1 * min(dA1, dA2, . . ., dAn) + Cand2 * max(dA1, dA2, . . ., dAn)

where Cor1, Cor2 are "softness" coefficients for the or operator, and Cand1, Cand2 are softness coefficients for the and operator.
Since we would like to give the maximum of the document weights more importance while considering an or query, and the minimum more importance while considering an and query, generally we have Cor1 > Cor2 and Cand1 > Cand2. For simplicity it is generally assumed that Cor1 = 1 - Cor2 and Cand1 = 1 - Cand2. Our experiments (Lee & Fox 1988) indicate that the best performance usually occurs with Cand1 in the range [0.5, 0.8] and with Cor1 > 0.2. In general, the computational cost is low, and retrieval effectiveness is good, much better than with the standard Boolean model, as can be seen from Table 15.1 in section 15.4 below.

15.2.2 The Paice Model

The model proposed by Paice (1984) is also based on fuzzy-set theory. It is similar to the MMM model in that it assumes that there is a fuzzy set associated with each index term, and the document weight of a document with respect to an index term represents the degree of membership of the document with respect to the fuzzy set associated with that index term. However, while the MMM model considers only the maximum and the minimum document weights for the index terms while calculating the similarity, the Paice model takes into account all of the document weights. Thus, given a document D with index-term weights dA1, dA2, . . ., dAn for terms A1, A2, . . ., An, and the queries Qor = (A1 or A2 or . . . or An) and Qand = (A1 and A2 and . . . and An), the query-document similarity in the Paice model is computed as follows:

SIM(Q, D) = (d1 + r*d2 + r^2*d3 + . . . + r^(n-1)*dn) / (1 + r + r^2 + . . . + r^(n-1))

where the di's are considered in descending order for or queries and in ascending order for and queries, and r is a constant coefficient whose value depends on whether an and clause or an or clause is being considered. The experiments conducted by Lee & Fox (1988) have shown that setting the value of r to 1.0 for and queries and to 0.7 for or queries gives good retrieval effectiveness. Note that when n = 2, the Paice model behaves like the MMM model.

The computational cost for this model is higher than that for the MMM model. This is because the MMM model only requires the determination of the min or max of a set of term weights each time an and or or clause is considered, which can be done with an O(n) algorithm as described in Aho and Ullman (1974). The Paice model requires the term weights to be sorted in ascending or descending order, depending on whether an and clause or an or clause is being considered. This requires at least an O(n log n) sorting algorithm. A good deal of floating point calculation is needed also.
15.2.3 The P-norm Model

Besides allowing document weights for index terms, the P-norm model also allows query terms to have weights (i.e., ai) which indicate their relative importance. In the P-norm model, a document D with weights dA1, dA2, . . ., dAn with respect to index terms A1, A2, . . ., An is considered to be a point with coordinates (dA1, dA2, . . ., dAn) in an n-dimensional space. Consider an or query of the form A1 or A2 or . . . or An. It is clear that the point having all the n coordinates equal to 0 (indicating that all the index terms are absent) is to be avoided for this query. Hence, it is possible to rank the documents (i.e., from those with highest similarity to those with lowest similarity) in order of decreasing distance from the point (0, 0, . . ., 0) for an or query. Similarly, for an and query of the form A1 and A2 and . . . and An, the point with all the n coordinates equal to 1 (indicating that all the index terms have weight 1) is the most desirable point, so documents are ranked in order of increasing distance from the point (1, 1, . . ., 1) for an and query.

In the P-norm model, the generalized queries are of the form:

Qor,p = (A1, a1) orP (A2, a2) orP . . . orP (An, an)
Qand,p = (A1, a1) andP (A2, a2) andP . . . andP (An, an)

We note that the operators have coefficients P which indicate the degree of strictness of the operator (from 1 for least strict to ∞ for most strict, i.e., the Boolean case). The query-document similarity for the P-norm model is calculated, with proper normalization as described by Fox (1983), using the following formulae:

SIM(Qor, D) = [(a1^p * dA1^p + a2^p * dA2^p + . . . + an^p * dAn^p) / (a1^p + a2^p + . . . + an^p)]^(1/p)
SIM(Qand, D) = 1 - [(a1^p * (1 - dA1)^p + a2^p * (1 - dA2)^p + . . . + an^p * (1 - dAn)^p) / (a1^p + a2^p + . . . + an^p)]^(1/p)

Numerous experiments have shown the P-norm model to be very effective, as reported in Salton, Fox & Wu (1983), Fox (1983), Fox & Sharat (1986), and Lee & Fox (1988). However, when the P-value is greater than one, the computational expense can be high because of the need for expensive exponentiation computations. Sophisticated users may wish to assign the P-values and query weights. Extensive experimentation demonstrated that system-assigned P-values (e.g., p = 2 throughout a query) can give good results, and that uniform query weighting of 1, or weighting based on inverse document frequency, will lead to effective retrieval, as described by Fox (1983).
15.3 IMPLEMENTATION

As in any other information retrieval system, we need to convert the given set of documents and queries into a standard format. We first discuss the data structures used in our implementation, and then follow up with a discussion of the actual algorithms.

15.3.1 Data Structures

There are several important data structures for both the document and query representations.
Document Structure

Given a set of documents, the goal is to index them, that is, find the index terms in the collection, and assign document weights for each document with respect to each index term. To calculate the document weight for an index term, a weighting technique such as normalized (tf * idf)ik can be used. A more detailed discussion of weighting techniques is provided in Chapter 10. The data structure for a document consists of an array doc_wts that stores the document weights as shown below. The element doc_wts[i] is the document weight for the ith index term.

typedef struct {
    float doc_wts[NUM_ITERMS];    /* document weight vector */
} DOC_STRUCT;

Query Structure

We assume that each query has an id associated with it, and is in the form of a tree with the operators as internal nodes and index terms as the leaves (see Figure 15.1). The data structure used to store a query is as follows:

typedef struct {
    long        query_id;          /* query id */
    TREE_NODE   *beg_node_array;   /* array of query nodes */
    ITERM_TUPLE *beg_iterm_array;  /* array of index terms */
    OP_TUPLE    *beg_op_array;     /* array of operators */
} QUERY_STRUCT;

We see from the query structure that there are three arrays associated with each query (see Figure 15.2): an array of the nodes in the query tree, an array of index terms, and an array of operators.

Figure 15.1: Structure of query (A or B or C) and (D or E)

Figure 15.2: Structure of QUERY_STRUCT

In the query structure we store the address of the first element of each array. In our implementation we assume that the first element of the node array is the root of the query tree. The tree-node structure is as follows:

typedef struct {
    int   iterm_op_index;   /* index of index-term/operator for leaf/non-leaf node resp. */
    short child_index;      /* index of the leftmost child of the current node */
    short sibling_index;    /* index of the right sibling of the current node */
} TREE_NODE;                /* structure of a node in the query tree */

In this structure, child_index is the index of the leftmost child of the current node within the array of nodes. Similarly, sibling_index is the index of the next node to the right of the current node within the array of nodes. If a node does not have any children / siblings to its right, the value of its child_index / sibling_index will be equal to UNDEF (-1). The iterm_op_index in the TREE_NODE structure is the index number of the index-term/operator associated with the current node within the array of index-terms/operators. It is possible to determine whether the current node is an operator node or an index-term node by examining the value of child_index (see Figure 15.3). Since index terms can only be leaves of the tree, they cannot have any children, whereas operators will have at least one child. Hence, if child_index has the value UNDEF, then the current node is an index-term node; otherwise the node is an operator node. The data structure that stores information about an index term is as follows:

typedef struct {
    long  iterm_num;   /* unique ID given to index-term */
    float iterm_wt;    /* weight of the index-term in the query */
} ITERM_TUPLE;         /* structure of an index term */

Figure 15.3: Data structures generated for query ((A or B or C) and (D or E))

Here, iterm_num is the unique number given to an index term. Note that iterm_wt, which is the weight of the index term in the query, is used only for P-norm queries. The data structure for an operator is as follows:

typedef struct {
    int   op_type;     /* type of operator - AND_OP / OR_OP / NOT_OP */
    float op_coeff;    /* coefficient of the operator:
                          = 0.0 for NOT_OP,
                          = Cor1 / Cand1 for MMM model,
                          = r value for Paice model,
                          = P value for P-norm model */
    float op_wt;       /* weight of the operator */
} OP_TUPLE;            /* structure of an operator */

If the MMM model is being used, then op_coeff is the Cor1 / Cand1 value. (Note: We assume that Cor2 = 1 - Cor1 and Cand2 = 1 - Cand1.) If on the other hand the Paice model is being used, then op_coeff is the r value. If the P-norm model is being used, then op_coeff is the P-value.
15.3.2 Algorithms and Examples

Given a query and a set of documents in the standard form, we use the following algorithm to retrieve and rank documents:

For each document
    Find similarity.
Sort similarities.
Display document / similarity data in descending order.

In order to calculate the query-document similarity, we start from the root of the query tree. If it is an operator, then we recursively calculate the similarity for each child of that node. If the child is an index term, then the similarity is the document weight for that index term. Once the similarity value is determined for a child, further processing is different for the various models and is described below. (Refer to the appendix for the C program.)

For the MMM model: We update the values of two variables, maximum and minimum (both of type double), where maximum = max(dA1, dA2, . . ., dAn) and minimum = min(dA1, dA2, . . ., dAn).

For the Paice model: We store the similarity in an array (of type double), child_sims, which will contain the similarities of the children of the current node.

For the P-norm model: We update the values of two variables, numerator and denominator (both of type double).

Once all the children of the node have been considered, we use the appropriate formula to find the actual similarity.

For the MMM model, we can calculate the similarity for the node by using the following formula:

SIM(Q, D) = Cop * maximum + (1 - Cop) * minimum

For example, if the document weights for the index terms A, B, and C in a query Q of the form A or B or C for a particular document D are 0.5, 0.8, and 0.6 respectively, then the maximum of the three is 0.8, and the minimum 0.5. If the coefficient of the or operator is 0.7, then, using the above formula, the MMM similarity between the given document and query is

SIM(Q, D) = 0.7 * 0.8 + (1 - 0.7) * 0.5 = 0.71

If the Paice similarity is being calculated, we begin by sorting the array child_sims in descending order for an OR node, and in ascending order for an AND node. For the query Q given above, since the operator is OR_OP we sort the similarities in descending order. We will therefore have the similarities in the order: 0.8, 0.6, 0.5. We then use the following formula to calculate the Paice query-document similarity, with the assumption that r = 0.7:

SIM(Q, D) = (0.8 + 0.7 * 0.6 + 0.7^2 * 0.5) / (1 + 0.7 + 0.7^2) = 1.465/2.19 = 0.6689

If the P-norm similarity is being calculated, then we use the values of the numerator and denominator in the following formulae to calculate the P-norm query-document similarity:

SIM(Qor, D) = (numerator/denominator)^(1/p_value)
SIM(Qand, D) = 1 - (numerator/denominator)^(1/p_value)

If for the query Q given above, the query term weights are 0.5, 0.5, and 0.5, and the P or operator has the coefficient value 2, then the values of the numerator and denominator after considering the three index terms will be

Index   Term   Numerator                           Denominator
1       A      0 + 0.5^2 * 0.5^2 = 0.0625          0 + 0.5^2 = 0.25
2       B      0.0625 + 0.5^2 * 0.8^2 = 0.2225     0.25 + 0.5^2 = 0.50
3       C      0.2225 + 0.5^2 * 0.6^2 = 0.3125     0.50 + 0.5^2 = 0.75

Using the final values of the numerator and denominator and the formula given above, the query-document similarity SIM(Qor, D) will be

SIM(Qor, D) = (0.3125/0.75)^(1/2) = 0.6455

The C programs that implement these algorithms are given in the appendix. The routine calc_sim() is called with the array containing the document weights (of type float) as the first parameter, the query vector (of type QUERY_STRUCT) as the second parameter, and a third argument sim_type, which determines the type of similarity being computed, that is, MMM, Paice, or P-norm. Based on the value of the third parameter, one of the following sets of routines is called for further computations:

mmm_sim( ): MMM model
paice_sim( ): Paice model
update_numer_denom( ) & p_norm_sim( ): P-norm model

15.4 CONCLUSION

Numerous experiments performed with the extended Boolean models have shown that they give much better performance than the standard Boolean model, as reported by Lee & Fox (1988). Table 15.1 gives the percentage improvement in the average precision that these models showed over the standard Boolean model for three test collections: the CISI, the CACM, and the INSPEC collections. As can be seen in the table, all three models show substantial improvement over the standard Boolean model. In fact, for the CACM and the INSPEC collections, all three models show more than 100 percent improvement.
Table 15.1: Percentage Improvements Over Standard Boolean Model

            CISI                     CACM                     INSPEC
            -----------------------  -----------------------  -----------------------
Scheme      Best Average   Rank      Best Average   Rank      Best Average   Rank
            Precision                Precision                Precision
----------------------------------------------------------------------------------
P-norm      79             1         106            2         210            1
Paice       77             2         104            3         206            2
MMM         68             3         109            1         195            3

The P-norm model is computationally expensive because of the number of exponentiation operations that it requires. The Paice model requires one to sort the similarities of the children; this adds to its computational overhead. The MMM model clearly has the least computational overhead of the three models. The standard Boolean model is still the most efficient model, but by using any one of the extended Boolean models, it is possible to get much better performance. Based on experimental studies, the MMM model appears to be the least expensive, while the P-norm model gives the best results in terms of average precision.

REFERENCES

AHO, A., and J. ULLMAN. 1974. The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley.

BOOKSTEIN, A. 1985. "Probability and Fuzzy-Set Applications to Information Retrieval," in M. Williams, ed., Annual Review of Information Science and Technology (ARIST), 20, 117-51.

CATER, S., and D. KRAFT. 1987. "TIRS: A Topological Information Retrieval System Satisfying the Requirements of the Waller-Kraft Wish List." Presented at the 10th Annual Int'l ACM-SIGIR Conference on R&D in Information Retrieval, June.

COOPER, W. 1988. "Getting Beyond Boole." Information Processing & Management, 24(3), 243-48.

FOX, E. 1983. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Ph.D. Thesis, Cornell University, August.

FOX, E., and S. SHARAT. 1986. "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval." Technical Report TR-86-1, Department of Computer Science, Virginia Tech.

LEE, W. 1988. Experimental Comparison of Schemes for Interpreting Boolean Queries. Virginia Tech M.S. Thesis, Technical Report TR-88-27, Department of Computer Science.

NOREAULT, T., M. KOLL, and M. MCGILL. 1977. "Automatic Ranked Output from Boolean Searches in SIRE." J. American Society for Information Science, 28(6), 333-39.

PAICE, C. 1984. "Soft Evaluation of Boolean Search Queries in Information Retrieval Systems." Information Technology: Res. Dev. Applications, 3(1), 33-42.

SALTON, G., E. FOX, and H. WU. 1983. "Extended Boolean Information Retrieval." Communications of the ACM, 26(12), 1022-36.

WONG, S. K. M., W. ZIARKO, V. RAGHAVAN, and P. WONG. 1988. "Extended Boolean Query Processing in the Generalized Vector Space Model." Information Systems, 14(1), 47-63.

ZADEH, L. 1965. "Fuzzy Sets." Information and Control, 8, 338-53.

APPENDIX

/**
 ** Compute query-document similarity using MMM, Paice and P-norm
 ** formulas
 **/

#include <stdio.h>
#include <math.h>

#define UNDEF      -1
#define NOT_OP     -2
#define AND_OP     -3
#define OR_OP      -4

#define MMM_SIM     1
#define PAICE_SIM   2
#define P_NORM_SIM  3

#define P_INFINITY  0.0
typedef struct {
    long  iterm_num;          /* Unique ID given to the index-term */
    float iterm_wt;           /* Weight of index-term (P-norm queries only) */
} ITERM_TUPLE;                /* Structure of an index-term in a query */

typedef struct {
    int   op_type;            /* Operator type - AND_OP, OR_OP, NOT_OP */
    float op_coeff;           /* Coefficient of the operator:
                                 = 0.0 for NOT_OP,
                                 = C_or1 / C_and1 for MMM model,
                                 = r value for Paice model,
                                 = P value for P-norm model */
    float op_wt;              /* Wt. of operator (P-norm queries only) */
} OP_TUPLE;                   /* Structure of an operator in a query */

typedef struct {
    long  iterm_op_index;     /* If child_index = UNDEF then index-term
                                 index, else operator index */
    short child_index;        /* Current node's left child index */
    short sibling_index;      /* Current node's right sibling index */
} TREE_NODE;                  /* Structure of a node in the query tree */

typedef struct {
    long        query_id;         /* Query id */
    TREE_NODE   *beg_node_array;  /* Array of query nodes */
    ITERM_TUPLE *beg_iterm_array; /* Array of index terms */
    OP_TUPLE    *beg_op_array;    /* Array of operators */
} QUERY_STRUCT;                   /* Structure of a query */

#define NUM_ITERMS 1000       /* Max. no. of index-terms allowed in a document */

/***************************************************************************
    double calc_sim ( doc_wts, query_vec, sim_type )

    Returns:  The MMM / Paice / P-norm similarity between the given
              document and query.

    Purpose:  To compute:
                  MMM_SIM (Doc, Query)    if sim_type = MMM_SIM,
                  PAICE_SIM (Doc, Query)  if sim_type = PAICE_SIM, or
                  P_NORM_SIM (Doc, Query) if sim_type = P_NORM_SIM.

    Plan:     Call the routine to calculate the similarity by passing to
              it the index of the root of the query tree (= zero).
***************************************************************************/
double calc_sim ( doc_wts, query_vec, sim_type )
float doc_wts[NUM_ITERMS];           /* In: doc. weights vector */
register QUERY_STRUCT *query_vec;    /* In: Given query vector */
short sim_type;                      /* In: Similarity type flag */
{
    double calc_tree_sim ();

    if ( ( sim_type != MMM_SIM ) && ( sim_type != PAICE_SIM ) &&
         ( sim_type != P_NORM_SIM ) )
        fprintf ( stderr, "calc_sim: illegal similarity type %d, query %ld\n",
                  sim_type, query_vec->query_id );

    return ( calc_tree_sim ( doc_wts, query_vec, 0, sim_type ) );
}

/***************************************************************************
    static double calc_tree_sim ( doc_wts, query_vec, root_index, sim_type )

    Returns:  The similarity between the given document and the subtree
              of the query tree with its root at ROOT_INDEX.

    Purpose:  To compute SIM (Doc, Query-subtree).

    Plan:     If the root of the subtree is an index-term, then the
              similarity is the document wt. for that index-term;
              else, if it is an operator, then:
                  if the operator == NOT_OP then
                      (1) Calculate the similarity of the child of
/***************************************************************************
    static double calc_tree_sim ( doc_wts, query_vec, root_index, sim_type )

    Returns: The similarity between the given document and the subtree
        of the query tree with its root at ROOT_INDEX.
    Purpose: To compute: SIM (Doc, Query-subtree)
    Plan: If the root of the subtree is an index-term, then the
        similarity is the document wt. for that index-term; else, if it
        is an operator, then:
        if the operator == NOT_OP then
            (1) Calculate the similarity of the child of the current
                node (CHILD_VALUE)
            (2) Return (1.0 - CHILD_VALUE)
        else if the operator == AND_OP or OR_OP then
            (1) Calculate the similarity considering each child of the
                current node
            (2) If (SIM_TYPE == MMM_SIM) then
                    find the maximum and the minimum of these
                    similarities;
                else if (SIM_TYPE == PAICE_SIM) then
                    store this similarity in the array child_sims;
                else if (SIM_TYPE == P_NORM_SIM) then
                    find the numerator and denominator to be used in
                    the formula;
                else print "invalid sim_type".
            (3) Use the appropriate similarity computation formula to
                compute the similarity for the current node.
***************************************************************************/
static double calc_tree_sim ( doc_wts, query_vec, root_index, sim_type )
float doc_wts [NUM_ITERMS];       /* In: Document weights */
register QUERY_STRUCT *query_vec; /* In: Query vector */
short root_index;                 /* In: Index of the root of the
                                     subtree to be evaluated */
short sim_type;                   /* In: Similarity type flag */
{
    register TREE_NODE *root_ptr;    /* Addr. of the root of the
                                        subtree being considered */
    register OP_TUPLE *op_ptr;       /* Addr. of current operator */
    register ITERM_TUPLE *iterm_ptr; /* Addr. of current index-term */
    long doc_index;                  /* Index of the weight of the current
                                        index-term within DOC_WTS array */
    register TREE_NODE *child_ptr;   /* Addr. of a child of the root */
    short child_index;               /* Index of the child of the root
                                        within the set of tree nodes */
    double child_value;              /* Sim. of child's subtree */

    /* Declarations for MMM model */
    double maximum;                  /* Maximum (children's sim.) */
    double minimum;                  /* Minimum (children's sim.) */
    double mmm_sim ();               /* Func. computing MMM sim. */

    /* Declarations for Paice model */
    double child_sims[NUM_ITERMS];   /* Array to store sim. of all the
                                        children of the root */
    int curr_i;                      /* Number of sim. stored in
                                        child_sims */
    double paice_sim ();             /* Func. computing Paice sim. */

    /* Declarations for P-norm model */
    double child_wt;                 /* Wt. of the child */
    double numerator;                /* Numerator in sim. formula */
    double denominator;              /* Denominator in sim. formula */
    void update_numer_denom();       /* Func. updating value of
                                        numerator & denominator */
    double p_norm_sim ();            /* Func. computing P-norm sim. */

    double calc_tree_sim();          /* Recursive call for each subtree */

    /* Addr. of first node of the tree + index of the current node
       = the addr. of the root of the subtree under consideration */
    root_ptr = query_vec->beg_node_array + root_index;

    if (UNDEF == root_ptr->child_index)
    {   /* If node is an index-term, then return its doc-wt.  Index in
           the doc_wts array is the ITERM_NUM of current index-term */
        iterm_ptr = query_vec->beg_iterm_array + root_ptr->iterm_op_index;
        doc_index = iterm_ptr->iterm_num;
        return ( (double) doc_wts [doc_index] );
    }
    else
    {   /* If current node is an operator, then compute its sim. */
        op_ptr = query_vec->beg_op_array + root_ptr->iterm_op_index;
        if ( ( op_ptr->op_type != NOT_OP ) && ( op_ptr->op_type != OR_OP ) &&
             ( op_ptr->op_type != AND_OP ) )
        {   /* if neither NOT, OR nor AND */
            fprintf ( stderr,
                "calc_tree_sim: illegal operator type %d, query %ld.\n",
                op_ptr->op_type, query_vec->query_id);
            return ( (double) UNDEF );
        }

        switch (op_ptr->op_type)
        {
        case (NOT_OP):  /* If the operator is NOT_OP */
            if (UNDEF != (query_vec->beg_node_array +
                          root_ptr->child_index)->sibling_index)
            {   /* NOT_OP operator can have only one child */
                fprintf(stderr,
                    "calc_tree_sim: NOT operator has more than one child.\n");
                return ( (double) UNDEF);
            }
            /* if only child, return (1.0 - similarity of child) */
            if ((double) UNDEF == (child_value = calc_tree_sim ( doc_wts,
                    query_vec, root_ptr->child_index, sim_type)))
                return ( (double) UNDEF );
            return ( 1.0 - child_value );

        case (OR_OP):
        case (AND_OP):  /* If the operator is OR_OP or AND_OP */
            maximum = -99999.0;   /* Init. for MMM model */
            minimum = 99999.0;
            curr_i = -1;          /* Init. for Paice model */
            numerator = 0.0;      /* Init. for P-norm model */
            denominator = 0.0;

            /* Start with the first child of the current node;
             * consider each of its siblings, until none left */
            for (child_index = root_ptr->child_index;
                 UNDEF != child_index;
                 child_index = (query_vec->beg_node_array +
                                child_index)->sibling_index )
            {
                if ( (double)UNDEF == (child_value = calc_tree_sim (
                        doc_wts, query_vec, child_index, sim_type)))
                    return ((double) UNDEF);

                switch (sim_type)
                {
                case (MMM_SIM):   /* update max and min */
                    maximum = (child_value > maximum) ? child_value : maximum;
                    minimum = (child_value < minimum) ? child_value : minimum;
                    break;
                case (PAICE_SIM):
                    curr_i = curr_i + 1;
                    child_sims [curr_i] = child_value;
                    break;
                case (P_NORM_SIM):
                    /* Find the wt. of the current child */
                    child_ptr = query_vec->beg_node_array + child_index;
                    if (UNDEF == child_ptr->child_index)
                        child_wt = (query_vec->beg_iterm_array +
                                    child_ptr->iterm_op_index)->iterm_wt;
                    else
                        child_wt = (query_vec->beg_op_array +
                                    child_ptr->iterm_op_index)->op_wt;
                    update_numer_denom ( child_value, child_wt,
                        op_ptr->op_coeff, op_ptr->op_type,
                        &numerator, &denominator );
                    break;
                }  /* switch - sim_type */
            }  /* for */

            /* After considering all the children, compute the
             * sim. of current subtree with appropriate formula */
            if ( sim_type == MMM_SIM )
                return ( mmm_sim ( op_ptr->op_coeff, op_ptr->op_type,
                                   maximum, minimum));
            else if ( sim_type == PAICE_SIM )
                return ( paice_sim (op_ptr->op_coeff, op_ptr->op_type,
                                    child_sims, curr_i ));
            else if ( sim_type == P_NORM_SIM )
                return ( p_norm_sim (op_ptr->op_coeff, op_ptr->op_type,
                                     numerator, denominator ) );
        }  /* switch - op_ptr->op_type */
    }  /* else */
}

/***************************************************************************
    double mmm_sim (op_coeff, op_type, maximum, minimum)

    Returns: The MMM similarity
    Purpose: To calculate the MMM similarity using MAXIMUM and MINIMUM:
        SIM(Qor, D)  = coeff * MAXIMUM + (1-coeff) * MINIMUM
        SIM(Qand, D) = coeff * MINIMUM + (1-coeff) * MAXIMUM
    Plan: Depending on the type of the operator use the appropriate
        formula
***************************************************************************/
double mmm_sim ( op_coeff, op_type, maximum, minimum )
register float op_coeff;  /* In: Value of the coefficient */
int op_type;              /* In: Type of operator */
double maximum;           /* In: Maximum of the similarities */
double minimum;           /* In: Minimum of the similarities */
{
    if ( op_type == OR_OP )
        return ( (op_coeff * maximum + (1 - op_coeff) * minimum));
    else if ( op_type == AND_OP )
        return ( (op_coeff * minimum + (1 - op_coeff) * maximum));
    return ( (double) UNDEF );  /* op_type is checked by the caller */
}
/***************************************************************************
    double paice_sim (op_coeff, op_type, child_sims, num_i)

    Returns: The Paice similarity
    Purpose: To calculate the Paice similarity using the Paice
        similarity computation formula
    Plan: Sort the array in ascending order.
        If the operator is OR_OP then:
            numerator   = SUM (child_sims[i] * op_coeff^(num_i - i)),
                          for i = 0 to num_i
            denominator = SUM (op_coeff^(num_i - i)), for i = 0 to num_i
        else if the operator is AND_OP then:
            numerator   = SUM (child_sims[i] * op_coeff^(i)),
                          for i = 0 to num_i
            denominator = SUM (op_coeff^(i)), for i = 0 to num_i
***************************************************************************/
double paice_sim ( op_coeff, op_type, child_sims, num_i )
register float op_coeff;        /* In: Coefficient r */
int op_type;                    /* In: Type of operator */
double child_sims [NUM_ITERMS]; /* In: Array containing sim. */
int num_i;                      /* In: Index of the last element
                                   in child_sims */
{
    int i;
    double numerator, denominator;
    double power;
    double pow ();
    void qsort ();              /* Quick Sort func. in C lib. */
    int comp_double ();         /* Func. used by qsort */

    /* num_i is the index of the last element, so there are
       num_i + 1 similarities to sort */
    qsort ( (char *) child_sims, num_i + 1, sizeof (double), comp_double );

    numerator = 0.0;
    denominator = 0.0;

    if ( op_type == OR_OP )
    {
        for (i = 0; i <= num_i; i++)
        {
            power = pow ((double) op_coeff, (double) (num_i - i));
            denominator = denominator + power;
            numerator = numerator + power * child_sims[i];
        }
        return ( numerator / denominator );
    }
    else if (op_type == AND_OP)
    {
        for (i = 0; i <= num_i; i++)
        {
            power = pow ((double) op_coeff, (double) i);
            denominator = denominator + power;
            numerator = numerator + power * child_sims[i];
        }
        return ( numerator / denominator );
    }
    return ( (double) UNDEF );  /* op_type is checked by the caller */
}
/***************************************************************************
    int comp_double ( d1, d2 )

    Returns: int
    Purpose: Compares d1 and d2 and returns -1, 0, or 1.
***************************************************************************/
int comp_double (d1, d2)
double *d1, *d2;
{
    if (*d1 < *d2)
        return (-1);
    else if (*d1 == *d2)
        return (0);
    else
        return (1);
}

/***************************************************************************
    void update_numer_denom (value, weight, p_value, op_type, numerator,
                             denominator)

    Returns: Void
    Purpose: Update the values of NUMERATOR and DENOMINATOR
    Plan: If P_VALUE == P_INFINITY then
              if VALUE > NUMERATOR then NUMERATOR = VALUE
              DENOMINATOR = 1
          else if OP_TYPE == OR_OP then
              NUMERATOR   = NUMERATOR + weight^P_value * value^P_value
              DENOMINATOR = DENOMINATOR + weight^P_value
          else if OP_TYPE == AND_OP then
              NUMERATOR   = NUMERATOR + weight^P_value * (1 - value)^P_value
              DENOMINATOR = DENOMINATOR + weight^P_value
***************************************************************************/
void update_numer_denom ( value, weight, p_value, op_type,
                          numerator, denominator )
double value;                 /* In: The sim. of the child */
double weight;                /* In: The query weight */
register float p_value;       /* In: The p_value */
int op_type;                  /* In: The type of the operator */
register double *numerator;   /* Out: Numerator in p-norm sim.
                                 calculation formula */
register double *denominator; /* Out: Denominator in p-norm sim.
                                 calculation formula */
{
    double power;
    double pow ();

    if (p_value == P_INFINITY)
    {
        if (value > *numerator)
            *numerator = value;
        *denominator = 1;
    }
    else
        switch (op_type)
        {
        case (OR_OP):
            power = pow (weight, p_value);
            *numerator = *numerator + power * pow (value, p_value);
            *denominator = *denominator + power;
            break;
        case (AND_OP):
            power = pow (weight, p_value);
            *numerator = *numerator + power * pow (1 - value, p_value);
            *denominator = *denominator + power;
            break;
        }
}
/***************************************************************************
    double p_norm_sim (p_value, op_type, numerator, denominator)

    Returns: The P-norm similarity
    Purpose: To calculate the P-norm similarity using the P-norm
        similarity computation formula
    Plan: Depending on the type of the operator use the appropriate
        formula:
        SIM(Q(P_INFINITY), D) = NUMERATOR
        SIM(Qor_P, D)  = (NUMERATOR / DENOMINATOR) ^ (1/P)
        SIM(Qand_P, D) = 1 - (NUMERATOR / DENOMINATOR) ^ (1/P)
***************************************************************************/
double p_norm_sim ( p_value, op_type, numerator, denominator )
register float p_value;  /* In: P value */
int op_type;             /* In: Type of operator */
double numerator;        /* In: Numerator */
double denominator;      /* In: Denominator */
{
    double pow ();

    if (p_value == P_INFINITY)
        return (numerator);
    else if (op_type == OR_OP)
        return (pow (numerator / denominator, 1 / p_value) );
    else if (op_type == AND_OP)
        return (1 - pow (numerator / denominator, 1 / p_value) );
    return ( (double) UNDEF );  /* op_type is checked by the caller */
}
Information Retrieval: CHAPTER 16: CLUSTERING ALGORITHMS

CHAPTER 16: CLUSTERING ALGORITHMS

Edie Rasmussen, University of Pittsburgh

Abstract

Cluster analysis is a technique for multivariate analysis that assigns items to automatically created groups based on a calculation of the degree of association between items and groups. In the information retrieval (IR) field, cluster analysis has been used to create groups of documents with the goal of improving the efficiency and effectiveness of retrieval, or to determine the structure of the literature of a field. The terms in a document collection can also be clustered to show their relationships. The two main types of cluster analysis methods are the nonhierarchical, which divide a data set of N items into M clusters, and the hierarchical, which produce a nested data set in which pairs of items or clusters are successively linked. The nonhierarchical methods such as the single pass and reallocation methods are heuristic in nature and require less computation than the hierarchical methods. However, the hierarchical methods have usually been preferred for cluster-based document retrieval. The commonly used hierarchical methods, such as single link, complete link, group average link, and Ward's method, have high space and time requirements. In order to cluster the large data sets with high dimensionality that are typically found in IR applications, good algorithms (ideally O(N2) time, O(N) space) must be found. Examples are the SLINK and minimal spanning tree algorithms for the single link method, the Voorhees algorithm for group average link, and the reciprocal nearest neighbor algorithm for Ward's method.

16.1 CLUSTER ANALYSIS

16.1.1 Introduction

Cluster analysis is a statistical technique used to generate a category structure which fits a set of observations. The groups which are formed should have a high degree of association between members of the same group and a low degree between members of different groups (Anderberg 1973). While cluster analysis is sometimes referred to as automatic classification, this is not strictly accurate since the classes formed are not known prior to processing, as classification implies, but are defined by the items assigned to them. Because there is no need for the classes to be identified prior to processing, cluster analysis is useful to provide structure in large multivariate data sets. It has been described as a tool of discovery because it has the potential to reveal previously undetected relationships based on complex data (Anderberg 1973).

An early application of cluster analysis was to determine taxonomic relationships among species. Psychiatric profiles, medical and clinical data, census and survey data, images, and chemical structures and properties have all been studied using cluster analytic methods, and there is an extensive and widely scattered journal literature on the subject. Because cluster analysis is a technique for multivariate analysis that has application in many fields, it is supported by a number of software packages which are often available in academic and other computing environments. Most of the methods and some of the algorithms described in this chapter are found in statistical analysis packages such as SAS, SPSSX, and BMDP and cluster analysis packages such as CLUSTAN and CLUSTAR/CLUSTID. Brief descriptions and sources for these and other packages are provided by Romesburg (1984). Basic texts include those by Anderberg (1973), Hartigan (1975), Everitt (1980), Aldenderfer and Blashfield (1984), Romesburg (1984), Spath (1985), Jain and Dubes (1988) and Kaufman (1990). Taxonomic applications have been described by Sneath and Sokal (1973), and social science applications by Lorr (1983) and Hudson and Associates (1982). Comprehensive reviews by Lee (1981), Dubes and Jain (1980) and Gordon (1987) are also recommended.

16.1.2 Applications in Information Retrieval

The ability of cluster analysis to categorize by assigning items to automatically created groups gives it a natural affinity with the aims of information storage and retrieval. Cluster analysis can be performed on documents in several ways:

Documents may be clustered on the basis of the terms that they contain. The aim of this approach has usually been to provide more efficient or more effective retrieval, though it has also been used after retrieval to provide structure to large sets of retrieved documents. In distributed systems, clustering can be used to allocate documents for storage. A recent review (Willett 1988) provides a comprehensive summary of research on term-based document clustering.

Documents may be clustered based on co-occurring citations in order to provide insights into the nature of the literature of a field (e.g., Small and Sweeney [1985]).

Terms may be clustered on the basis of the documents in which they co-occur, in order to aid in the construction of a thesaurus or in the enhancement of queries (e.g., Crouch [1988]).

Although cluster analysis can be easily implemented with available software packages, it is not without problems. These include:

Selecting the attributes on which items are to be clustered and their representation.
Selecting an appropriate clustering method and similarity measure from those available.

Creating the clusters or cluster hierarchies, which can be expensive in terms of computational resources.

Assessing the validity of the result obtained.

If the collection to be clustered is a dynamic one, the requirements for update must be considered.

If the aim is to use the clustered collection as the basis for information retrieval, a method for searching the clusters or cluster hierarchy must be selected.

16.2 MEASURES OF ASSOCIATION

16.2.1 Introduction

In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. This may be a distance measure, or a measure of similarity or dissimilarity. Some clustering methods have a theoretical requirement for use of a specific measure (Euclidean distance for Ward's method, for example), but more commonly the choice of measure is at the discretion of the researcher. While there are a number of similarity measures available, and the choice of similarity measure can have an effect on the clustering results obtained, there have been only a few comparative studies (summarized by Willett [1988]).

In cluster-based retrieval, the determination of interdocument similarity depends on both the document representation, in terms of the weights assigned to the indexing terms characterizing each document, and the similarity coefficient that is chosen. The choice of an appropriate document representation is discussed elsewhere in this text; a summary of research on term-weighting approaches is provided by Salton and Buckley (1988). It will be assumed that the items to be clustered are documents, and each document Di (or cluster representative Ci) is represented by (weighti1, . . ., weightiL), where weightik is the weight assigned to termk in Di or Ci. The following notation will be used: N for the number of items Di in a data set, L for its dimensionality, and M for the number of clusters Ci created.

The results of tests by Willett (1983) of similarity coefficients in cluster-based retrieval suggest that it is important to use a measure that is normalized by the length of the document vectors. The results of tests on weighting schemes were less definitive but suggested that weighting of document terms is not as significant in improving performance in cluster-based retrieval as it is in other types of retrieval.
16.2.2 Similarity Measures

A variety of distance and similarity measures is given by Anderberg (1973), while those most suitable for comparing document vectors are discussed by Salton (1989). The measures described below are commonly used in information retrieval applications. They are appropriate for binary or real-valued weighting schemes. Sneath and Sokal (1973) point out that simple similarity coefficients are often monotonic with more complex ones, and argue against the use of weighting schemes. The Dice, Jaccard and cosine coefficients have the attractions of simplicity and normalization and have often been used for document clustering. (The formulas here are reconstructed in their standard form, with sums running over the L index terms.)

Dice coefficient:

    SIM(Di, Dj) = 2 * SUM(weightik * weightjk)
                  / (SUM(weightik^2) + SUM(weightjk^2))

If binary term weights are used, the Dice coefficient reduces to:

    SIM(Di, Dj) = 2C / (A + B)

where C is the number of terms that Di and Dj have in common, and A and B are the number of terms in Di and Dj.

Jaccard coefficient:

    SIM(Di, Dj) = SUM(weightik * weightjk)
                  / (SUM(weightik^2) + SUM(weightjk^2)
                     - SUM(weightik * weightjk))

Cosine coefficient:

    SIM(Di, Dj) = SUM(weightik * weightjk)
                  / (SUM(weightik^2) * SUM(weightjk^2))^(1/2)
16.2.3 The Similarity Matrix

Many clustering methods are based on a pairwise coupling of the most similar documents or clusters, so that the similarity between every pair of points must be known. This necessitates the calculation of the similarity matrix; when the similarity measure is symmetric (Sij = Sji), the lower triangular matrix is sufficient (Figure 16.1).

Figure 16.1: Similarity matrix

It should be noted that the similarity matrix can be the basis for identifying a nearest neighbor (NN), that is, finding the closest vector to a given vector from a set of N multidimensional vectors. The identification of an NN arises in many clustering algorithms, and for large data sets makes a significant contribution to the computational requirement. Calculating and storing the similarity matrix, or recalculating it when needed, provides a brute force approach to nearest neighbor identification, although there are a number of techniques available for introducing efficiency into the NN-finding process (Bentley et al. 1980, Murtagh 1985). However, these techniques are generally inappropriate for data sets with the high dimensionality typical of information retrieval applications. Use of the inverted file algorithm to calculate the similarity matrix or a row of it seems to be the best optimization technique available in these circumstances.

The inverted file algorithm is particularly useful in limiting the amount of computation required to calculate a similarity matrix, if the similarity measure used is one that results in a 0 value whenever a document-document or document-cluster pair have no terms in common (Willett 1980, Perry and Willett 1983). Only those document/cluster pairs that share at least one common term will have their similarity calculated; the remaining values in the similarity matrix are set to 0. The inverted file algorithm is as follows:

for ( docno = 0; docno < n; docno++ )
{
    for ( i = 0; i < doclength; i++ )
    {
        retrieve_inverted_list ( term[i] );
        for ( j = 0; j < invlength; j++ )
            counter[doc[j]]++;
    }
    for ( doc2 = 0; doc2 < n; doc2++ )
    {
        if (counter [doc2])
            calc_similarity( docno, doc2 );
    }
}

The document term list is used as an index to the inverted index lists that are needed for the similarity calculation. The inverted file algorithm can be effectively incorporated in the clustering algorithms described in this chapter when the calculation of the similarity matrix, or a single row of it, is required for a document collection.
16.3 CLUSTERING METHODS

16.3.1 Methods and Associated Algorithms

There are a very large number of ways of sorting N objects into M groups, a problem compounded by the fact that M is usually unknown. Most of the possible arrangements are of no interest; it is the role of a clustering method to identify a set of groups or clusters that reflects some underlying structure in the data. Moreover, there are many clustering methods available, which have differing theoretical or empirical bases and therefore produce different cluster structures. For a given clustering method, there may be a choice of clustering algorithms or means to implement the method. The choice of clustering method will determine the outcome; the choice of algorithm will determine the efficiency with which it is achieved. In this section, an overview of the clustering methods most used in information retrieval will be provided. The associated algorithms that are best suited to the processing of the large data sets found in information retrieval applications are discussed in sections 16.4 and 16.5.

16.3.2 Computational and Storage Requirements

In cases where the data set to be processed is very large, the resources required for cluster analysis may be considerable. A major component of the computation required is the calculation of the document-document or document-cluster similarity. The time requirement will be minimally O(NM), where M is the number of clusters, for the simpler reallocation methods; for the methods where the similarity matrix must be constructed, the proportionality is at least N2. Most of the preferred clustering methods have time requirements of at least O(N2). The storage requirement will be O(N) if the data set is stored, or O(N2) if the similarity matrix is stored. For large N this may be unacceptable. An alternative is to recalculate the similarity matrix from the stored data whenever it is needed to identify the current most similar pair, but this increases the time requirement by a factor of N2, and disk accesses may make processing time too large if the similarity matrix is stored on disk. However, if an efficient NN-finding algorithm can be incorporated into the clustering algorithm, considerable savings of processing time may be achieved.

Because of the heavy demands of processing and storage requirements, much of the early work on cluster analysis for information retrieval was limited to small data sets, often only a few hundred items. However,
improvements in processing and storage capacity and the introduction of efficient algorithms for implementing some clustering methods and finding nearest neighbors have made it feasible to cluster file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. The storage requirement will be O(N) if the data set is stored. considerable savings of processing time may be achieved.1 Methods and Associated Algorithms There are a very large number of ways of sorting N objects into M groups. Moreover. For a given clustering method. For large N this may be unacceptable. although there are a number of techniques available for introducing efficiency into the NN-finding process (Bentley et al. Most of the possible arrangements are of no interest. there are many clustering methods available. However. for the simpler reallocation methods. A major component of the computation required is the calculation of the documentdocument or document-cluster similarity. a problem compounded by the fact that M is usually unknown. it is the role of a clustering method to identify a set of groups or cluster that reflects some underlying structure in the data. 16. and disk accesses may make processing time too large if the similarity matrix is stored on disk..3. Because of the heavy demands of processing and storage requirements.4 and 16. the proportionality is at least N2. if an efficient NN-finding algorithm can be incorporated into the clustering algorithm. provides a brute force approach to nearest neighbor identification.
the last decade of work on clustering in IR retrieval has concentrated on the hierarchical agglomerative clustering methods (HACM. though this was not so in the IR field. cluster size. The hierarchical methods can be either agglomerative. Salton and Bergmark (1981) have pointed out that there is a high degree of parallelism in the calculation of a set of similarity values. so that large data sets can be partitioned.htm (7 of 27)7/3/2004 4:21:40 PM .. The computational requirement O(NM) is much lower than for the hierarchical methods if M << N. The divisive methods are less commonly used and few algorithms are available. Willett [1988]). and form of cluster representation are required. With improvements in computer resources.Information Retrieval: CHAPTER 16: CLUSTERING ALGORITHMS increasingly large data sets. described by Salton (1971). where no overlap is allowed.3 Survey of Clustering Methods Clustering methods are usually categorized according to the type of cluster structure they produce. and the value of the similarity function (level) at which each fusion occurred. and improved algorithms. these are known as partitioning methods.1 divisions of some cluster into a smaller cluster. The order of pairwise coupling of the objects in the data set is shown. The cluster structure resulting from a hierarchical agglomerative clustering method is often displayed as a dendrogram like that shown in Figure 16. The file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. beginning with all objects in a single cluster and progressing through N . Hierarchical methods Most of the early published work on cluster analysis employed hierarchical methods (Blashfield and Aldenderfer 1978). since a priori decisions about the number of clusters. and the cluster may be represented by a centroid or cluster representative that is indicative of the characteristics of the items it contains. 16. or divisive. criterion for cluster membership. 
Each item has membership in the cluster with which it is most similar. usually by partitioning the data set in some way and then reallocating items until some criterion is optimized. the nonhierarchical methods attempt to find an approximation. Since the large number of possible divisions of N items into M clusters make an optimal solution impossible..1 pairwise joins beginning from an unclustered data set.Books_Algorithms_Collection2ed/books/book5/chap16. with N . The nonhierarchical methods were used for most of the early work in document clustering when computational resources were limited. Nonhierarchical methods The nonhierarchical methods are heuristic in nature. only agglomerative methods will be discussed in this chapter.2. and parallel hardware also offers the potential for increased processing efficiency (Willett and Rasmussen 1990).3. the easy availability of software packages for cluster analysis. The simple nonhierarchical methods divide the data set of N objects into M clusters. see for example work on the SMART project. The more complex hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected.
The dendrogram is a useful representation when considering retrieval from a clustered set of documents, since it indicates the paths that the retrieval process may follow.

Figure 16.2: Dendrogram of a hierarchical classification

The most commonly used hierarchical agglomerative clustering methods and their characteristics are:

single link: The single link method joins, at each step, the most similar pair of objects that are not yet in the same cluster. It has some attractive theoretical properties (Jardine and Sibson 1971) and can be implemented relatively efficiently, so it has been widely used. However, it has a tendency toward formation of long straggly clusters, or chaining, which makes it suitable for delineating ellipsoidal clusters but unsuitable for isolating spherical or poorly separated clusters.

complete link: The complete link method uses the least similar pair between each of two clusters to determine the intercluster similarity; it is called complete link because all entities in a cluster are linked to one another within some minimum similarity. Small, tightly bound clusters are characteristic of this method.

group average link: As the name implies, the group average link method uses the average values of the pairwise links within a cluster to determine similarity. All objects contribute to intercluster similarity, resulting in a structure intermediate between the loosely bound single link clusters and the tightly bound complete link clusters. The group average method has ranked well in evaluative studies of clustering methods (Lorr 1983).

Ward's method: Ward's method is also known as the minimum variance method because it joins, at each stage, the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on the Euclidean distance between centroids. It tends to produce homogeneous clusters and a symmetric hierarchy, and its definition of a cluster center of gravity provides a useful way of representing a cluster. Tests have shown it to be good at recovering cluster structure, though it is sensitive to outliers and poor at recovering elongated clusters (Lorr 1983).

Two other HACM, the centroid and median methods, are sometimes used. In the centroid method, each cluster as it is formed is represented by the coordinates of a group centroid, and at each stage in the clustering the pair of clusters with the most similar mean centroid is merged. The median method is similar, but the centroids of the two merging clusters are not weighted proportionally to the size of the clusters. A disadvantage of these two methods is that a newly formed cluster may be more like some point than were its constituent points, resulting in reversals or inversions in the cluster hierarchy.
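To make these definitions concrete, the following small Python sketch (illustrative code, not from the chapter; the toy similarity function and data are invented) computes the single link, complete link, and group average similarities between two tiny clusters:

```python
# Illustrative sketch of three linkage criteria. The similarity measure
# here is a made-up toy function; any pairwise similarity would do.

def similarity(a, b):
    """Toy similarity: inversely related to the distance between points."""
    return 1.0 / (1.0 + abs(a - b))

def single_link(c1, c2):
    """The most similar (closest) pair defines intercluster similarity."""
    return max(similarity(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """The least similar pair defines intercluster similarity."""
    return min(similarity(a, b) for a in c1 for b in c2)

def group_average_link(c1, c2):
    """Average over all cross-cluster pairwise similarities."""
    sims = [similarity(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)

c1, c2 = [1.0, 2.0], [4.0, 8.0]
print(single_link(c1, c2))     # pair (2.0, 4.0): 1/3
print(complete_link(c1, c2))   # pair (1.0, 8.0): 1/8
print(group_average_link(c1, c2))
```

For any pair of clusters the complete link value can never exceed the group average value, which in turn cannot exceed the single link value; this ordering reflects the loosely bound to tightly bound spectrum described above.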
16.4 ALGORITHMS FOR NONHIERARCHICAL METHODS

16.4.1 Single Pass Methods

The single pass method is particularly simple since it requires that the data set be processed only once. The general algorithm is as follows:

1. Assign the first document D1 as the representative for C1.

2. For Di, calculate the similarity S with the representative for each existing cluster.

3. If Smax is greater than a threshold value ST, add the item to the corresponding cluster and recalculate the cluster representative; otherwise, use Di to initiate a new cluster.

4. If an item Di remains to be clustered, return to step 2.

Though the single pass method has the advantage of simplicity, it is often criticized for its tendency to produce large clusters early in the clustering pass, and because the clusters formed are not independent of the order in which the data set is processed. It is sometimes used to form the groups that are used to initiate reallocation clustering.

An example of a single pass algorithm developed for document clustering is the cover coefficient algorithm (Can and Ozkarahan 1984). In this algorithm, for a document Di, the cover coefficient is a measure that incorporates the extent to which it is covered by Dj and the uniqueness of Di, that is, the extent to which it is covered by itself. A set of documents is selected as cluster seeds, and then each document is assigned to the cluster seed that maximally covers it.

16.4.2 Reallocation Methods

The reallocation methods operate by selecting some initial partition of the data set and then moving items from cluster to cluster to obtain an improved partition. Anderberg (1973) discusses some of the criteria that have been suggested to establish an initial partition and to monitor the improvement achieved by reallocation. A general algorithm is:

1. Select M cluster representatives or centroids.

2. For i = 1 to N, assign Di to the most similar centroid.

3. For j = 1 to M, recalculate the cluster centroid Cj.
4. Repeat steps 2 and 3 until there is little or no change in cluster membership during a pass through the file.

The single pass and reallocation methods were used in early work in cluster analysis in IR, such as the clustering experiments carried out in the SMART project (Salton 1971). Their time and storage requirements are much lower than those of the HACM, and much larger data sets can be processed. However, with improved processing capability and more efficient hierarchical algorithms, the HACMs are now usually preferred in practice, and the nonhierarchical methods will not be considered further in this chapter.

16.5 ALGORITHMS FOR HIERARCHICAL METHODS

16.5.1 General Algorithm for the HACM

All of the hierarchical agglomerative clustering methods can be described by a general algorithm:

1. Identify the two closest points and combine them in a cluster.

2. Identify and combine the next two closest points (treating existing clusters as points).

3. If more than one cluster remains, return to step 2.

Individual HACM differ in the way in which the most similar pair is defined, and in the means used to represent a cluster. Lance and Williams (1966) proposed a general combinatorial formula, the Lance-Williams dissimilarity update formula, for calculating dissimilarities between new clusters and existing points, based on the dissimilarities prior to formation of the new cluster. If objects Ci and Cj have just been merged to form cluster Cij, the dissimilarity d between the new cluster and any existing cluster Ck is given by:

    d(Cij, Ck) = αi d(Ci, Ck) + αj d(Cj, Ck) + β d(Ci, Cj) + γ |d(Ci, Ck) - d(Cj, Ck)|

This formula can be modified to accommodate a variety of HACM by choice of the values of αi, αj, β, and γ. The hierarchical clustering methods previously discussed are presented in Table 16.1 in the context of their Lance-Williams parameters and cluster centers.

Table 16.1: Characteristics of HACM
Notes: mi is the number of items in Ci. For some methods the choice of dissimilarity measure is constrained; in particular, the dissimilarity measure used for Ward's method must be the increase in variance (section 16.5.5).

There are three approaches to implementation of the general HACM (Anderberg 1973), each of which has implications for the time and storage requirements for processing. In the stored matrix approach, an N × N matrix containing all pairwise dissimilarity values is stored, and the Lance-Williams update formula makes it possible to recalculate the dissimilarity between cluster centers using only the stored values. The time requirement is O(N²), rising to O(N³) if a simple serial scan of the similarity matrix is used; the storage requirement is O(N²). A stored data approach has only an O(N) storage requirement, but the need to recalculate the pairwise dissimilarity values for each fusion leads to an O(N³) time requirement. In the sorted matrix approach, the dissimilarity matrix is calculated (O(N²)) and then sorted (O(N² log N²)) prior to the construction of the hierarchy (O(N²)). The data set need not be stored and the similarity matrix is processed serially, which minimizes disk accesses. However, this approach is suitable only for the single link and complete link methods, which do not require the recalculation of similarities during clustering.

In addition to the general algorithm, there are also algorithms specific to individual HACM. These include several algorithms for the single link method, Defays' algorithm for the complete link method, and a nearest-neighbor algorithm for Ward's method, all of which are discussed below.

16.5.2 Single Link Method

The single link method merges at each stage the closest previously unlinked pair of points in the data set. Since the distance between two clusters is defined as the distance between the closest pair of points each of which is in one of the two clusters, no cluster centroid or representative is required, and there is no need to recalculate the similarity matrix during processing. This makes the method attractive both from a computational and a storage perspective, and it also has desirable mathematical properties (Jardine and Sibson 1971), so that it is one of the most widely used of the HACM.

A number of algorithms for the single link method have been reviewed by Rohlf (1982), including related minimal spanning tree algorithms. The computational requirements range from O(N log N) to O(N⁵). Many of these algorithms are not suitable for information retrieval applications, where the data sets have large N and high dimensionality; however, algorithms have been developed that are the optimal O(N²) in time and O(N) in space (Murtagh 1984). The single link algorithms discussed below are those that have been found most useful for information retrieval.

Van Rijsbergen algorithm

Van Rijsbergen (1971) developed an algorithm to generate the single link hierarchy that allowed the similarity values to be presented in any order and therefore did not require the storage of the similarity matrix. It is O(N²) in time and O(N) in storage requirements, and it generates the hierarchy in the form of a data structure that both facilitates searching and is easily updated. It was the first to be applied to a relatively large collection of 11,613 documents (Croft 1977). However, most later work with large collections has used either the SLINK or Prim-Dijkstra algorithm, which are quite simple to implement.

SLINK algorithm

The SLINK algorithm (Sibson 1973) is optimally efficient, O(N²) for computation and O(N) for storage, and therefore suitable for large data sets. It is simply a sequence of operations by which a representation of the single link hierarchy can be recursively updated; the dendrogram is built by inserting one point at a time into the representation. The hierarchy is generated in a form known as the pointer representation, which consists of two functions π and λ for a data set numbered 1, . . ., N, with the following conditions:

    π(N) = N
    π(i) > i for i < N
    λ(N) = ∞
    λ(π(i)) > λ(i) for i < N

In simple terms, λ(i) is the lowest level (distance or dissimilarity) at which i is no longer the last (i.e., the highest numbered) object in its cluster, and π(i) is the last object in the cluster it joins at this level. A mathematical definition for these parameters is provided by Sibson (1973), and Fortran code for SLINK is given in the original paper. In the pseudocode below, three arrays of dimension N are used: pi (to hold the pointer representation), lambda (to hold the distance value associated with each pointer), and distance (to process the current row of the distance matrix); next indicates the current pointer for a point being examined.

/* initialize pi and lambda for a single point representation */
pi[0] = 0;
lambda[0] = MAXINT;

/* iteratively add the remaining N-1 points to the hierarchy */
for (i = 1; i < N; i++) {
    pi[i] = i;
    lambda[i] = MAXINT;

    /* calculate and store a row of the distance matrix for i */
    for (j = 0; j < i; j++)
        distance[j] = calc_distance(i, j);

    for (j = 0; j < i; j++) {
        next = pi[j];
        if (lambda[j] < distance[j])
            distance[next] = min(distance[next], distance[j]);
        else {
            distance[next] = min(lambda[j], distance[next]);
            pi[j] = i;
            lambda[j] = distance[j];
        }
    }

    /* relabel clusters if necessary */
    for (j = 0; j < i; j++) {
        next = pi[j];
        if (lambda[next] < lambda[j])
            pi[j] = i;
    }
}

For output in the form of a dendrogram, the pointer representation can be converted into the packed representation. This can be accomplished in O(N²) time (with a small coefficient for N²) and O(N) space. A FORTRAN subroutine to effect the transformation is provided in Sibson's original paper.
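The pseudocode above translates almost line for line into Python. The following sketch (illustrative code, not from the chapter; indices are 0-based and MAXINT is represented by floating-point infinity) returns the pointer representation for a small one-dimensional data set:

```python
# Illustrative Python transcription of the SLINK pseudocode above.
# Returns the pointer representation (pi, lambda) of the single link
# hierarchy for an arbitrary dissimilarity function.

INF = float("inf")

def slink(points, dist):
    """points: list of objects; dist(a, b): their dissimilarity."""
    n = len(points)
    pi = [0] * n       # pointer representation
    lam = [INF] * n    # level at which i stops being last in its cluster
    for i in range(1, n):
        pi[i] = i
        lam[i] = INF
        # row i of the distance matrix
        d = [dist(points[i], points[j]) for j in range(i)]
        for j in range(i):
            nxt = pi[j]
            if lam[j] < d[j]:
                d[nxt] = min(d[nxt], d[j])
            else:
                d[nxt] = min(lam[j], d[nxt])
                pi[j] = i
                lam[j] = d[j]
        # relabel clusters if necessary
        for j in range(i):
            if lam[pi[j]] < lam[j]:
                pi[j] = i
    return pi, lam

pi, lam = slink([0.0, 1.0, 10.0], lambda a, b: abs(a - b))
print(pi, lam)    # [1, 2, 2] [1.0, 9.0, inf]
```

In the output, point 0 joins the cluster whose last member is point 1 at level 1.0, and that cluster joins point 2 at level 9.0, which is exactly the single link distance between {0, 1} and {10}.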
Minimal spanning tree algorithms
A minimal spanning tree (MST) is a tree linking N objects with N - 1 connections so that there are no loops and the sum of the N - 1 dissimilarities is minimized. It can be shown that all the information required to generate a single link hierarchy for a set of points is contained in their MST (Gower and Ross 1969). Once an MST has been constructed, the corresponding single link hierarchy can be generated in O (N2) operations; or the data structures for the MST can be modified so that the hierarchy can be built simultaneously (Rohlf 1982). Two fundamental construction principles for MSTs are: 1. Any isolated point can be connected to a nearest neighbor. 2. Any isolated fragment (subset of an MST) can be connected to a nearest neighbor by a shortest available link. The Prim-Dijkstra algorithm (Dijkstra 1976) consists of a single application of principle 1, followed by N - 1 iterations of principle 2, so that the MST is grown by enlarging a single fragment: 1. Place an arbitrary point in the MST and connect its nearest neighbor to it. 2. Find the point not in the MST closest to any point in the MST and add it to the fragment. 3. If a point remains that is not in the fragment, return to step 2.
The algorithm requires O(N²) time if, for each point not in the fragment, the identity of and distance to its nearest neighbor in the fragment is stored. As each new point is added to the fragment, the distance from that point to each point not in the fragment is calculated, and the NN information is updated if necessary. Since the dissimilarity matrix need not be stored, the storage requirement is O(N). A FORTRAN version of the Prim-Dijkstra algorithm is provided by Whitney (1972). The algorithm here uses arrays npoint and ndistance to hold information on the nearest in-tree neighbor for each point, and notintree is a list of the nt unconnected points. Lastpoint is the latest point added to the tree.

/* initialize lists */
for (i = 0; i < n-1; i++) {
    ndistance[i] = MAXINT;
    notintree[i] = i;
}

/* arbitrarily place the Nth point in the MST */
lastpoint = n - 1;
nt = n - 1;

/* grow the tree an object at a time */
for (i = 0; i < n-1; i++) {
    /* consider the last point added to the tree for the NN list */
    for (j = 0; j < nt; j++) {
        D = calculate_distance(lastpoint, notintree[j]);
        if (D < ndistance[j]) {
            npoint[j] = lastpoint;
            ndistance[j] = D;
        }
    }

    /* find the unconnected point closest to a point in the tree */
    nj = index_of_min(ndistance);

    /* add this point to the MST; store this point and its
       clustering level */
    lastpoint = notintree[nj];
    store_in_MST(lastpoint, npoint[nj], ndistance[nj]);

    /* remove lastpoint from the notintree list;
       close up the npoint and ndistance lists */
    nt = nt - 1;
    notintree[nj] = notintree[nt];
    npoint[nj] = npoint[nt];
    ndistance[nj] = ndistance[nt];
}
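The same fragment-growing procedure is easy to express in Python. The sketch below (illustrative code, not Whitney's FORTRAN) returns the MST edges together with their lengths, which are exactly the fusion levels of the single link hierarchy:

```python
# Illustrative Python sketch of the Prim-Dijkstra MST construction above.
# Grows a single fragment, keeping for every unconnected point the
# identity of and distance to its nearest in-tree neighbor.

INF = float("inf")

def prim_mst(points, dist):
    """Return MST edges as (point, nearest_in_tree_point, distance)."""
    n = len(points)
    not_in_tree = list(range(n - 1))   # point n-1 arbitrarily seeds the tree
    npoint = [None] * (n - 1)          # nearest in-tree neighbor
    ndist = [INF] * (n - 1)            # distance to that neighbor
    last = n - 1
    edges = []
    while not_in_tree:
        # update NN information against the point just added to the tree
        for j, p in enumerate(not_in_tree):
            d = dist(points[last], points[p])
            if d < ndist[j]:
                ndist[j] = d
                npoint[j] = last
        # connect the closest unconnected point to the fragment
        nj = min(range(len(not_in_tree)), key=ndist.__getitem__)
        last = not_in_tree[nj]
        edges.append((last, npoint[nj], ndist[nj]))
        # close up the lists
        for arr in (not_in_tree, npoint, ndist):
            arr[nj] = arr[-1]
            arr.pop()
    return edges

edges = prim_mst([0.0, 1.0, 10.0], lambda a, b: abs(a - b))
print(edges)    # [(1, 2, 9.0), (0, 1, 1.0)]
```

Sorting the N - 1 edge weights gives the levels at which single link fusions occur, which is how an MST can be converted into the corresponding hierarchy.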
16.5.3 Complete Link Method
The small, tightly bound clusters typical of the complete link method have performed well in comparative studies of document retrieval (Voorhees 1986a). Unfortunately, it is difficult to apply to large data sets since there does not seem to be an algorithm more effective than the stored data or stored matrix approach to the general HACM algorithm.
Defays' CLINK algorithm
The best-known algorithm for implementing the complete link method is the CLINK algorithm developed by Defays (1977). It is presented in a form analogous to the SLINK algorithm, uses the same three arrays (pi, lambda, and distance), and like SLINK, produces output in the form of the pointer representation. Defays presents a CLINK subroutine which allows his algorithm to be incorporated into Sibson's original FORTRAN program for SLINK. CLINK is efficient, requiring O (N2) time, O (N) space, but it does not seem to generate an exact hierarchy and has given unsatisfactory results in some information retrieval experiments (El-Hamdouchi and Willett 1989).
Voorhees algorithm
The Voorhees algorithm (1986b) for the complete link method has been used to cluster relatively large document collections, with better retrieval results than the CLINK algorithm (El-Hamdouchi and Willett 1989). It is a variation on the sorted matrix approach, and is based on the fact that if the similarities between all pairs of documents are processed in descending order, two clusters of size mi and mj can be merged as soon as the (mi × mj)th similarity between documents in the respective clusters is reached. This requires a sorted list of document-document similarities, and a means of counting the number of similarities seen between any two active clusters. The large number of zero-valued similarities in a typical document collection makes it more efficient than its worst case O(N³) time, O(N²) storage requirement would suggest; however, it is still very demanding of resources, and El-Hamdouchi and Willett found it impractical to apply to the largest of the collections they studied.
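The counting idea can be sketched as follows (an illustrative Python reconstruction of the idea, not Voorhees' published pseudocode): similarities are processed in descending order, and two clusters merge once every cross-cluster pair has been seen, since the last such pair is by definition the least similar one.

```python
# Illustrative sketch of the pair-counting idea behind the Voorhees
# complete link algorithm: merge two clusters when all mi * mj cross
# pairs have appeared in the descending similarity stream.

from itertools import combinations

def complete_link_by_counting(docs, sim):
    n = len(docs)
    clusters = {i: {i} for i in range(n)}   # cluster id -> member docs
    member = {i: i for i in range(n)}       # doc -> cluster id
    seen = {}                               # (id, id) -> cross pairs seen
    merges = []
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: sim(docs[p[0]], docs[p[1]]), reverse=True)
    for a, b in pairs:
        ca, cb = member[a], member[b]
        if ca == cb:
            continue
        key = (min(ca, cb), max(ca, cb))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] < len(clusters[ca]) * len(clusters[cb]):
            continue
        # the least similar cross pair has been reached: merge here
        merges.append((sim(docs[a], docs[b]), key))
        new = max(clusters) + 1
        clusters[new] = clusters.pop(ca) | clusters.pop(cb)
        for d in clusters[new]:
            member[d] = new
        # fold counts involving the old ids into the new cluster id
        for old_key, cnt in list(seen.items()):
            if ca in old_key or cb in old_key:
                del seen[old_key]
                other = old_key[0] if old_key[1] in (ca, cb) else old_key[1]
                if other not in (ca, cb):
                    nk = (min(new, other), max(new, other))
                    seen[nk] = seen.get(nk, 0) + cnt
    return merges

merges = complete_link_by_counting([0.0, 1.0, 3.0], lambda a, b: -abs(a - b))
print([level for level, _ in merges])    # [-1.0, -3.0]
```

Here negated distance serves as a toy similarity; the second merge occurs at level -3.0 because the complete link distance between {0, 1} and {3} is the largest cross-pair distance, 3.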
16.5.4 Group Average Link Method
Because the similarity between two clusters is determined by the average value of all the pairwise links between points for which each is in one of the two clusters, no general O (N2) time, O (N) space algorithm is known. The general HACM algorithm can be used, but with O (N3) time for the stored data approach and O (N2) storage for the stored matrix approach, implementation may be impractical for a large collection. However, a more efficient special case algorithm is available.
Voorhees algorithm
Voorhees (1986b) has pointed out that the group average link hierarchy can be constructed in O (N2) time, O (N) space if the similarity between documents chosen is the inner product of two vectors using appropriately weighted vectors. In this case, the similarity between a cluster centroid and any document is equal to the mean similarity between the document and all the documents in the cluster. Since the centroid of the cluster is the mean of all the document vectors, the centroids can be used to compute the similarities between the clusters while requiring only O(N) space. Voorhees was able to cluster a document collection of 12,684 items using this algorithm, for which she provides pseudocode. Using Voorhees' weighting scheme and intercentroid similarity, El-Hamdouchi (1987) was able to implement the group average link method using the reciprocal nearest neighbor algorithm described below for Ward's method.
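The property Voorhees exploits is easy to verify numerically. In the sketch below (illustrative code with made-up vectors), the inner product between a cluster centroid and a document equals the mean of the document's inner products with the cluster members:

```python
# Numerical check of the property underlying the O(N) space group
# average method: with inner product similarity, centroid-document
# similarity equals the mean member-document similarity.

def inner(u, v):
    return sum(x * y for x, y in zip(u, v))

def centroid(cluster):
    n = len(cluster)
    return [sum(vec[k] for vec in cluster) / n
            for k in range(len(cluster[0]))]

cluster = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [2.0, 2.0, 2.0]]
doc = [0.5, 1.0, 1.5]
via_centroid = inner(centroid(cluster), doc)
mean_pairwise = sum(inner(c, doc) for c in cluster) / len(cluster)
assert abs(via_centroid - mean_pairwise) < 1e-9   # both are 4.0
```

Because of this identity, only the N document vectors and the current centroids need be kept, rather than an N × N similarity matrix.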
16.5.5 Ward's Method
Ward's method (Ward 1963; Ward and Hook 1963) follows the general algorithm for the HACM, where the object/cluster pair joined at each stage is the one whose merger minimizes the increase in the total within-group squared deviation about the means, or variance. When two points Di and Dj are clustered, the increase in variance Iij is given by:

    Iij = (mi mj / (mi + mj)) d²ij

where mi is the number of objects in Di and d²ij is the squared Euclidean distance, given by:

    d²ij = Σ (k = 1 to L) (xik - xjk)²

where Di is represented by a vector (xi1, xi2, . . . , xiL) in L-dimensional space. The cluster center for a pair of points Di and Dj is given by:

    (mi Di + mj Dj) / (mi + mj)
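The closed form for Iij can be checked directly against the definition. The following sketch (illustrative code with made-up one-dimensional data) compares the formula with the change in total within-group sum of squared deviations caused by a merge:

```python
# Numerical check of Ward's increase-in-variance formula: the closed
# form Iij = mi*mj/(mi+mj) * d2(centroids) equals the direct change in
# within-group sum of squared deviations after merging two clusters.

def sse(cluster):
    """Total squared deviation about the cluster mean (1-D points)."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def ward_increase(ci, cj):
    """Closed form using only cluster sizes and centroids."""
    mi, mj = len(ci), len(cj)
    diff = sum(ci) / mi - sum(cj) / mj
    return mi * mj * diff * diff / (mi + mj)

a = [0.0, 2.0]    # centroid 1.0, within-group SSE 2.0
b = [10.0]        # centroid 10.0, SSE 0.0
direct = sse(a + b) - sse(a) - sse(b)
print(direct, ward_increase(a, b))    # 54.0 54.0
```

Because the increase depends only on sizes and centroids, Ward clustering never needs to revisit the individual points of a merged cluster, which is what makes the nearest neighbor algorithms below practical.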
Reciprocal nearest neighbor algorithm
The mathematical properties of Ward's method make it a suitable candidate for a reciprocal nearest neighbor (RNN) algorithm (Murtagh 1983, 1985). For any point or cluster, there exists a chain of nearest neighbors (NNs) so that

    NN(i) = j; NN(j) = k; . . . ; NN(p) = q; NN(q) = p

The chain must end in some pair of objects that are RNNs, since the interobject distances are
monotonically decreasing along the chain (ties must be arbitrarily resolved). An efficient clustering algorithm can be based on the process of following a chain of nearest neighbors: 1. Select an arbitrary point. 2. Follow the NN chain from this point till an RNN pair is found. 3. Merge these two points and replace them with a single point. 4. If there is a point in the NN chain preceding the merged points, return to step 2; otherwise return to step 1. Stop when only one point remains. This algorithm requires O(N2) computation but only O(N) storage. It carries out agglomerations in restricted spatial regions, rather than in strict order of increasing dissimilarity, but still results (for Ward's method) in a hierarchy that is unique and exact. This is designated the Single Cluster Algorithm since it carries out one agglomeration per iteration; a Multiple Cluster Algorithm, suitable for parallel processing, has also been proposed (Murtagh 1985).
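The chain-following procedure can be sketched as below (illustrative Python for one-dimensional data, using the Ward merge cost between clusters kept as centroid/size pairs):

```python
# Illustrative sketch of the single cluster (NN-chain) algorithm above
# for Ward's method. Clusters are (centroid, size) pairs; an RNN pair
# found at the end of the chain is merged into a new cluster.

def ward_dist(a, b):
    """Ward merge cost between clusters given as (centroid, size)."""
    (ca, ma), (cb, mb) = a, b
    return ma * mb * (ca - cb) ** 2 / (ma + mb)

def rnn_cluster(points):
    clusters = {i: (x, 1) for i, x in enumerate(points)}
    next_id = len(points)
    chain, merges = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))   # arbitrary starting point
        top = chain[-1]
        nn = min((c for c in clusters if c != top),
                 key=lambda c: ward_dist(clusters[top], clusters[c]))
        if len(chain) > 1 and nn == chain[-2]:
            a, b = chain.pop(), chain.pop()      # an RNN pair: merge it
            (ca, ma), (cb, mb) = clusters.pop(a), clusters.pop(b)
            clusters[next_id] = ((ma * ca + mb * cb) / (ma + mb), ma + mb)
            merges.append(({a, b}, next_id))
            next_id += 1
        else:
            chain.append(nn)
    return merges

merges = rnn_cluster([0.0, 1.0, 10.0, 11.0])
print(merges)    # the two tight pairs merge before the final join
```

On the four points above, {0, 1} and {2, 3} are merged first, in restricted spatial regions rather than in strict order of increasing dissimilarity, and the final merge joins the two resulting clusters.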
16.6 EVALUATION AND VALIDATION
As Dubes and Jain (1980, p. 179) point out: The thoughtful user of a clustering method or algorithm must answer two questions: (i) Which clustering method is appropriate for a particular data set? (ii) How does one determine whether the results of a clustering method truly characterize the data? These are important questions because any clustering method will produce a set of clusters, and results are sought which reflect some "natural" grouping rather than one that arises as an artifact of the method. The answer to the first question can be found in evaluative studies of clustering methods, and to the second question, in validation techniques for clustering solutions.
16.6.1 Evaluation
Many evaluative studies have attempted to determine the "best" clustering method (Lorr 1983) by applying a range of clustering methods to test data sets and comparing the quality of the results, for example by using artificially created data structures, or comparing the cluster results to a classification established by experts in the field. Even under laboratory conditions it is difficult to evaluate clustering methods, since each method has different properties and strengths. The results of these studies do not suggest a single best method, though Ward's method, and in more recent studies, the group average
method, have performed well. It is usually advisable to apply more than one clustering method and use some validation method to check the reliability of the resulting cluster structure. For retrieval purposes, the "best" method for clustering a document collection is that which provides the most effective retrieval in response to a query. Several evaluative studies have taken this approach, using standard test collections of documents and queries. Most of the early work used approximate clustering methods, or the least demanding of the HACM, the single link method, a restriction imposed by limited processing resources. However, two recent studies are noteworthy for their use of relatively large document collections for the evaluation of a variety of HACM (El-Hamdouchi and Willett 1989; Voorhees 1986a). Voorhees compared single link, complete link, and group average methods, using document collections of up to 12,684 items, while El-Hamdouchi and Willett compared these three methods plus Ward's method on document collections of up to 27,361 items. Voorhees found complete link most effective for larger collections, with complete and group average link comparable for smaller collections; single link hierarchies provided the worst retrieval performance. El-Hamdouchi and Willett found group average most suitable for document clustering. Complete link was not as effective as in the Voorhees study, though this may be attributed to use of Defays' CLINK algorithm. As noted in section 16.8.1, there are several ways in which retrieval from a clustered document collection can be performed, making comparisons difficult when using retrieval as an evaluative tool for clustering methods.
16.6.2 Validation
Cluster validity procedures are used to verify whether the data structure produced by the clustering method can be used to provide statistical evidence of the phenomenon under study. Dubes and Jain (1980) survey the approaches that have been used, categorizing them on their ability to answer four questions: is the data matrix random? how well does a hierarchy fit a proximity matrix? is a partition valid? and which individual clusters appearing in a hierarchy are valid? Willett (1988) has reviewed the application of validation methods to clustering of document collections, primarily the application of the random graph hypothesis and the use of distortion measures. An approach that is carried out prior to clustering is also potentially useful. Tests for clustering tendency attempt to determine whether worthwhile retrieval performance would be achieved by clustering a data set, before investing the computational resources which clustering the data set would entail. El-Hamdouchi and Willett (1987) describe three such tests. The overlap test is applied to a set of documents for which query-relevance judgments are available. All the relevant-relevant (RR) and relevant-nonrelevant (RNR) interdocument similarities are calculated for a given query, and the overlap (the fraction of the RR and RNR distributions that is common to both) is calculated. Collections with a low overlap value are expected to be better suited to clustering than those with high overlap values. Voorhees' nearest neighbor test considers, for each relevant document for a query, how many of its n nearest neighbors are also relevant; by averaging over all relevant documents for all queries in a test collection, a single indicator for a collection can be obtained. The density test is defined as the total number of postings in the document collection divided by the product of the number of documents and the number of terms that have been used for the indexing of those documents. It is particularly useful
because it does not require any predetermined query-relevance data or any calculation of interdocument similarities. Of the three tests, the density test provided the best indication of actual retrieval performance from a clustered data set. The goal of a hierarchical clustering process may be to partition the data set into some unknown number of clusters M (which may be visualized as drawing a horizontal line across the dendrogram at some clustering level). This requires the application of a stopping rule, a statistical test to predict the clustering level that will determine M. Milligan and Cooper (1985) evaluated and ranked 30 such rules, one or more of which is usually present in a software package for cluster analysis (though not necessarily those ranked highest by Milligan and Cooper).
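The density test described above is simple enough to state in a few lines of code. The sketch below (illustrative, with a made-up toy collection) computes the number of postings divided by the product of the document and term counts:

```python
# Illustrative sketch of the density test: total postings divided by
# (number of documents x number of indexing terms). The toy collection
# is made up for demonstration.

def density(doc_terms):
    """doc_terms: for each document, the set of terms indexing it."""
    n_docs = len(doc_terms)
    vocabulary = set().union(*doc_terms)
    postings = sum(len(terms) for terms in doc_terms)
    return postings / (n_docs * len(vocabulary))

docs = [{"cluster", "retrieval"},
        {"cluster", "index"},
        {"cluster", "retrieval", "index"}]
print(density(docs))    # 7 postings / (3 docs * 3 terms), about 0.778
```

Since the computation needs only the inverted file statistics, it can be run before committing to the cost of clustering or of computing interdocument similarities.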
16.7 UPDATING THE CLUSTER STRUCTURE
In many information retrieval environments, the collection is dynamic, with new items being added on a regular basis and, less frequently, old items withdrawn. Since clustering a large data set is resource-intensive, some mechanism for updating the cluster structure without the need to recluster the entire collection is desirable. Relatively little work has been done on methods for cluster maintenance (Can and Ozkarahan 1989), particularly for the hierarchical methods. In certain cases, update of the cluster structure is implicit in the clustering algorithm. This is true of both the van Rijsbergen and SLINK algorithms for the single link method, and the CLINK algorithm for the complete link method, all of which operate by iteratively inserting a document into an existing hierarchy. Where the application uses a partitioned data set (from a nonhierarchical or hierarchical method), new items may simply be added to the most similar partition until the cluster structure becomes distorted and it is necessary to regenerate it. For a few methods, cluster update has been specifically incorporated. Crouch's reallocation algorithm includes a mechanism for cluster maintenance (Crouch 1975). Can and Ozkarahan (1989) review the approaches that have been taken for cluster maintenance and propose a strategy for dynamic cluster maintenance based on their cover coefficient concept.
16.8 DOCUMENT RETRIEVAL FROM A CLUSTERED DATA SET
Document clustering has been studied because of its potential for improving the efficiency of retrieval, for improving the effectiveness of retrieval, and because it provides an alternative to Boolean or best match retrieval. Initially the emphasis was on efficiency: document collections were partitioned, using nonhierarchical methods, and queries were matched against cluster centroids, which reduced the number of query-document comparisons that were necessary in a serial search. Studies of retrieval from partitioned document collections showed that though retrieval efficiency was achieved, there was a decrease in retrieval effectiveness (Salton 1971). Subsequent study has concentrated on the effectiveness of retrieval from hierarchically clustered document collections, based on the cluster hypothesis, which
states that associations between documents convey information about the relevance of documents to requests (van Rijsbergen 1979).
16.8.1 Approaches to Retrieval
There are several ways in which a query can be matched against the documents in a hierarchy (Willett 1988). A top-down search involves entering the tree at the root and matching the query against the cluster at each node, moving down the tree following the path of greater similarity. The search is terminated according to some criterion, for instance when the cluster size drops below the number of documents desired, or when the query-cluster similarity begins to decrease. A single cluster is retrieved when the search is terminated. Since it is difficult to adequately represent the very large top-level clusters, a useful modification is to eliminate the top-level clusters by applying a threshold clustering level to the hierarchy to obtain a partition, and to use the best of these mid-level clusters as the starting point for the top-down search. The top-down strategy has been shown to work well with the complete link method (Voorhees 1986a).

A bottom-up search begins with some document or cluster at the base of the tree and moves up until the retrieval criterion is satisfied; the beginning document may be an item known to be relevant prior to the search, or it can be obtained by a best match search of documents or lowest-level clusters. Comparative studies suggest that the bottom-up search gives the best results (apart from the complete link method), particularly when the search is limited to the bottom-level clusters (Willett 1988). Output may be based on retrieval of a single cluster, or the top-ranking clusters may be retrieved to produce a predetermined number of either documents or clusters; in the latter case, the documents retrieved may themselves be ranked against the query.

A simple retrieval mechanism is based on nearest neighbor clusters, that is, retrieving a document and the document most similar to it. Griffiths et al. (1984) determined that, for a variety of test collections, search performance comparable to or better than that obtainable from nonclustered collections could be obtained using this method.
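The top-down strategy described above can be sketched in a few lines. This is only an illustrative sketch: the dictionary-based tree encoding, the cosine similarity measure, and the stopping rule (cluster size drops below the number of documents desired) are assumptions for the example, not a prescribed implementation.

```python
import math

def cosine(a, b):
    # similarity between two term-weight dictionaries
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    den = (math.sqrt(sum(w * w for w in a.values())) *
           math.sqrt(sum(w * w for w in b.values())))
    return num / den if den else 0.0

def top_down_search(node, query, desired_size):
    # enter at the root and follow the path of greater query-cluster
    # similarity; stop when the cluster is small enough (or is a leaf)
    while node.get('children') and len(node['docs']) > desired_size:
        node = max(node['children'], key=lambda c: cosine(query, c['centroid']))
    return node['docs']

# toy two-level hierarchy (hypothetical data)
leaf_a = {'centroid': {'cat': 1.0}, 'docs': ['d1', 'd2']}
leaf_b = {'centroid': {'dog': 1.0}, 'docs': ['d3', 'd4']}
root = {'centroid': {'cat': 0.5, 'dog': 0.5},
        'docs': ['d1', 'd2', 'd3', 'd4'],
        'children': [leaf_a, leaf_b]}
```

A query about "cat" descends into the left cluster, one about "dog" into the right.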
16.8.2 Cluster Representatives
A centroid or cluster representative is a record that is used to represent the characteristics of the documents in a cluster. It is required in retrieval so that the degree of similarity between a query and a cluster can be determined; it is also needed in the nonhierarchical methods, where document-cluster similarity must be determined in order to add documents to the most similar cluster. Ranks or frequencies have been used to weight terms in the representative; usually a threshold is applied to eliminate less significant terms and shorten the cluster representative. A binary representative may also be used, for example including a term if it occurs in more than log2 m of the m documents in the cluster (Jardine and van Rijsbergen 1971).
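The binary representative with a log2 m occurrence threshold can be sketched directly; the set-of-terms document encoding here is an illustrative assumption.

```python
import math

def binary_representative(cluster):
    # cluster: list of document term sets; include a term in the
    # representative if it occurs in more than log2(m) of the m documents
    m = len(cluster)
    threshold = math.log2(m)
    counts = {}
    for doc in cluster:
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
    return {t for t, n in counts.items() if n > threshold}

# eight documents: log2(8) = 3, so a term must occur in at least 4 of them
docs = [{'a', 'b'}, {'a'}, {'a'}, {'a'},
        {'b'}, {'b'}, {'c'}, set()]
```

Here 'a' (4 occurrences) survives the threshold while 'b' (3) and 'c' (1) are eliminated, shortening the representative.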
file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrD...ooks_Algorithms_Collection2ed/books/book5/chap16.htm (22 of 27)7/3/2004 4:21:40 PM
Information Retrieval: CHAPTER 16: CLUSTERING ALGORITHMS
16.9 CONCLUSION
Cluster analysis is an effective technique for structuring a data set, and a wide choice of methods and algorithms exists. It is important to consider the questions raised in section 16.1.2 regarding the potential application, the selection of a method and algorithm and the parameters associated with them, and the evaluation and validation of the results. Much of the work to date in cluster analysis has been limited by the considerable resources required by clustering, and only recently have results from relatively large-scale clustering become available.
REFERENCES
ALDENDERFER, M. S., and R. K. BLASHFIELD. 1984. Cluster Analysis. Beverly Hills: Sage.
ANDERBERG, M. R. 1973. Cluster Analysis for Applications. New York: Academic.
BENTLEY, J. L., B. W. WEIDE, and A. C. YAO. 1980. "Optimal Expected-Time Algorithms for Closest Point Problems." ACM Transactions on Mathematical Software, 6, 563-80.
BLASHFIELD, R. K., and M. S. ALDENDERFER. 1978. "The Literature on Cluster Analysis." Multivariate Behavioral Research, 13, 271-95.
CAN, F., and E. A. OZKARAHAN. 1984. "Two Partitioning Type Clustering Algorithms." J. American Society for Information Science, 35, 268-76.
CAN, F., and E. A. OZKARAHAN. 1989. "Dynamic Cluster Maintenance." Information Processing & Management, 25, 275-91.
CROFT, W. B. 1977. "Clustering Large Files of Documents Using the Single-Link Method." J. American Society for Information Science, 28, 341-44.
CROUCH, C. J. 1988. "A Cluster-Based Approach to Thesaurus Construction," in 11th International Conference on Research and Development in Information Retrieval. New York: ACM, 309-20.
CROUCH, D. B. 1975. "A File Organization and Maintenance Procedure for Dynamic Document Collections." Information Processing & Management, 11, 11-21.
DEFAYS, D. 1977. "An Efficient Algorithm for a Complete Link Method." Computer Journal, 20, 364-66.
DIJKSTRA, E. W. 1976. "The Problem of the Shortest Subspanning Tree." A Discipline of
Programming. Englewood Cliffs, N.J.: Prentice Hall, 154-60.
DUBES, R., and A. K. JAIN. 1980. "Clustering Methodologies in Exploratory Data Analysis." Advances in Computers, 19, 113-227.
EL-HAMDOUCHI, A. 1987. The Use of Inter-Document Relationships in Information Retrieval. Ph.D. thesis, University of Sheffield.
EL-HAMDOUCHI, A., and P. WILLETT. 1987. "Techniques for the Measurement of Clustering Tendency in Document Retrieval Systems." J. Information Science, 13, 361-65.
EL-HAMDOUCHI, A., and P. WILLETT. 1989. "Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval." Computer Journal, 32, 220-27.
EVERITT, B. 1980. Cluster Analysis, 2nd ed. New York: Halsted.
GORDON, A. D. 1987. "A Review of Hierarchical Classification." J. Royal Statistical Society, Series A, 150(2), 119-37.
GOWER, J. C., and G. J. S. ROSS. 1969. "Minimum Spanning Trees and Single Linkage Cluster Analysis." Applied Statistics, 18, 54-64.
GRIFFITHS, A., L. A. ROBINSON, and P. WILLETT. 1984. "Hierarchic Agglomerative Clustering Methods for Automatic Document Classification." J. Documentation, 40, 175-205.
HARTIGAN, J. A. 1975. Clustering Algorithms. New York: Wiley.
HUDSON, H. C., and ASSOCIATES. 1983. Classifying Social Data: New Applications of Analytical Methods for Social Science Research. San Francisco: Jossey-Bass.
JAIN, A. K., and R. C. DUBES. 1988. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.
JARDINE, N., and R. SIBSON. 1971. Mathematical Taxonomy. London: Wiley.
JARDINE, N., and C. J. VAN RIJSBERGEN. 1971. "The Use of Hierarchic Clustering in Information Retrieval." Information Storage and Retrieval, 7, 217-40.
KAUFMAN, L. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
LANCE, G. N., and W. T. WILLIAMS. 1966. "A General Theory of Classificatory Sorting Strategies.
1. Hierarchical Systems." Computer Journal, 9, 373-80.
LEE, R. C. T. 1981. "Clustering Analysis and Its Applications." Advances in Information Systems Science, 8, 169-292.
LORR, M. 1983. Cluster Analysis for Social Scientists: Techniques for Analyzing and Simplifying Complex Blocks of Data. San Francisco: Jossey-Bass.
MILLIGAN, G. W., and M. C. COOPER. 1985. "An Examination of Procedures for Determining the Number of Clusters in a Data Set." Psychometrika, 50, 159-79.
MURTAGH, F. 1983. "A Survey of Recent Advances in Hierarchical Clustering Algorithms." Computer Journal, 26, 354-59.
MURTAGH, F. 1984. "Complexities of Hierarchic Clustering Algorithms: State of the Art." Computational Statistics Quarterly, 1, 101-13.
MURTAGH, F. 1985. Multidimensional Clustering Algorithms. Vienna: Physica-Verlag (COMPSTAT Lectures 4).
PERRY, S. A., and P. WILLETT. 1983. "A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems." J. Information Science, 6, 59-66.
RASMUSSEN, E. M., and P. WILLETT. 1990. Parallel Database Processing. London: Pitman (Research Monographs in Parallel and Distributed Computing).
ROHLF, F. J. 1982. "Single-Link Clustering Algorithms," in Handbook of Statistics, Vol. 2: Classification, Pattern Recognition, and Reduction of Dimensionality, eds. P. R. Krishnaiah and L. N. Kanal. Amsterdam: North-Holland, 267-84.
ROMESBURG, H. C. 1984. Cluster Analysis for Researchers. Belmont, Calif.: Lifetime Learning.
SALTON, G. 1971. The SMART Retrieval System. Englewood Cliffs, N.J.: Prentice Hall.
SALTON, G. 1989. Automatic Text Processing. Reading, Mass.: Addison-Wesley.
SALTON, G., and D. BERGMARK. 1981. "Parallel Computations in Information Retrieval." Lecture Notes in Computer Science, 111, 328-42.
SALTON, G., and C. BUCKLEY. 1988. "Term-Weighting Approaches in Automatic Text Retrieval." Information Processing & Management, 24(5), 513-23.
SIBSON, R. 1973. "SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method." Computer Journal, 16, 30-34.
SMALL, H., and E. SWEENEY. 1985. "Clustering the Science Citation Index Using Co-citations. I. A Comparison of Methods." Scientometrics, 7, 391-409.
SNEATH, P. H. A., and R. R. SOKAL. 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. San Francisco: W. H. Freeman.
SPATH, H. 1985. Cluster Dissection and Analysis. Chichester: Ellis Horwood.
VAN RIJSBERGEN, C. J. 1971. "An Algorithm for Information Structuring and Retrieval." Computer Journal, 14, 407-12.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval, 2nd ed. London: Butterworths.
VOORHEES, E. M. 1986a. The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval. Ph.D. thesis, Cornell University.
VOORHEES, E. M. 1986b. "Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval." Information Processing & Management, 22, 465-76.
WARD, J. H., JR. 1963. "Hierarchical Grouping to Optimize an Objective Function." J. American Statistical Association, 58(301), 235-44.
WARD, J. H., JR., and M. E. HOOK. 1963. "Application of an Hierarchical Grouping Procedure to a Problem of Grouping Profiles." Educational and Psychological Measurement, 23, 69-81.
WHITNEY, V. K. M. 1972. "Algorithm 422: Minimal Spanning Tree." Communications of the ACM, 15, 273-74.
WILLETT, P. 1980. "Document Clustering Using an Inverted File Approach." J. Information Science, 2, 223-31.
WILLETT, P. 1983. "Similarity Coefficients and Weighting Functions for Automatic Document Classification: An Empirical Comparison." International Classification, 10, 138-42.
WILLETT, P. 1988. "Recent Trends in Hierarchic Document Clustering: A Critical Review." Information Processing & Management, 24, 577-97.
CHAPTER 17: SPECIAL-PURPOSE HARDWARE FOR INFORMATION RETRIEVAL

Lee A. Hollaar, Department of Computer Science, University of Utah

Abstract

Many approaches for using special purpose hardware to enhance the performance of an information retrieval system have been proposed or implemented. Rather than using conventional digital computers or parallel processors developed for general applications, these special purpose processors have been designed specifically for particular facets of information retrieval, such as the rapid searching of unstructured data or manipulating surrogates or indices for the information in the database. After discussing the reasons why a special purpose processor may be desirable, three different techniques for implementing a text scanner are presented--associative memories, finite state machines, and cellular arrays. The problems of using a conventional microprocessor are discussed, along with possible hardware enhancements to improve its performance. System performance considerations, especially when the use of an index or other surrogate shifts the performance from text scanning to a "seek and search" mode, are covered. Finally, special purpose processors for handling surrogates and the use of optical processing are presented.

17.1 INTRODUCTION

The previous chapters have described techniques for implementing information retrieval systems on general purpose computing systems, including parallel machines designed to handle a variety of applications. To achieve satisfactory performance, most information retrieval systems use some form of index. However, there are times when an information retrieval system implemented using general purpose hardware may not be cost-effective or provide adequate performance. For example, for a database consisting of messages that must be available as soon as they are received, there is no time available to construct, update, and maintain an index. Instead, special purpose searching hardware can provide the desired solution.

Special purpose hardware development for information retrieval has generally fallen into two areas--systems for rapidly searching unstructured data and systems for manipulating surrogates or indices for the information in the database. While most special purpose systems have been implemented using conventional digital logic, recently the use of optical processing has been suggested. The availability of fast microprocessors may also lead to special purpose systems that resemble a conventional workstation, but have additional logic to enhance their processing of information: more efficient connection and use of peripheral devices such as disk drives, or operating system software optimized for the particular application.
17.1.1 Search Patterns

The design of a special purpose processor is heavily influenced by the types of operations that will be performed. In the case of text searching, these are the patterns and operators that are allowed when specifying a query. While regular expressions could be used to specify the search pattern, often an information retrieval system will allow only a specific set of operations based on the nature of searching text.

The most basic operation is performing an exact match search for a specified string of characters, which may represent either a single word or phrase. Proper handling of phrases may be more complex than simply searching for the words of the phrase each separated by a blank. Unless the text in the database has been carefully edited so that it fits a standard form, the words of the desired phrase may not be separated only by a single blank, but could be separated by a newline sequence (if the phrase started on one line and ended on another), more than one blank or some other space character (like a horizontal tab), a footnote reference, typographic markup (such as changing from roman to italic type), or other complications. Most special purpose text searchers can only handle the most basic phrase structures.

In addition to exactly matching a string of characters, it is often desirable to specify a more complex match pattern. The most common of these operations is the don't care or wildcard, which specifies that any character is acceptable in a particular location in the search pattern (single-character or fixed-length don't care, FLDC) or that an arbitrary number of characters are acceptable at the location (variable-length don't care, VLDC). The VLDC can either come at the start of a word (an initial VLDC, matching a pattern with any prefix), at the end of a word (a terminal VLDC, matching a pattern with any suffix), embedded within a word (an embedded VLDC), at both the start and end of a word (matching a specified stem), or any combination of these. Some text searchers cannot handle VLDCs, or can handle only initial or terminal VLDCs.

A regular expression can specify not only a specific character to match at a pattern location, but can indicate a set of characters (such as an upper- or lowercase A, any vowel, or a word delimiter) or a position such as the start of a line. It also can indicate that a pattern element can optionally occur or can occur more than once, or indicate two or more alternative patterns that are acceptable in a location. These options are not commonly available on hardware-based text searchers (Yu et al. 1986). One other special pattern may specify that a numeric string should occur at the pattern location, and that its value should be within a given range. Another pattern may specify an error tolerance, where the pattern is matched if there are fewer than n characters that don't match.

Search expressions often contain two or more patterns combined with Boolean operators, the most common being and, or, and and not. Word proximity, an extended form of and, indicates that two search terms need to be located within a specified number of words in the database for a match to occur. Proximity can be either directed (X must be followed within n words by Y) or undirected (R and S must be within n words of each other, in any order). Like phrases, word proximity is complicated by markup or other information embedded in the text.
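The wildcard semantics described above can be illustrated by translating them into regular expressions. This is a sketch under assumed notation ('?' for an FLDC, '*' for a VLDC confined to one word), not a notation prescribed by the chapter:

```python
import re

def wildcard_to_regex(pattern):
    # '?' = fixed-length don't care (any single word character)
    # '*' = variable-length don't care (any run of word characters,
    #       not extending across a word boundary)
    parts = []
    for ch in pattern:
        if ch == '?':
            parts.append(r'\w')       # FLDC
        elif ch == '*':
            parts.append(r'\w*')      # VLDC
        else:
            parts.append(re.escape(ch))
    return re.compile(r'\b' + ''.join(parts) + r'\b', re.IGNORECASE)
```

A terminal VLDC ('comput*') matches a stem with any suffix, an initial VLDC ('*ing') matches any prefix, and a VLDC at both ends ('*put*') matches a specified stem.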
A search expression also can specify that a term, phrase, or Boolean combination must occur in particular contexts for a match to occur. For example, a search may look for "hardware and text retrieval in abstract" or "patent and software within 2 sentences." Other operators may expand the number of terms in a query. A macro facility allows the user to include a predefined subexpression into a query, while a thesaurus can automatically form the or of all synonyms of a word. For example, "airplane" might expand to "airplane or plane or 747 or DC-10 . . . ."

17.1.2 Hardware Implementations

Many people have proposed implementations of information retrieval functions using custom hardware. However, only a small number of these have been incorporated into complete information retrieval systems. Many of these consider only one aspect of information retrieval, without considering its effect on the overall retrieval system and its performance. Simple searchers, capable of exact matching against a small number of patterns (possibly containing don't cares), have been particularly popular as a VLSI design project.

The algorithms implemented in hardware differ from the algorithms described in the previous chapters in two fundamental ways. First, most hardware solutions are not simply faster implementations of standard information retrieval algorithms: they are much simpler, but do many things in parallel or inherently in the logic. Many operations that would be cumbersome on a conventional processor are very simple to perform in custom hardware. As an example, the program to reverse the order of the bits of a word takes a number of steps of isolating, aligning, and replacing each bit of the word. In hardware, it simply takes the changing of the connections from straight-through to reversed; no active logic is required and the operation is done in zero time. If operations do not have to share hardware, they can be done simultaneously (for example, the shift-and-add being done in parallel with the decrementing of a counter during a multiplication).

Second, the algorithms are kept simple. The reason for the simplicity is that each alternative in an algorithm often requires its own logic or increases the complexity of the other hardware. This not only complicates the design, but can result in slower performance because of longer data paths and increased loading of output drivers. It is also easier to design a simple function that is then replicated many times than a large single unit incorporating a variety of functions.

17.2 TEXT SEARCHERS

The most commonly proposed special purpose hardware system for text information retrieval is a searcher. The rationale for needing a special search engine is that a conventional processor cannot search rapidly enough to give reasonable response time. For example, egrep on a 12 MIPS RISC processor (Sun SPARCstation-1) can search at about 250 KBytes per second. The searching of a 10 GByte database would take 40,000 seconds (over 11 hours). Because of the slow searching speeds for a central processor, most commercial information retrieval systems use some sort of inverted file index rather than actually searching the documents.
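The arithmetic behind the response-time figures quoted above is easy to check:

```python
def scan_time_seconds(db_bytes, rate_bytes_per_sec):
    # time for an exhaustive linear scan of the database
    return db_bytes / rate_bytes_per_sec

# 10 GByte database at egrep's roughly 250 KBytes per second
seconds = scan_time_seconds(10e9, 250e3)
hours = seconds / 3600
```

Ten gigabytes at 250 KBytes per second works out to 40,000 seconds, a little over 11 hours, which is why exhaustive scanning on a single conventional processor is uncompetitive.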
" The phrase "To be or not to be. most commercial information retrieval systems use some sort of inverted file index rather than actually searching the documents. 17. Because each search processor is operating only on the data on its disk. Time is spent trying to understand an irrelevant document. This may not be acceptable for a message retrieval system. it will be faster and more efficient not to bring the raw information from the disk into a central processor. configuring the search hardware (typically by loading appropriate tables and registers). where incoming documents must be available as soon as they have arrived. The index may take more disks than the database itself (Bird et al. For inverted files.000 seconds (over 11 hours). commonly occurring words are generally not included in the index. This means that a search for a phrase like "changing of the guard" really finds anything that has "changing. where the combining of the codes for two words makes it seem like another word is present when it is not.1 Searcher Structure Most proposed text searchers divide the work into three activities: search control.Information Retrieval: CHAPTER 17: SPECIAL-PURPOSE HARDWARE FOR INFORM (Sun SPARCstation-l) can search at about 250 KBytes per second. Also. but this may be less of a concern with lower-cost disk drives. This search control is best performed by a conventional microprocessor (or. and reporting back the results to the retrieval system." any two words. there are other problems associated with the use of a surrogate for locating documents.. information is discarded to reduce the size of the surrogate. 1978). that is the question" matches any document containing the word "question. the search time will remain the same as the database size increases and the additional disks (each with a searcher) necessary to hold the database increases the parallelism of the search. Searching for particular terms or patterns is performed by the special hardware. 
This means that to be competitive. The searching of a 10 GByte database would take 40. there are a number of problems associated with inverted files or other document surrogates besides their storage requirements. But even if this were not the case. the cost of searching must be comparable to the cost of a disk drive. A similar problem exists for superimposed code words. Time must be spent building and maintaining the surrogate. Search control includes taking a request from the processor running the retrieval system. perhaps more time than would be spent reviewing a relevant one. and "guard.2. in the past. but instead search it with a processor closer to the disk. These artifacts caused by the use of a surrogate tend to confuse a user because it is not clear why a document that doesn't match the query was retrieved. However. with the results passed file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. and query resolution. a document is not retrievable until it has been entered into the surrogate data structure. term matching." since all other words in the phrase are discarded. a nontrivial amount of time for very large databases.htm (4 of 16)7/3/2004 4:21:44 PM . because of the slow searching speeds for a central processor. If a large database resides on a number of disk drives. a minicomputer). managing the reading of the data from the disks. In many cases. The argument in the past against inverted files was that they substantially increase the storage requirements of a database.Books_Algorithms_Collection2ed/books/book5/chap17.. However.
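The stopword artifact described earlier in this section can be demonstrated with a toy sketch. The stoplist and the offset-based phrase matching scheme here are illustrative assumptions, not the mechanism of any particular commercial system:

```python
STOPLIST = {'of', 'the', 'to', 'be', 'or', 'not', 'that', 'is'}

def phrase_postings(phrase):
    # keep only indexed words, remembering their offsets in the phrase
    words = phrase.lower().split()
    return [(w, i) for i, w in enumerate(words) if w not in STOPLIST]

def phrase_matches(phrase, document):
    # an inverted-file "phrase" match: every indexed word must occur at
    # the right relative offset; stopword positions match any word at all
    doc = document.lower().split()
    kept = phrase_postings(phrase)
    if not kept:                      # every word was stopped out
        return len(doc) > 0
    word0, off0 = kept[0]
    for pos, w in enumerate(doc):
        if w != word0:
            continue
        base = pos - off0
        if all(0 <= base + off < len(doc) and doc[base + off] == word
               for word, off in kept):
            return True
    return False
```

"changing of the guard" degenerates to "changing", any two words, then "guard", so a document about changing into a guard uniform is retrieved even though the phrase never occurs.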
17.2.1 Searcher Structure

Most proposed text searchers divide the work into three activities: search control, term matching, and query resolution. Search control includes taking a request from the processor running the retrieval system, configuring the search hardware (typically by loading appropriate tables and registers), managing the reading of the data from the disks, and reporting back the results to the retrieval system. This search control is best performed by a conventional microprocessor (or, in the past, a minicomputer).

Searching for particular terms or patterns is performed by the special hardware, with the results passed to a query resolver that handles operations such as Booleans, contexts, and proximities. In effect, the term matcher converts a document into a list of entries that are a shorthand description of the document as it pertains to the query. The query resolver then combines the term matcher results, which indicate that particular terms have been detected, to determine if the pattern has been matched.

A good example of this is a query that specifies that five different terms must occur in the same paragraph, in any order. The regular expression for handling this is very complex, consisting of looking for the start of paragraph pattern, then all possible orderings of the five terms (120 sequences), then the end of the paragraph pattern. A better way is for the term matcher to look for the individual terms of the pattern. This considerably simplifies the search patterns. The query resolution program can then process this greatly reduced information according to whatever scheme is desired. This could be a Boolean query language enhanced with contexts and word location proximity, weighted term document scoring, or any other technique. Because this latter processing is required only when a term has been found, its realtime requirements are considerably less than that of comparing characters from the disk, and it can be handled by a conventional processor, perhaps augmented with simple hardware assists.

Three different techniques have been proposed for implementing the term matcher: associative memories, finite state machines, and cellular arrays (Hollaar 1979). As their performance improves, standard microprocessors can also be used for the term matcher.

If commodity disk drives are used to keep costs low, rather than special drives like those with parallel reads, the disk transfer and seek rates will determine the speed of the searcher. There are two possible options: match the term comparator to the nominal transfer rate of the disk (raw read bit rate), or match it to the effective transfer rate (average bit rate when seeks and rotational latency are considered). In either case, since the speed is based on how many characters come from the disk in a unit of time, the same performance will result. Matching to the nominal transfer rate was used for many searcher implementations, because of the high cost of memory for buffers at the time they were proposed. Matching to the effective transfer rate permits a lower comparator speed at the expense of buffer memory; how much lower depends on the seek characteristics for the typical or worst-case queries.
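The division of labor between term matcher and query resolver described in section 17.2.1 can be sketched in software. The hit-list format and the paragraph-break encoding here are illustrative assumptions; the point is that the resolver never needs the 120-ordering regular expression, only the reduced list of term detections:

```python
def term_matcher(text, terms):
    # hardware term matcher stand-in: emit (position, term) hit entries,
    # a shorthand description of the document as it pertains to the query
    hits = []
    for pos, word in enumerate(text.lower().split()):
        if word in terms:
            hits.append((pos, word))
    return hits

def same_paragraph(hits, para_breaks, terms):
    # query resolver: do all query terms occur within one paragraph?
    # para_breaks lists word positions where a new paragraph starts
    bounds = sorted(para_breaks) + [float('inf')]
    start = 0
    for end in bounds:
        seen = {t for p, t in hits if start <= p < end}
        if terms <= seen:
            return True
        start = end
    return False
```

Five terms in any order within one paragraph reduce to a simple set test over the hit list, exactly the kind of infrequent, low-rate work a conventional processor can handle.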
Associative Memories

In an associative memory, each memory word has a comparator. The bit pattern in the comparand register is sent to each of these comparators, which indicate whether it matches their memory word. The comparison can be done in one cycle regardless of the size of the memory, but the cost per bit is considerably higher than for conventional memory because of the logic required. (A dynamic RAM memory cell may need only a single transistor, while an associative memory cell has tens or hundreds of transistors, depending on whether features such as masking to selectively compare bits in a word are implemented.)

If the cost were ignored, an associative memory would seem ideal for finding the locations of a pattern within a database. It is not, however. First of all, the data must be broken down into fixed-length groups to match the width of the associative memory. This could be done on word delimiters, in which case the width of the associative memory would have to be the length of the longest word and much space would be wasted, or it could be done every n characters, but packing the associative memory this way is complicated by words that start in one memory location and end in another. Furthermore, words start in arbitrary positions within the width of the associative memory and will not always line up with the pattern. This can be solved by cycling the pattern through the width of the associative memory.

A more cost-effective approach is to store the search terms in the associative memory and shift the data through the comparand register. Essentially, this uses the associative memory as a set of comparators, each programmed with one of the search terms. This approach works well for fully specified terms, ones with fixed-length don't cares, or for initial or terminal VLDCs. Embedded VLDCs cannot be easily handled.

Besides the obvious parallel-comparator implementation for an associative memory, two schemes based on hardware hashing of the comparand have been suggested. Bird et al. (1979) proposed a hashing based on the length of the comparand word and its initial characters, while Burkowski (1982) used a small mapping RAM based on subsets of the search terms selected so that each has a unique bit pattern. While these approaches substantially reduce the cost of implementing the associative memory, they both suffer from the inability to handle embedded VLDCs or classes of characters (any delimiter, any letter, a number) in a pattern.

Finite State Machines

Finite state automata are capable of recognizing any regular expression. An FSA can be described as a 5-tuple {A, S, M, B, F}, where A is the input alphabet, S is a set of elements called states, M is a mapping from A x S into S, B is a member of S called the beginning state, and F is a nonempty subset of S called the final states.

The operation of the FSA is simple. At the start, it is in state B. Whenever a character arrives, the mapping M is used to determine the next state of the FSA. This process continues until a final state is reached, indicating the term that has been matched. Implementation is equally simple. A memory holds the mapping function M as an array of states. Each state contains an array with one entry for each character in the input alphabet, indicating the next state. A register holds the memory base address for the current state, and is concatenated with the input character to give the memory address to read to find the next state, which is then loaded into the register.

There is a problem with the straightforward FSA implementation--the size of memory required for the mapping function. For a query with 250 terms (as might occur when a thesaurus is used), about 2000 states would be necessary. For seven-bit input characters, this would require over 3 million bits of memory, and would cost hundreds of thousands of dollars. Although that does not seem high today, 15 years ago that was the same order of magnitude for the memory found on a mainframe computer.
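The table-driven FSA operation described above can be sketched in software. A dictionary stands in for the mapping memory (in hardware it would be a flat array indexed by state and character), and the restart-on-mismatch scanning loop is a simplification for illustration, not the chapter's design:

```python
def build_fsa(terms):
    # mapping M: (state, character) -> next state, built as a trie.
    # State 0 is the beginning state B; final states report their term.
    M, final = {}, {}
    next_state = 1
    for term in terms:
        state = 0
        for ch in term:
            if (state, ch) not in M:
                M[(state, ch)] = next_state
                next_state += 1
            state = M[(state, ch)]
        final[state] = term
    return M, final

def run_fsa(M, final, text):
    # try each starting position; a real matcher would instead use
    # failure transitions so each character is examined once
    found = set()
    for start in range(len(text)):
        state = 0
        for ch in text[start:]:
            if (state, ch) not in M:
                break
            state = M[(state, ch)]
            if state in final:
                found.add(final[state])
                break
    return found
```

For DOG and DOT the trie shares the D and O states, which is exactly the sharing that makes the full mapping array (one entry per state per alphabet character) so memory-hungry in a direct hardware realization.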
Two ways of reducing the memory requirements of the FSA, at the expense of additional logic, have been proposed. Bird observed that there are two different state transition types in the mapping memory: sequential and index (Mayper et al. 1980). Sequential states are used for matching single characters, and are stored sequentially in memory. Matching the character causes the state register to be advanced by one, and a mismatch causes it to go to a specified failure address. Only eight bits are necessary for each sequential state--seven for the match character and one to indicate it is a sequential state. In those places where there is more than one character of interest, the index state is used. It consists of a code indicating it is an index state and a bit vector indicating the interesting characters in the mapping. For a seven-bit input character, this requires a 128-bit vector. When an index state is processed, the first action is to determine whether the bit corresponding to the input character is set. If not, a transition is made to a default state. If the bit is set, the number of bits set for characters with a lower value is determined and used as an index added to the current state address to give the location of a word that contains the address of the next state.

There are a number of difficulties with the Bird approach. The first is the dramatic size difference between the sequential and index states (8 bits vs. over 128 bits). This is mitigated by having more than one sequential state in each memory word, although some parts of the word might end up unused, and by operating on nibbles (half bytes) rather than the full character for index states, at the expense of two index operations per character and a more complex state table. This reduces the vector to 16 bits and simplifies the hardware necessary to compute the index. Since there are far more sequential states than index states, the bit requirements are reduced; approximately 250 Kbits would be necessary for the 250-term query. Another difficulty is that when any search term has an initial VLDC, all states must be index states, substantially increasing the memory requirements. To solve this problem, Bird suggested having two matchers, one for terms that don't have initial VLDCs and a smaller one for those that have them. He also suggested a third FSA to handle phrases, rather than handling them as part of the query resolver.

The other way of reducing the FSA memory requirements was proposed by Haskin (1980) (further described in Haskin et al. [1983] and Hollaar et al. [1984]). It consists of dividing the FSA state table into a number of partitions that can then be processed in parallel. The partitions are made so that each state must check only a single character. If there is more than one character that could occur simultaneously, each must be in a different partition. This can be easily determined by an examination of all the terms when the state tables are being constructed. For example, if the terms DOG and DOT were specified, the first state would look for a D and transition to state 2 when it is found. That state would look for an O. Since the next state needs to look for both a G and a T, each must be in a different partition. A successful match of the O would cause a transition to state 3, which would look for a G, and would force another partition into a state looking for a T. A successful match not only specifies the next state in a partition, but can force another partition into a specified state.
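The DOG/DOT partitioning just described can be simulated directly; the state tables below are hand-encoded for this one example, and the mismatch handling is simplified (the partition restarts without re-examining the current character), so this is a sketch of the idea rather than the published design:

```python
# Each partition holds one active state checking a single character.
# Entry format: state -> (char, next_state, fork, reported_term),
# where fork = (partition, state) forces a neighbor partition's state.
TABLES = {
    0: {1: ('D', 2, None, None),      # looking for D
        2: ('O', 3, (1, 1), None),    # O seen: go look for G, fork partition 1
        3: ('G', 1, None, 'DOG')},    # G completes DOG; restart
    1: {0: (None, 0, None, None),     # idle
        1: ('T', 0, None, 'DOT')},    # T completes DOT; back to idle
}

def run_partitions(text):
    state = {0: 1, 1: 0}              # partition 0 active, partition 1 idle
    found = set()
    for ch in text:
        forks = []
        for p in TABLES:
            want, nxt, fork, report = TABLES[p][state[p]]
            if want is None:
                continue              # idle partition does nothing
            if ch == want:
                if report:
                    found.add(report)
                if fork:
                    forks.append(fork)
                state[p] = nxt
            else:
                state[p] = 1 if p == 0 else 0   # restart or go idle
        for p, s in forks:            # forks take effect on the next character
            state[p] = s
    return found
```

After the O is matched, partition 0 goes on to look for G while the fork puts partition 1 into the state looking for T, so both DOG and DOT are caught with each state checking only one character per cycle.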
A state can be programmed for an exact match. less than half the bits of the Bird FSA and a thirtieth of the conventional FSA requirements.and lowercase characters. a match ignoring the bit that differentiates between upper. or a match if the input character is of a particular class (numeric. Whenever the enable line is true.000 characters per second.. All forms of fixed. Of course. like delimiters. sets its output match line true. 1967). Moreover. One of the first cellular matchers was General Electric's GESCAN. Foster and Kung (1980) proposed a systolic pattern matcher. the cell is programmed with a single match character. Copeland (1978) proposed a similar cellular system as an extension to a database machine. FLDCs can be handled by a dummy cell that repeats its enable line delayed by one character time. and would force another partition into a state looking for a T. and if they are equal. In its most basic form. Cellular Arrays A cellular array uses a large number of very simple matchers connected in a string or as a more complex structure (Hollaar 1979). where both the pattern and the data being searched are moved through the cellular array. eliminating a heavily loaded data bus. based on a query programmed into 80 cells. developed in the mid-1960s (Crew et al. This fork operation consists of forcing the address of a neighbor matcher. any addition to a cell beyond an exact character match complicates its design and means fewer cells can be placed on an integrated circuit.. It was capable of searching data stored on special magnetic tape drives at 120. so for a reasonable number of cells the cost of the interconnections will exceed the cost of the cells.Books_Algorithms_Collection2ed/books/book5/chap17. approximately 20 character matchers of less than 5 Kbits each are necessary.htm (8 of 16)7/3/2004 4:21:44 PM . the number of gates needed for a general interconnection network for n cells is O(n2). the matchers are cycled only once for each input character. 
and VLDCs by a cell that sets its match output whenever its enable line is set and clears it when a word delimiter is found (assuming that VLDCs should not extend across a word). Mukhopadhyay (1979) proposed a complex cell and a variety of ways to interconnect the cells to handle regular expressions and much of the pattern matching capabilities of SNOBOL4.Information Retrieval: CHAPTER 17: SPECIAL-PURPOSE HARDWARE FOR INFORM for a G. allowing the use of slower memories than would be possible with the Bird FSA.and variable-length don't cares can be handled. Lee (1985) (also Lee and Lochovsky [1990]) proposed a cellular system as part of a larger hardware-implemented information retrieval system that placed blocks of the data into the cellular array and broadcast programming instructions based on the search pattern to all the cells. the cell compares the character on the bus to its match character. as well as an input that is connected to a bus containing the current input character. for a total memory requirement of under 100 Kbits. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. For the set of 250 terms. The cells can also be extended to handle matching of classes of characters. and to reduce the matcher requirements eliminating starts that would terminate after the initial character of the word. A startup mechanism is used to simplify the partitioning of the state table. it was capable of handling a single bounded-length pattern string. since the partitioning technique assures that the neighbor is in an idle state. This eliminates the need to broadcast the data to every cell. However. delimiter). The cell has an input enable line and an output match line. alphabetic.
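The behavior of a string of such cells can be sketched in software. The following model is purely illustrative (it corresponds to no particular machine's design): each cell holds one pattern character, its enable input is fed by the previous cell's match line, and every cell sees each broadcast character in the same cycle.

```python
# Software sketch of a cellular array matcher: a chain of single-character
# cells. Each cell raises its match line when its enable line was true on the
# previous character and the broadcast character equals its programmed one.

def cellular_search(pattern, text):
    """Return positions in `text` where `pattern` ends, using a cell chain."""
    cells = list(pattern)              # one cell per pattern character
    match = [False] * len(cells)       # match lines, one per cell
    hits = []
    for pos, ch in enumerate(text):    # broadcast one character per cycle
        prev = [True] + match[:-1]     # cell 0 is always enabled
        match = [en and (ch == c) for en, c in zip(prev, cells)]
        if match[-1]:                  # last cell's match line = full hit
            hits.append(pos)
    return hits

print(cellular_search("DOG", "THE DOG DODGED THE DOGMA"))   # [6, 21]
```

A dummy cell that delays or latches the enable line, as described above, would extend this sketch to fixed- and variable-length don't cares.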
Cellular arrays are used in two currently marketed search machines. A spin-off of General Electric, GESCAN International, has developed a new implementation of GESCAN that can be attached to a minicomputer or workstation at a fraction of the cost. Rather than use a special tape drive, data is read from the host disk system and transferred to the search logic; the host also acts as the query resolver. Each integrated circuit holds 8 cells, and it can search at 12.5 MBytes per second using two arrays, although the disk storing the data often limits its speed. The Fast Data Finder from TRW (Yu et al. 1986) uses a cell that can perform not only single character comparisons, but also includes support for ignoring misspellings and numeric comparisons. Each integrated circuit contains 16 cells, up to 9,216 cells can be included in a system, and it can operate at 5 MBytes per second.

Standard Microprocessors

It was clear a decade ago that available microprocessors could not keep up with the disk for a multiterm match, but the latest RISC-based processors are substantially faster than mainframe computers were only a few years ago. While their performance depends on the particular algorithm used for searching, looking at the inner loop for the various grep programs, egrep appears to have the tightest loop for complex multiterm searches. If we look at the inner loop for egrep, we find that it implements a finite state recognizer. On a Sun SPARC, an optimized version takes 10 instructions and 13 cycles. This means that the inner loop searching rate for a 25 MHz SPARCstation-1 is about 1.92 MBytes per second, approximately the nominal disk rate of 1.875 MBytes per second for a 15 MHz drive. A 40 MHz SPARC will run at over 3 MBytes per second. To achieve a high MIPS rate, the program for the microprocessor must be stored in a high-speed memory; for a 25 MHz machine, it must have a cycle time of less than 40 ns. This means the use of a cache memory to achieve better performance than would be available with dynamic RAM, increasing the cost and complexity of the microprocessor system.

But even as microprocessors become faster, they still may not be a good choice for handling the basic term matching operation. Hardware augmentation can be used to improve the search performance of the microprocessor, and it is not necessary for the line between the microprocessor and the special search hardware to be drawn between term matching and query resolution as an alternative to performing the entire search in the microprocessor. For example, a simple startup mechanism such as was used in the PFSA could be used to initially examine the input characters, with the processor only having to work when the start of a possible match has been found (Hollaar 1991). For a system handling 250 terms, to accommodate large searches generated by a thesaurus, about 2,000 states would be necessary. An FSA handling 2,000 states and 7-bit characters would take 12 256K RAMs and an 11-bit holding register, costing well under $50. The first memory is indexed by the current state and the input character and gives the next state; the second indicates whether a match has been found.
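The kind of table-driven recognizer just described is easy to sketch. The toy below, hand-built for the two terms DOG and DOT, is not egrep's actual code; the state numbering and the restricted alphabet are invented for illustration. The first table gives the next state for a (state, character) pair, and a second table marks the states where a match has been found.

```python
# Minimal table-driven finite state recognizer for the terms DOG and DOT.
FAIL, D, DO, DOG, DOT = 0, 1, 2, 3, 4   # state numbers (illustrative)
ALPHABET = "DOGT"

def build_table():
    # Default: 'D' restarts a possible match, everything else fails.
    table = [{c: (D if c == "D" else FAIL) for c in ALPHABET} for _ in range(5)]
    table[D]["O"] = DO        # D then O
    table[DO]["G"] = DOG      # DO then G -> term DOG
    table[DO]["T"] = DOT      # DO then T -> term DOT
    return table

is_match = {DOG: "DOG", DOT: "DOT"}   # second table: accepting states

def scan(text):
    table, state, hits = build_table(), FAIL, []
    for ch in text:                        # the whole inner loop: one lookup
        state = table[state].get(ch, FAIL)
        if state in is_match:              # match-found check
            hits.append(is_match[state])
    return hits

print(scan("DOGDOTDODOG"))   # ['DOG', 'DOT', 'DOG']
```

The per-character work is a single indexed lookup plus a flag test, which is why a tight assembly rendering of this loop runs in only a handful of instructions.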
As microprocessor cost performance continues to improve, microprocessors will play a larger role in the implementation of special purpose text searchers. These systems will differ from general purpose systems using the same microprocessor by having low-overhead control programs rather than general operating systems like UNIX, simplified interfaces connecting their peripheral devices, file systems optimized for text searching, and some custom logic to assist in the matching process.

17.2.2 Search Performance

Text searchers were proposed as a means of improving the response time over a conventional computer doing the search, without the problems associated with using a surrogate such as an inverted file. Using a special purpose searcher that operates at disk speed, the time necessary to complete a user's query is the time it takes to read all the data from a disk drive. While this obviously varies with the type of disk, for most low-cost, high-capacity disks it is about five minutes. This is far too long for an interactive system. One solution to this problem is to make the search go faster by using higher performance disks, but this substantially increases the system cost because of the low production volumes and resulting higher costs for such disks. The speedup is on the order of 10, while a factor of 100 or more is desirable for an interactive system. Furthermore, it complicates the search control by requiring that a number of different user queries must be combined in a batch to give reasonable performance.

A more reasonable solution is to combine the attributes of searching and using a surrogate to overcome the difficulties with each approach. An inexact surrogate can be used to eliminate documents that have no hope of matching a query. A fully inverted file, where every term is indexed with its location within a document, is not necessary: contexts, phrases, and word location proximities can be handled by the following search. Rather than the 50 percent to 300 percent overhead for a fully inverted file (Bird et al. 1978), a partially inverted file could have an overhead of less than 5 percent (Hollaar 1991). This is because only one entry is necessary in the list of documents containing a word no matter how many times the word occurs in the document, and information regarding the location of the word within the document is not stored. Superimposed codewords will provide a list of documents that is a superset of the documents containing the search terms; if a term doesn't really occur in a listed document, but is an artifact of the superimposed codeword scheme, it will be eliminated by the search. It is also not necessary to index a document before it is available: as long as the number of unindexed documents remains low relative to an average search, they can simply be added to the list of documents for every search, making them available as soon as the text is loaded.

The partially inverted surrogate also provides quick feedback to the user on whether the follow-up search should even be started. Based on information from the index and the user's intuition of what is in the database, it may indicate that too few documents (and possibly no documents, if a term that is not in the database is used) or too many documents would be searched. The search could then be canceled and the query refined.
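The division of labor described above, an inexact document-level index followed by a confirming search, can be illustrated with a small sketch. The documents, words, and index layout here are all made up.

```python
# Sketch of a partially inverted file: the index records only which documents
# contain each word (one entry per word per document, no positions); a
# follow-up scan of the few candidate documents resolves the actual phrase.

docs = {
    1: "the quick brown fox",
    2: "a brown dog and a quick cat",
    3: "slow green turtle",
}

# Build the document-level index.
index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(doc_id)

def search_phrase(phrase):
    words = phrase.split()
    # Index pass: a candidate must contain every word somewhere.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Search pass: scan only the candidates to confirm the phrase itself.
    return sorted(d for d in candidates if phrase in docs[d])

print(search_phrase("quick brown"))   # [1]: doc 2 has both words, not the phrase
```

Document 2 is a candidate after the index pass but is eliminated by the search, exactly the role the follow-up search plays for the inexact surrogate.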
When a surrogate is used to reduce the number of documents to be searched, the search goes from a scan to a seek and search mode. This changes the critical disk parameter from its transfer rate to its seek time relative to document transfer time. To see why this is so, consider an ESDI drive like the Maxtor XT8760, where the seek time and rotational latency to position to the start of a document is about 20 milliseconds. It takes another 25 msec to read 35,000 characters (the size of the average U.S. Patent, for example) off two tracks, for a total of 45 msec. The effective transfer rate is about 750 KBytes per second, 40 percent of the nominal transfer rate of 1.875 MBytes per second. A disk drive with the same seek characteristics, but which could read the data in zero time, would be only 2.25 times faster. The effective transfer rate will be less for smaller documents, since the seek time will remain the same but the amount of data read will be less: if we have to do an average seek before reading a 512 byte block, the effective transfer rate is only 148 KBytes per second, less than 8 percent of the nominal transfer rate. Just as parallel transfer drives are not particularly effective in a seek and search mode, optical drives with their high seek times are even more devastating to search performance. If the seek time in the small document example above were changed to a 150 ms positioning time (typical of today's optical disks), we will have an effective transfer rate of under 50 KBytes per second.

Substantial improvements can be made by using an appropriate file system. A randomly organized file system, where blocks are placed in any convenient free location on the disk and a pointer (either in the previous block or in a master block) is used to locate the next data block (such as used in most file systems), is convenient for a time sharing system, where files come and go. The use of a large block size improves this, at the expense of higher unusable disk capacity due to internal fragmentation. The use of a contiguous file system, where each document is stored in consecutive disk blocks, substantially improves disk performance: in one disk revolution after positioning, almost 50 blocks can be read, using our example ESDI disk drive. Since the documents are seldom removed (or expanded) after they are loaded into an archival text database, the problems of file reorganization and lost disk space are minimal.
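The seek-and-search arithmetic above is easy to reproduce. Using the figures from the text (20 ms to position, 25 ms to read a 35,000-character document, a 1.875 MBytes per second nominal rate), the effective rate works out to roughly 778 KBytes per second, which the text rounds to about 750, and eliminating the read time entirely would gain only a factor of 2.25.

```python
# The seek-and-search arithmetic, written out with the figures from the text.

def effective_rate(bytes_read, seek_s, transfer_s):
    """Effective transfer rate in bytes/second for one seek plus one read."""
    return bytes_read / (seek_s + transfer_s)

nominal = 1.875e6                        # bytes/second, example ESDI drive
patent = effective_rate(35_000, 0.020, 0.025)

print(round(patent / 1000))              # 778 KBytes/sec (text: about 750)
print(round(100 * patent / nominal))     # 41 percent of nominal (text: ~40)
print(round(0.045 / 0.020, 2))           # 2.25: speedup if reads were free
```

The same function with a 150 ms positioning time shows immediately why high-seek-time optical drives fare so badly in this mode.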
17.3 SURROGATE PROCESSORS

While text searching has been the most commonly proposed information retrieval operation to be implemented using special purpose hardware, a number of people have examined manipulating document surrogates using custom processors. These have included processors for the merging of the sorted lists of documents in an inverted file index and for searching the document signatures to find specified patterns.

17.3.1 List Merging

The index for an inverted file contains pointers to items within the database. These pointers are generally stored in sorted order to ease their combining to determine those documents that match a particular query. The basic merging operation produces a list of pointers that is the union (for an OR operation), intersection (for an AND), or Boolean difference (for an AND NOT). While this is a simple operation to program on a general purpose computer, the overhead associated with aligning data, processing auxiliary fields, and flow of control such as loop counters meant that the mainframe computers of the 1970s were saturated keeping up with disk reads.

Stellhorn (1977) proposed a system that took blocks of the sorted lists and merged them in a block-parallel, bit-serial fashion. Logic following the merge operation removed either duplicate (for an OR) or duplicate and single entries (for an AND). Hollaar (1978) developed a system based on a simple merge element (both serial and parallel versions were designed) connected by a network or arranged as a binary tree, which was programmed to match the merge operation specified by the query. Both of these systems were 10 to 100 times faster than a conventional computer of the time. However, as processors have become less expensive and much faster, special purpose hardware for list merging does not seem cost-effective today.

17.3.2 Signature Processors

A signature file, as discussed in a previous chapter, provides an alternative to storing a list of document pointers for each word in the database. A signature is developed for each document that attempts to capture the contents of the document. These signatures, which are considerably smaller than the actual documents, are then searched for potential matches. These matches can be directly presented to the user, or can be searched to eliminate false matches caused by information dropped when the signatures were created.

The most common signature is based on superimposed codewords. Keywords or word stems are selected from the document and hashed to produce an n-bit vector that has m bits set. These vectors are then ORed together to form the signature. Determining whether a document contains a particular word simply requires checking to see if the bits corresponding to the hash of that word are set in the signature. Note that superimposed codewords do not readily allow don't cares or context and proximity operations, and may produce "false drops" when the bit vectors representing two words are ORed and the result has all the bits of another word, not present in the document, set.

Ahuja and Roberts (1980) proposed storing the superimposed codeword signatures in an associative memory. To reduce the cost, rather than have comparators for every bit in the associative memory, they divided the signatures into blocks and stored each block in a memory module. During the associative search, a codeword is read from the memory and compared against the desired pattern, whether there was a match is saved, and the next codeword from the memory is examined. This word-serial, bit-parallel approach reduces the hardware requirements to one comparator (for as many bits as are in the codeword) per memory module, plus the associated control logic.
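A small software model makes the superimposed codeword mechanics concrete. The parameters below (64-bit codewords with 4 bits set) and the hashing scheme are arbitrary illustrative choices, not those of any system discussed here.

```python
# Superimposed codeword signatures: each word hashes to an n-bit vector with
# m bits set; a document signature is the OR of its word vectors, and a word
# is (possibly) present when all of its bits are set in the signature.
import hashlib

N_BITS, M_SET = 64, 4   # example values for n and m

def word_vector(word):
    """Deterministically hash a word to a codeword with M_SET bits set."""
    vec, counter = 0, 0
    while bin(vec).count("1") < M_SET:
        h = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        vec |= 1 << (h[0] % N_BITS)
        counter += 1
    return vec

def signature(words):
    sig = 0
    for w in words:
        sig |= word_vector(w)     # superimpose (OR) the codewords
    return sig

def maybe_contains(sig, word):
    v = word_vector(word)
    return sig & v == v           # all of the word's bits set: possible hit

sig = signature(["hardware", "retrieval", "optical"])
print(maybe_contains(sig, "hardware"))  # True: an indexed word is always found
print(maybe_contains(sig, "zebra"))     # normally False; True = a "false drop"
```

The one-sided error is visible in `maybe_contains`: an indexed word can never be missed, but an unindexed word whose bits happen to be covered by other words' bits produces exactly the false drop described above.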
Lee (1985) (also Lee and Lochovsky [1990]) observed that it is not necessary to check every bit position of the signatures, but only those that are set in the hash of the word of interest. Since the number of bits set in a hash is small compared to the number of bits in the codeword, this can substantially improve the search performance. Since the same bits in each codeword are being skipped, a word-parallel, bit-serial architecture is best.

17.4 OPTICAL PROCESSORS

The previously discussed special purpose systems were all based on conventional digital electronics. Spurred by work on laser-based compact disk technology and the use of fiber optics for high-speed data transmission, there has been renewed interest in employing optically based approaches in information retrieval. This has included improving the bandwidth of optical mass storage and optical searching of information.

17.4.1 Improved Optical Storage

As we previously mentioned, most optical storage systems now in use do not have either the data rate or access speed of magnetic disk memories. This is particularly true for compact disks, which were originally designed to handle the relatively low data rates needed for reproducing audio information. However, there are ways of substantially increasing the data rates of optical disks beyond the standard techniques of increasing the bit density or rotational speed of the disk.

It is difficult to build a magnetic read head that can read data from many adjacent, closely spaced tracks at the same time. In the case of a parallel-head magnetic drive, the relative timing of the bits of the word depends on the mechanical alignment of the different heads, which can change with temperature or movement during seeks. However, it is possible to have a laser beam that illuminates many adjacent tracks, with the information from each track falling on a different position of a photodetector array. This allows a byte or word of data to be read out with only minimal problems of alignment or skewed data.

Even more interesting is the use of a holographic memory, rather than a rotating optical disk. Berra et al. (1989) discusses a page hologram memory capable of holding 725 MBytes on a 6-inch square card, and volume holograms capable of holding up to 10^12 bits. This offers the potential of not only improved data rates, but also better storage densities and access times. Since access to information in the holographic memory does not require moving a read head assembly and waiting for a disk to rotate to the start of the data, access times are reduced and rotational latency is eliminated: the laser beam can be deflected to the proper position in under 100 msec and all the information on the page will be immediately available.

17.4.2 Optical Data Processing

Although it is possible to convert the optical signals from a laser disk or holographic memory to
electrical signals and process them using conventional digital electronics, one also can use optical means for comparing the information against a pattern without the need for conversion. Berra et al. (1989) discuss a number of these techniques, and present an optical full text search design (Mitkas et al. 1989).

A spatial light modulator (SLM), which is a one- or two-dimensional array that can alter some parameter of the optical signals passing through the array, such as intensity or phase, is placed in the light path. In its simplest form, the SLM is programmed to form a mask in which individual bits are either passed (P) or blocked (X). Optical comparisons are performed by using an exclusive-OR operation on dual rail logic (both true and complement of each bit is available). A 1 bit is stored as a TF pattern, and a 0 bit as FT. XP is used for comparing against a 1, PX for comparing against a 0, and XX can be used for a don't care in a particular bit position. If all pattern bits match the data bits, no light will be passed. For example, if the pattern is 10d1, where d is a don't care, the mask will be XPPXXXXP. For data of 1011 (dual rail form of TFFTTFTF) or 1001 (TFFTFTTF), no light will be passed, while for data 0011 (FTFTTFTF) the resulting light pattern will be FTFFFFFF, so light is passed and a mismatch is indicated. Photodetectors sense whether no light passes, indicating a match of a particular term. It only takes a lens and a single detector to determine if there is a match.

The optical text search machine contains a page composer, which generates optical patterns to be fed to the optical comparator. Its main element is a fast optical scanner that converts the sequential (and possibly byte-parallel) input from the optical disk to two-dimensional patterns consisting of shifting sequences of the bytes of text. The optical comparator masks these patterns with an array programmed with all the search terms. It is possible to do 100,000 parallel comparisons using an SLM in a single step. Other optical techniques can be employed to allow multiple input streams to be handled through the same SLM, further increasing its performance.

17.5 SUMMARY

A number of special-purpose hardware systems have been proposed to provide more cost-effective or faster performance for information retrieval systems. Most of these have been text scanners, and several of these have been used in special applications. While high-speed microprocessors and general purpose parallel machines can now provide much of the performance that previously required special purpose logic (such as for index merging), it still appears that hardware augmentation to those processors can improve the performance of a large system at a small cost. Special purpose hardware for information retrieval will continue to be an interesting research and development area.

REFERENCES
AHUJA, S. R., & C. S. ROBERTS. 1980. "An Associative/Parallel Processor for Partial Match Retrieval Using Superimposed Codes." Proceedings of the 7th Annual Symposium on Computer Architecture, May 6-8, 1980 (published as SIGARCH Newsletter, vol. 8, no. 3), pp. 218-27.

BERRA, P. B., A. GHAFOOR, M. GUIZANI, S. J. MARCINKOWSKI, & P. A. MITKAS. 1989. "The Impact of Optics on Data and Knowledge Base Systems." IEEE Transactions on Knowledge and Data Engineering, 1(1), 111-32.

BIRD, R. M., J. B. NEWSBAUM, & J. L. TREFFTZS. 1978. "Text File Inversion: An Evaluation." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol. 7, no. 2; and SIGMOD vol. 10, no. 1).

BIRD, R. M., & J. C. TU. 1979. "Associative Crosspoint Processor System." U.S. Patent 4,152,762, May 1, 1979.

BURKOWSKI, F. J. 1982. "A Hardware Hashing Scheme in the Design of a Multiterm String Comparator." IEEE Transactions on Computers, C-31(9), 825-34.

COPELAND, G. P. 1978. "String Storage and Searching for Data Base Applications: Implementation of the INDY Backend Kernel." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol. 7, no. 2; and SIGMOD vol. 10, no. 1), pp. 8-17.

CREW, B. L., & M. N. GUNZBURG. 1967. "Information Storage and Retrieval System." U.S. Patent 3,358,270, December 12, 1967.

FOSTER, M. J., & H. T. KUNG. 1980. "Design of Special-Purpose VLSI Chips: Examples and Opinions." Proceedings of the 7th Annual Symposium on Computer Architecture, May 6-8, 1980 (published as SIGARCH Newsletter, vol. 8, no. 3), pp. 300-07.

HASKIN, R. L. 1980. "Hardware for Searching Very Large Text Databases." Ph.D. Thesis, University of Illinois at Urbana-Champaign.

HASKIN, R. L., & L. A. HOLLAAR. 1983. "Operational Characteristics of a Hardware-based Pattern Matcher." ACM Transactions on Database Systems, 8(1).

HOLLAAR, L. A. 1978. "Specialized Merge Processor Networks for Combining Sorted Lists." ACM Transactions on Database Systems, 3(3), 272-84.

HOLLAAR, L. A. 1979. "Text Retrieval Computers." Computer, 12(3), 40-52.

HOLLAAR, L. A. 1991. "Special-Purpose Hardware for Text Searching: Past Experience, Future Potential." Information Processing & Management, 27(4), 371-78.

HOLLAAR, L. A., & R. L. HASKIN. 1984. "Method and System for Matching Encoded Characters." U.S. Patent 4,450,520, May 22, 1984.

LEE, D. L. 1985. "The Design and Evaluation of a Text-Retrieval Machine for Large Databases." Ph.D. Thesis, University of Toronto.

LEE, D. L., & F. H. LOCHOVSKY. 1990. "HYTREM--A Hybrid Text-Retrieval Machine for Large Databases." IEEE Transactions on Computers, 39(1), 111-23.

MAYPER, V., Jr., A. NAGY, & A. MICHAELS. 1980. "Finite State Automaton with Multiple State Types." U.S. Patent 4,241,402, December 23, 1980.

MITKAS, P. A., P. B. BERRA, & P. S. GUILFOYLE. 1989. "An Optical System for Full Text Search." Proceedings of SIGIR 89.

MUKHOPADHYAY, A. 1979. "Hardware Algorithms for Nonnumeric Computation." IEEE Transactions on Computers, C-28(6), 384-89.

ROBERTS, D. C. 1978. "A Specialized Computer Architecture for Text Retrieval." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol. 7, no. 2; and SIGMOD vol. 10, no. 1), pp. 51-59.

STELLHORN, W. H. 1977. "An Inverted File Processor for Information Retrieval." IEEE Transactions on Computers, C-26(12), 1258-67.

YU, K.-I., S.-P. HSU, R. E. HEISS, Jr., & L. Z. HASIUK. 1986. "Pipelined for Speed: The Fast Data Finder System." Quest, Technology at TRW, 9(2), Winter 1986/1987, 4-19.
CHAPTER 18: PARALLEL INFORMATION RETRIEVAL ALGORITHMS

Craig Stanfill

Thinking Machines Corporation, 245 First Street, Cambridge, Massachusetts

Abstract

Data parallel computers, such as the Connection Machine CM-2, can provide interactive access to text databases containing tens, hundreds, or even thousands of Gigabytes of data. In this chapter we will examine the data structure and algorithmic issues involved in harnessing this power.

18.1 INTRODUCTION

The time required to search a text database is, in the limit, proportional to the size of the database. It stands to reason, then, that as databases grow they will eventually become so large that interactive response is no longer possible using conventional (serial) machines. At this point, we must either accept longer response times or employ a faster form of computer. Parallel computers are attractive in this respect: parallel machines exist which are up to four orders of magnitude faster than typical serial machines.

In this chapter we are concerned with vector-model document ranking systems. In such systems, both documents and queries are modeled as vectors. At a superficial level, retrieval consists of (1) computing the dot-product of the query vector with every document vector, and (2) determining which documents have the highest scores. In practice, both document and query vectors are extremely sparse; coping with this sparseness lies at the heart of the design of practical scoring algorithms.

The organization of the chapter is as follows. First, we will describe the notation used to describe parallel algorithms, and present a timing model for one parallel computer, the Connection Machine® System model CM-2.¹ Second, we will define a model of the retrieval task that will allow us to derive performance estimates. Third, we will look at algorithms for ranking documents once they have been scored.² Fourth, we will consider one database representation, called parallel signature files, which represents the database as a set of signatures. Fifth, we will consider two different file structures based on inverted indexes, parallel inverted files and partitioned posting files. Sixth, we will briefly look at issues relating to secondary storage and I/O. Finally, we will summarize the results and delineate areas for continued research, including secondary/tertiary storage and data compression.

This chapter thus starts by presenting a brief overview of data parallel computing, a performance model of the CM-2, and a model of the workload involved in searching text databases. The remainder of the chapter discusses various algorithms used in information retrieval and gives performance estimates based on the data and processing models just presented. First, three algorithms are introduced for determining the N highest scores in a list of M scored documents. Next, the parallel signature file representation is described; two document scoring algorithms are fully described, and a sketch of a boolean query algorithm is also presented. The discussion of signatures concludes with consideration of false hit rates, and the circumstances under which signatures should be considered. The final major section discusses inverted file methods. Two methods, parallel inverted files and partitioned posting files, are considered in detail. Finally, issues relating to secondary storage are briefly considered.
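The scoring step can be shown with a small serial sketch; the parallel algorithms in this chapter exist because doing this naively for every document in a large database is too slow. The term weights below are invented for illustration.

```python
# Vector-model scoring in serial form: documents and queries are sparse
# term-weight vectors, and a document's score is the dot product of its
# vector with the query vector.

def dot(query, doc):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)

docs = {
    "d1": {"parallel": 0.8, "retrieval": 0.5},
    "d2": {"parallel": 0.2, "signature": 0.9},
    "d3": {"storage": 0.7},
}
query = {"parallel": 1.0, "retrieval": 0.6}

scores = {d: dot(query, v) for d, v in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)   # ['d1', 'd2', 'd3']: highest-scoring documents first
```

Note that the sparse representation means only terms shared by the query and a document contribute any work; the design question the chapter takes up is how to exploit that sparseness on thousands of processors at once.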
18.2 DATA PARALLEL COMPUTING

The algorithms presented in this chapter utilize the data parallel computing model proposed by Hillis and Steele (1986). In this model there is a single program that controls a large number of processing elements, each of which will be performing the same operation at any moment. Parallelism is expressed by the creation of parallel data structures and the invocation of parallel operators. The model is made manifest in data parallel programming languages, such as the C* language developed by Thinking Machines Corporation (1990). More details on the architecture of this machine are provided by Hillis (1985) and Thinking Machines Corporation (1987).

The body of this section presents some basic data structures, operators, and notations which will be required to understand the algorithms presented in the remainder of the chapter. The section will conclude with a concise performance model for one implementation of the data parallel model (the Connection Machine model CM-2).

18.2.1 Shapes and Local Computation

C* includes all the usual C data structures and operations; these are called scalar variables and scalar operations. A shape may be thought of as an array of processors; each element of a shape is referred to as a position. A parallel variable is defined by a base type and a shape. It can be thought of as a vector having one element per position in the shape, and can store one value at each of its positions. Thus, if P is defined in a shape with 8 positions, it will have storage for 8 different values.

Data types are implicit in the algorithmic notation presented below: parallel integer-valued variables are prefixed with P_ (for example, P_x), parallel Boolean-valued variables are prefixed with B_, scalar variables are lowercase, and scalar constants are all uppercase (e.g., N_PROCS). Other aspects of data type, such as underlying structure and array declarations, will be left implicit and may be deduced by reading the accompanying text.

When we display the contents of memory in the course of describing data structures and algorithms, variables will run from top to bottom and positions will run from left to right. Thus, if we have two variables, P_1 and P_2, we might display them as follows:

    P_1    8   6   3   4   9   1   2   0
    -------------------------------------
    P_2    7  14   8  29  17  34   1   9

Individual values of a parallel variable are obtained by left indexing: element 4 of P_1 is referenced as [4]P_1, and has a value of 9. All indexing is zero-based.

It is also possible to have parallel arrays. For example, P_array might be a parallel array of length 3:

    P_array[0]    4  38  17  87  30  38  90  81
    --------------------------------------------
    P_array[1]   37   3  56  39  89  10  10  38
    --------------------------------------------
    P_array[2]    1  83  79  85  13  87  38  61

Array subscripting (right indexing) is done as usual: element 1 of P_array would be referred to as P_array[1], and would be a parallel integer having 8 positions. Left and right indexing may be combined, so that the 4th position of the 0th element of P_array would be referred to as [4]P_array[0], and would have the value 30.

Each scalar arithmetic operator (+, *, etc.) has a vector counterpart that is applied elementwise to its operands. For example, the following line of code multiplies each element of P_x by the corresponding element of P_y, then stores the result in P_z:

    P_z = P_x * P_y;

This might result in the following data store:

    P_x    1   1   1   2   2   2   3   3
    P_y    1   2   3   1   2   3   1   2
    -------------------------------------
    P_z    1   2   3   2   4   6   3   6

At any moment, a given shape has a set of active positions, and parallel operations, such as arithmetic and assignment, take effect only at active positions. All positions are initially active. The set of active positions may be altered by using the where statement, which is a parallel analogue to the scalar if statement. The where statement first evaluates a test. The body will then be executed with the active set restricted to those positions where the test returned nonzero results. The else clause, if present, will then be executed wherever the test returned 0. There is a globally defined parallel variable, P_position, which contains 0 in position 0, 1 in position 1, and so on. For example, the following computes the smaller of two numbers:

    where (P_1 < P_2)
        P_min = P_1;
    else
        P_min = P_2;
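The elementwise operations and the where statement can be modeled in ordinary Python, treating a parallel variable as a list with one value per position and the active set as an explicit Boolean mask (a sketch of the semantics only; the helper names are ours, not C*'s):

```python
# A pure-Python model of C* parallel variables: one list entry per
# position; `where` becomes an explicit mask over active positions.

def elementwise(op, p1, p2):
    """Apply a scalar operator at every position, as in P_z = P_x * P_y."""
    return [op(a, b) for a, b in zip(p1, p2)]

def where_select(mask, p_then, p_else):
    """Model `where (test) P_min = P_1; else P_min = P_2;`."""
    return [t if m else e for m, t, e in zip(mask, p_then, p_else)]

P_1 = [8, 6, 3, 4, 9, 1, 2, 0]
P_2 = [7, 14, 8, 29, 17, 34, 1, 9]

P_z = elementwise(lambda a, b: a * b, P_1, P_2)   # vector counterpart of *
mask = [a < b for a, b in zip(P_1, P_2)]          # the where test
P_min = where_select(mask, P_1, P_2)              # smaller of the two
```

Every position computes independently, which is what makes these operations purely local on a real machine.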
18.2.2 Nonlocal Operations

Everything mentioned up to this point involves the simple extension of scalar data structures and operations to vector data structures and elementwise vector operations. We will now consider some basic operations that involve operating on data spread across multiple positions; it is useful to think of each position in a shape as corresponding to a processor, and these operations are collectively referred to as nonlocal operations.

The simplest of these operations are the global reduction operations. These operations compute cumulative sums, cumulative minima/maxima, and cumulative bitwise AND/ORs across all active positions in a shape. The following unary operators are used to stand for the global reduction operators:

    +=     Cumulative sum
    &=     Cumulative bitwise AND
    |=     Cumulative bitwise OR
    >?=    Cumulative maximum
    <?=    Cumulative minimum

Suppose, for example, we wish to compute the arithmetic mean of P_x. This may be done by computing the cumulative sum of P_x and dividing it by the number of active positions. This second quantity can be computed by finding the cumulative sum (over all active positions) of 1:

    mean = (+= P_x) / (+= 1);

Parallel left-indexing may be used to send data from one position to another. When the system sees an expression such as [P_i]P_y = P_x, each position will send its value of P_x to position P_i, and store it in P_y. For example, one might see the following:

    P_x    5   0   6   4   1   7   3   2
    P_i    7   4   1   2   5   0   6   3
    -------------------------------------
    P_y    7   6   4   2   0   1   3   5
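Both kinds of nonlocal operation are easy to mimic sequentially. The sketch below models the global-sum reduction and the left-indexed send, using the example data above (our own helper names):

```python
# Global reductions and parallel left-index sends, modeled in Python.

def send(n_positions, p_i, p_x):
    """Model [P_i]P_y = P_x: position i sends its P_x value to position P_i[i]."""
    p_y = [None] * n_positions
    for i, dest in enumerate(p_i):
        p_y[dest] = p_x[i]
    return p_y

P_x = [5, 0, 6, 4, 1, 7, 3, 2]
P_i = [7, 4, 1, 2, 5, 0, 6, 3]
P_y = send(8, P_i, P_x)

# mean = (+= P_x) / (+= 1): cumulative sum divided by the count of
# active positions (here, every position is active).
mean = sum(P_x) / sum(1 for _ in P_x)
```

Note that the send is a pure permutation here because every destination in P_i is distinct; collisions are the subject of the next paragraphs.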
In the event that multiple positions are sending data to the same destination, conflicts may be resolved by arbitrarily choosing one value, by adding the values, by choosing the largest/smallest value, or by taking the bitwise AND/OR of the values. These different methods of resolving collisions are specified by using one of the following binary operators:3

    =      Send with overwrite (arbitrary choice)
    +=     Send with add
    &=     Send with bitwise AND
    |=     Send with bitwise OR
    <?=    Send with minimum
    >?=    Send with maximum

These are binary forms of the global reduce operations introduced above.

The final group of nonlocal operations to be considered here are called scan operations, and are used to compute running sums, running maxima/minima, and running bitwise AND/ORs. In its simplest form, scan_with_add will take a parallel variable and return a value which, at a given position, is the cumulative sum of all positions to its left, including itself. For example:

    P_x                  2   0   1   2   4   3   2   1
    scan_with_add(P_x)   2   2   3   5   9  12  14  15

Optionally, a Boolean flag (called a segment flag) may be supplied. Wherever this flag is equal to 1, it causes the running total to be reset to 0. For example:

    P_x                    2   0   1   2   4   3   2   1
    B_s                    1   0   0   0   1   0   1   0
    add_scan(P_x, B_s)     2   2   3   5   4   7   2   3

3These are referred to as the send-reduce operators.

18.2.3 Performance Model

We will now consider the performance of one parallel computer, the Connection Machine model CM-2. In scalar C, the various primitive operators such as + and = have fairly uniform time requirements. On parallel computers, however, different operators may have vastly differing time requirements. For example, adding two parallel variables is a purely local operation, and is very fast.
Parallel left-indexing, on the other hand, involves moving data from one processor to another, and is two orders of magnitude slower. In addition, any realization of this model must take into account the fact that a given machine has a finite number of processing elements and, if a shape becomes large enough, several positions will map to the same physical processor. Each processor must then do the work of several and, as a first approximation, a linear increase in running time will be observed. We call the ratio of the number of positions in a shape to the number of physical processors the virtual processing ratio (VP ratio). The following symbols will be used:

    Nprocs    The number of physical processors
    r         The VP ratio

We express the time required for each operator by an equation of the form c1 + c2r. For example, if an operator takes time 2 + 10r, then it will take 12 microseconds at a VP ratio of 1, 22 microseconds at a VP ratio of 2, and so forth.4 For convenience, we will also include the time required for each operator at a VP ratio of 1. On the CM-2, the time required for scalar operations is generally insignificant and will be ignored.

To arrive at a time estimate for an algorithm, we first create an algorithm skeleton in which all purely scalar operations except for looping are eliminated. Loop constructs will be replaced by a simple notation of the form loop (count), and all return statements will be deleted. Assignment statements which might reasonably be eliminated by a compiler will be suppressed. Because the cost of scalar right-indexing is zero, all instances of P[S] will be replaced with P. Finally, all identifiers are replaced with P for parallel integers, B for parallel Booleans, and S for scalars. From this skeleton, the number of times each parallel operator is called may be determined, the time requirements looked up in a table provided at the end of this section, and an estimate constructed.

For example, suppose we have a parallel array of length N and wish to find the sum of its elements across all positions. The algorithm for this is as follows:

    sum_array(P_array)
    {
        P_result = 0;
        for (i = 0; i < N; i++)
            P_result += P_array[i];
        return (+= P_result);
    }
.Books_Algorithms_Collection2ed/books/book5/chap18. Time 3 + 3r 3 + 3r 3 + 3r 8 + 2r 16 16 3 + 15r r = 1 6 6 6 10 16 16 18 Comments Operator B = S B $= B B $$ B where(B) S = [S]P [S]P = S P = S file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo.4 4Throughout this chapter.Information Retrieval: CHAPTER 18: PARALLEL INFORMATION RETRIEVAL ALGO This has the skeleton: P = S loop (N) P += P (+= P) It requires time: Operation Calls Time per Call -----------------------------------------P = S P += P (+= P) 1 n 1 3 + 15r 3 + 28r 137 + 70r -----------------------------------------Total (140 + 3n) + (85 + 28n)r The following timing equations characterize the performance of the CM-2. all times are in microseconds unless noted otherwise.htm (7 of 40)7/3/2004 4:21:53 PM ..
it requires essentially zero time.Books_Algorithms_Collection2ed/books/book5/chap18. It should be noted that.htm (8 of 40)7/3/2004 4:21:53 PM . Same time for <= etc. while scoring is probably the more interesting part of the retrieval process. and (2) determining which documents received the highest scores.3 A MODEL OF THE RETRIEVAL TASK 18.. ranking may be a large portion of the overall compute cost. Same time for += etc.. requires an array of size We will assume there is a fast method for converting a parallel variable at a VP ratio of r to a parallel array having Nrows cells per processor. Such a function is. assuming a VP ratio of one is used. and ranking algorithms are as deserving of careful design as are scoring algorithms.Information Retrieval: CHAPTER 18: PARALLEL INFORMATION RETRIEVAL ALGO P += S P += P P[P] = P P = P[P] P == S (>?= P) scan_with_or scan_with_add [P]P = P [P]P[P] = P 3 + 28r 3 + 28r 11 + 60r 11 + 60r 18 + 67r 137 + 70r 632 + 56r 740 + 170r 2159r 2159r 31 31 71 71 85 207 688 910 2159 2159 Same time for += etc. The former case involves a VP ratio The second case.4 RANKING Retrieval consists of (1) scoring documents. in fact. This second step--ranking--will be considered first. identify the Nret highest-ranking examples. or with an array of Nrows scores per position. The problem may be stated as follows: given a set of Ndocs integers (scores). provided on the Connection Machine. file:///C|/E%20Drive%20Data/My%20Books/Algorithm/DrDo. Any of the ranking algorithms discussed below may be used in combination with any of the scoring algorithms which will be discussed later. 18. Same time for += etc. The scores may be stored in one of two formats: with either one score per position.
18.4.1 System-Supplied Rank Functions

Many parallel computing systems provide a parallel ranking routine which, given a parallel integer, returns 0 for the largest integer, 1 for the next largest, and so forth.5 This may be used to solve the problem quite directly: one finds the rank of every score, then sends the score and document identifier to the position indexed by its rank. The first Nret values are then read out. For example:

    P_score       83  98   1  38  78  37  17  55
    rank           1   0   7   4   2   5   6   3
    ---------------------------------------------
    After send    98  83  78  55  38  37  17   1

The algorithm is as follows:

    rank_system(dest, P_doc_score, P_doc_id)
    {
        P_rank = rank(P_doc_score);
        [P_rank]P_doc_score = P_doc_score;
        [P_rank]P_doc_id = P_doc_id;
        for (i = 0; i < N_RET; i++) {
            dest[i].score = [i]P_doc_score;
            dest[i].id = [i]P_doc_id;
        }
    }

5On the CM-2 this routine takes time 30004r.

This has the skeleton:
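The rank-and-send scheme is easy to mimic in plain Python (a sketch; rank() here is our stand-in for the system routine, breaking ties by position):

```python
# A stand-in for the system rank routine: 0 for the largest value,
# 1 for the next largest, and so on.

def rank(p_score):
    order = sorted(range(len(p_score)), key=lambda i: (-p_score[i], i))
    p_rank = [0] * len(p_score)
    for r, i in enumerate(order):
        p_rank[i] = r
    return p_rank

P_score = [83, 98, 1, 38, 78, 37, 17, 55]
P_rank = rank(P_score)

# [P_rank]P_doc_score = P_doc_score: send each score to the position
# indexed by its rank; the first N_RET positions then hold the winners.
after_send = [None] * len(P_score)
for i, r in enumerate(P_rank):
    after_send[r] = P_score[i]
```

The whole database is sorted even though only the first N_RET positions are read back, which is the inefficiency the next algorithm removes.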
    P = rank()
    [P]P = P
    [P]P = P
    loop (N_RET) {
        S = [S]P
        S = [S]P
    }

Its timing characteristics are:

    Operation    Calls    Time per Call
    ------------------------------------
    rank()       1        30004r
    [P]P = P     2        2159r
    S = [S]P     2Nret    16
    ------------------------------------
    Total                 32Nret + 34322r

Substituting the VP ratio r = Ndocs/Nprocs gives a time of 32Nret + 34322(Ndocs/Nprocs).

18.4.2 Iterative Extraction

The system-supplied ranking function does much more work than is really required: it ranks all Ndocs scores, rather than the Nret which are ultimately used. Since we usually have Nret << Ndocs, we may look for an algorithm that avoids this unnecessary work. The algorithm that follows, called iterative extraction, accomplishes this by use of the global-maximum (>?=) operation.
The insight is as follows: if we were only interested in the highest-ranking document, we could determine it by direct application of the global maximum operation. Having done this, we could remove that document from further consideration and repeat the operation. For example, we might start with:

    P_score    83  98   1  38  78  37  17  55

We find that the largest score is 98, located at position 1. That score can be eliminated from further consideration by setting it to -1:

    P_score    83  -1   1  38  78  37  17  55

On the next iteration, 83 will be the highest-ranking score, and so on.

The algorithm is as follows:

    rank_iterative(dest, P_doc_score, P_doc_id)
    {
        for (i = 0; i < N_RET; i++) {
            best_score = (>?= P_doc_score);
            where (P_doc_score == best_score) {
                position = (<?= P_position);
                dest[i].score = [position]P_doc_score;
                dest[i].id = [position]P_doc_id;
                [position]P_doc_score = -1;
            }
        }
    }

This has the skeleton:

    loop (N_RET)
    {
        S = (>?= P)
        where (P == S) {
            S = (<?= P)
            S = [S]P
            S = [S]P
            [S]P = S
        }
    }

And the timing:

    Operation    Calls    Time per Call
    ------------------------------------
    (>?= P)      2Nret    137 + 70r
    P == S       Nret     18 + 67r
    where        Nret     8 + 2r
    S = [S]P     2Nret    16
    [S]P = S     Nret     16
    ------------------------------------
    Total                 348Nret + 209Nret*r

Substituting the VP ratio r = Ndocs/Nprocs gives a time of 348Nret + 209Nret(Ndocs/Nprocs).
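Iterative extraction is straightforward to model sequentially (a sketch; ties at the maximum are broken by the smallest position, as the (<?= P_position) step does):

```python
# Iterative extraction: pull out the global maximum N_RET times,
# erasing each winner with -1 so it is not considered again.

def rank_iterative(p_doc_score, p_doc_id, n_ret):
    scores = list(p_doc_score)
    dest = []
    for _ in range(n_ret):
        best_score = max(scores)             # (>?= P_doc_score)
        position = scores.index(best_score)  # smallest position among ties
        dest.append((best_score, p_doc_id[position]))
        scores[position] = -1                # remove from consideration
    return dest

P_doc_score = [83, 98, 1, 38, 78, 37, 17, 55]
P_doc_id = [0, 1, 2, 3, 4, 5, 6, 7]
top3 = rank_iterative(P_doc_score, P_doc_id, 3)
```

On the real machine each iteration costs two global reductions over the full shape, which is why the per-result cost still grows with the VP ratio.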
18.4.3 Hutchinson's Algorithm

The following algorithm, due to Jim Hutchinson (1988), improves on iterative extraction. Hutchinson's algorithm starts with an array of scores stored in each position. For example, with 32 documents, 8 processors, a VP ratio of 1, and 4 rows, we might have the following data:

    P_scores[0]   88  16  87  10  94   4  21  11
    P_scores[1]   90  17  83  30  37  39  42  17
    P_scores[2]   48  43  10  62   4  12  10   9
    P_scores[3]   83  98   1  38  78  37  17  55

We start by extracting the largest score in each row, placing the results in a parallel variable called P_best:

    P_scores[0]   88  16  87  10  -1   4  21  11
    P_scores[1]   -1  17  83  30  37  39  42  17
    P_scores[2]   48  43  10  -1   4  12  10   9
    P_scores[3]   83  -1   1  38  78  37  17  55
    ---------------------------------------------
    P_best        94  90  62  98  -1  -1  -1  -1

We then extract the best of the best (in this case 98):

    P_best        94  90  62  -1  -1  -1  -1  -1

and replenish it from the appropriate row (3 in this case):

    P_scores[0]   88  16  87  10  -1   4  21  11
    P_scores[1]   -1  17  83  30  37  39  42  17
    P_scores[2]   48  43  10  -1   4  12  10   9
    P_scores[3]   -1  -1   1  38  78  37  17  55
    ---------------------------------------------
    P_best        94  90  62  83  -1  -1  -1  -1

This is repeated Nret times. The algorithm involves two pieces. First is the basic extraction step:

    extract_step(P_best_score, P_best_id, P_scores, P_ids, row)
    {
        max_score = (>?= P_scores[row]);
        where (P_scores[row] == max_score) {
            position = (<?= P_position);
            [row]P_best_score = [position]P_scores[row];
            [row]P_best_id = [position]P_ids[row];
            [position]P_scores[row] = -1;
        }
    }

This has the skeleton:

    S = (>?= P)
    where (P == S)
    {
        S = (<?= P)
        S = [S]P
        S = [S]P
        [S]P = S
    }

We do not have a separate timing figure for [S]P = [S]P, but we note that this could be rewritten as S = [S]P; [S]P = S. The timing is thus as follows:

    Operation    Calls    Time per Call
    ------------------------------------
    (>?= P)      2        207
    where        1        10
    P == S       1        85
    S = [S]P     3        16
    [S]P = S     2        16
    ------------------------------------
    Total                 589

Given this extraction step subroutine, one can easily implement Hutchinson's algorithm:

    rank_hutchinson(dest, P_scores, P_ids)
    {
        P_best_score = -1;
        P_best_id = 0;
        for (row = 0; row < N_ROWS; row++)
            extract_step(P_best_score, P_best_id, P_scores, P_ids, row);
        for (i = 0; i < N_RET; i++) {
            best_of_best = (>?= P_best_score);
            where (P_best_score == best_of_best) {
                position = (<?= P_position);
                dest[i].score = [position]P_best_score;
                dest[i].id = [position]P_best_id;
                [position]P_best_score = -1;
            }
            extract_step(P_best_score, P_best_id, P_scores, P_ids, position);
        }
    }

This has the skeleton:

    P = S
    P = S
    loop (N_ROWS)
        extract_step()
    loop (N_RET)
    {
        S = (>?= P)
        where (P == S) {
            S = (<?= P)
            S = [S]P
            S = [S]P
            [S]P = S
        }
        extract_step()
    }

Its timing is as follows:

    Operation       Calls            Time per Call
    -----------------------------------------------
    (>?= P)         2Nret            207
    P == S          Nret             85
    where           Nret             10
    S = [S]P        2Nret            16
    [S]P = S        Nret             16
    extract_step    Nrows + Nret     589
    -----------------------------------------------
    Total                            1149Nret + 589Nrows

Substituting in the value of Nrows = Ndocs/Nprocs, we arrive at a time of 1149Nret + 589(Ndocs/Nprocs).
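Hutchinson's algorithm can be modeled sequentially with the chapter's example scores (a sketch; the document ids here are made up for illustration):

```python
# Hutchinson's algorithm: per-row maxima feed a small candidate pool,
# and after each winner is taken, its source row is replenished.

def rank_hutchinson(scores_rows, ids_rows, n_ret):
    rows = [list(r) for r in scores_rows]
    best = []                                   # (score, id, source row)

    def extract_step(row):
        m = max(rows[row])                      # (>?= P_scores[row])
        pos = rows[row].index(m)
        best.append((m, ids_rows[row][pos], row))
        rows[row][pos] = -1

    for row in range(len(rows)):                # build initial P_best
        extract_step(row)
    dest = []
    for _ in range(n_ret):
        k = max(range(len(best)), key=lambda j: best[j][0])
        score, doc_id, row = best.pop(k)        # best of the best
        dest.append((score, doc_id))
        extract_step(row)                       # replenish from that row
    return dest

scores = [[88, 16, 87, 10, 94, 4, 21, 11],
          [90, 17, 83, 30, 37, 39, 42, 17],
          [48, 43, 10, 62, 4, 12, 10, 9],
          [83, 98, 1, 38, 78, 37, 17, 55]]
ids = [[r * 8 + c for c in range(8)] for r in range(4)]
top4 = rank_hutchinson(scores, ids, 4)
```

Only one row is scanned per extraction after the initial pass, which is where the Nrows term separates from the Nret term in the timing formula.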
18.4.4 Summary

We have examined three ranking algorithms in this section: the system-defined ranking algorithm, iterative extraction, and Hutchinson's algorithm. Their times are, respectively, 32Nret + 34322r, 348Nret + 209Nret*r, and 1149Nret + 589Nrows. For very small values of Nret, iterative extraction is preferred; for very large values of Nret, the system ranking function is preferred; but in most cases Hutchinson's algorithm will be preferred. Considering various sizes of database, with a 65,536 processor Connection Machine and our standard database parameters, the following rank times should be observed:

    |D|        Ndocs        System       Iterative    Hutchinson
    -------------------------------------------------------------
    1 GB       200 x 10^3   138 ms       24 ms        25 ms
    10 GB      2 x 10^6     1064 ms      137 ms       41 ms
    100 GB     20 x 10^6    10503 ms     1286 ms      203 ms
    1000 GB    200 x 10^6   104751 ms    12764 ms     1821 ms

The time required to rank documents is clearly not an obstacle to the implementation of very large IR systems. The remainder of the chapter will be concerned with several methods for representing and scoring documents; these methods will then use one of the algorithms described above (presumably Hutchinson's) for ranking.

18.5 PARALLEL SIGNATURE FILES

The first scoring method to be considered here is based on parallel signature files. This method is an adaptation of the overlap encoding techniques discussed elsewhere in this book. This file structure has been described by Stanfill and Kahle (1986, 1988, 1990a) and by Pogue and Willett (1987).

Overlap encoded signatures are a data structure that may be quickly probed for the presence of a word. A difficulty associated with this data structure is that the probe will sometimes return present when it should not; this is variously referred to as a false hit or a false drop. Adjusting the encoding parameters can reduce, but never eliminate, this possibility. Depending on the probability of such a false hit, signatures may be used in two manners. First, it is possible to use signatures as a filtering mechanism, requiring a two-phase search in which phase 1 probes a signature file for possible matches and phase 2 re-evaluates the query against the full text of documents accepted by phase 1. Second, if the false hit rate is sufficiently low, it is possible to use signatures in a single-phase system. We will choose our signature parameters in anticipation of the second case but, if the former is desired, the results shown below may still be applied.
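The false-hit rate can be estimated with the standard superimposed-coding approximation (a textbook formula, not one given in this chapter), using the parameter values adopted in the next subsection:

```python
# Standard superimposed-coding estimate of the false-drop rate.
S_BITS, S_WEIGHT, S_WORDS = 4096, 10, 120

# Probability that any single bit is set after inserting S_WORDS words,
# each setting S_WEIGHT (not necessarily distinct) bits:
p_bit_set = 1 - (1 - 1 / S_BITS) ** (S_WORDS * S_WEIGHT)

# A probe for an absent word false-hits only if all S_WEIGHT of its
# bits happen to be set:
p_false_drop = p_bit_set ** S_WEIGHT
```

With these parameters roughly a quarter of the bits are set, and the false-drop probability per probe comes out near one in a million, low enough to motivate the single-phase design chosen below.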
18.5.1 Overlap Encoding

An overlap encoding scheme is defined by the following parameters:

    Sbits      Size of signature in bits
    Sweight    Weight of word signatures
    Swords     Number of words to be inserted in each signature
    Hj(Ti)     A set of Sweight hash functions

Unless otherwise specified, the following values will be used:

    Sbits      4096
    Sweight    10
    Swords     120

A signature is created by allocating Sbits bits of memory and initializing them to 0. To insert a word into a signature, each of the hash functions is applied to the word and the corresponding bits set in the signature. The algorithm for doing this is as follows:

    create_signature(B_signature, words)
    {
        for (i = 0; i < S_BITS; i++)
            B_signature[i] = 0;
        for (i = 0; i < S_WORDS; i++)
            for (j = 0; j < S_WEIGHT; j++)
                B_signature[hash(j, words[i])] = 1;
    }

The timing characteristics of this algorithm will not be presented.
18.5.2 Probing a Signature

To test a signature for the presence of a word, all Sweight hash functions are applied to it and the corresponding bits of the signature are ANDed together. A result of 0 is interpreted as absent and a result of 1 is interpreted as present.

    probe_signature(B_signature, word)
    {
        B_result = 1;
        for (i = 0; i < S_WEIGHT; i++)
            B_result &= B_signature[hash(i, word)];
        return B_result;
    }

This has the skeleton:

    B = S
    loop (S_WEIGHT)
        B &= B

Its timing is:

    Operation    Calls      Time per Call
    --------------------------------------
    B = S        1          3 + 3r
    B &= B       Sweight    3 + 3r
    --------------------------------------
    Total                   3(1 + Sweight)(1 + r)
    Total for Sweight = 10  33 + 33r

The VP ratio will be determined by the total number of signatures in the database. This, in turn, depends on the number of signatures per document.
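Creation and probing can be sketched together in miniature (toy parameters; the md5-based hash functions are our stand-ins for the book's hash(j, word)):

```python
# Overlap encoding in miniature.
import hashlib

S_BITS, S_WEIGHT = 64, 3

def h(j, word):
    """The j-th hash function: a deterministic md5-based stand-in."""
    digest = hashlib.md5(f"{j}:{word}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % S_BITS

def create_signature(words):
    sig = [0] * S_BITS                 # allocate and zero S_BITS bits
    for w in words:
        for j in range(S_WEIGHT):
            sig[h(j, w)] = 1           # set each word's S_WEIGHT bits
    return sig

def probe_signature(sig, word):
    result = 1
    for j in range(S_WEIGHT):
        result &= sig[h(j, word)]      # AND the word's bits together
    return result

sig = create_signature(["this", "is", "yet", "another", "document"])
```

Any inserted word always probes as present; an absent word may false-hit when its bits collide with bits set by other words, which is the false-drop phenomenon discussed above.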
18.5.3 Scoring Algorithms

Documents having more than Swords words must be split into multiple signatures. A randomly chosen document will have length L, and require L/Swords signatures, rounded up to a whole number. If the distribution of L is reasonably smooth, then a good approximation for the average number of signatures per document is Lavg/Swords + 1/2, where Lavg is the average document length. The number of signatures in the database is then Nsigs = Ndocs(Lavg/Swords + 1/2), and the VP ratio is r = Nsigs/Nprocs. We can now compute the average time per query term by substituting this VP ratio into the probe time 33 + 33r derived above.

These signatures can then be placed in consecutive positions, and flag bits used to indicate the first and last positions for each document. For example, given the following set of documents:

    This is the initial document | This is yet another document | Still another document taking yet more space than the others

we might arrive at signatures divided as follows:

    B_signature   [This is the] [initial document] [This is yet] [another document] [Still another] [document taking] [yet more space] [than the others]
    B_first             1              0                 1               0                 1                0                 0                 0
    B_last              0              1                 0               1                 0                0                 0                 1

We can then determine which documents contain a given word by (1) probing the signatures for that word, and (2) using a scan_with_or operation to combine the results across multiple positions. For example, probing for "yet" we obtain the following results:

    B_signature    [This is the] [initial document] [This is yet] [another document] [Still another] [document taking] [yet more space] [than the others]
    B_first              1              0                 1               0                 1                0                 0                 0
    B_last               0              1                 0               1                 0                0                 0                 1
    probe("yet")         0              0                 1               0                 0                0                 1                 0
    scan_with_or         0              0                 1               1                 0                0                 1                 1

The result in the last position of each document is 1 or 0 according to whether any of its signatures contained the word; the value at other positions is not meaningful.
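The probe-then-or-scan step can be modeled in plain Python with the example above (a sketch): a segmented running OR whose value at each document's last position says whether any of that document's signatures matched.

```python
def scan_with_or(bits, b_first):
    """Segmented running OR; a 1 in b_first starts a new document."""
    out, acc = [], 0
    for b, first in zip(bits, b_first):
        if first:
            acc = 0                    # segment flag resets the running OR
        acc |= b
        out.append(acc)
    return out

B_first = [1, 0, 1, 0, 1, 0, 0, 0]
B_last  = [0, 1, 0, 1, 0, 0, 0, 1]
probe   = [0, 0, 1, 0, 0, 0, 1, 0]    # per-signature probe results for "yet"
combined = scan_with_or(probe, B_first)
per_document = [v for v, last in zip(combined, B_last) if last]
```

Reading per_document off at the B_last positions gives one answer per document: the first document does not contain "yet", the other two do.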
The algorithm for this is as follows:

    probe_document(B_signature, B_first, word)
    {
        B_local = probe_signature(B_signature, word);
        B_result = scan_with_or(B_local, B_first);
        return B_result;
    }

This is the skeleton:

    B = probe_signature()
    B = scan_with_or()

Its timing is:

    Operation          Calls    Time per Call
    ------------------------------------------
    probe_signature    1        33 + 33r
    scan_with_or       1        632 + 56r
    ------------------------------------------
    Total                       665 + 89r

Using the above building blocks, it is fairly simple to construct a scoring algorithm. In this algorithm a query consists of an array of terms; each term consists of a word and a weight. The score for a document is the sum of the weights of the words it contains. It may be implemented thus:

    score_document(B_signature, B_first, terms)
    {
        P_score = 0;
        for (i = 0; i < N_TERMS; i++) {
            B_probe = probe_document(B_signature, B_first, terms[i].word);
            where (B_probe)
                P_score += terms[i].weight;
        }
        return P_score;
    }

This has the skeleton:

    P = S
    loop (N_TERMS) {
        B = probe_document()
        where (B)
            P += S
    }

Its timing characteristics are as follows:

    Operation         Calls     Time per Call
    ------------------------------------------
    P = S             1         3 + 15r
    probe_document    Nterms    665 + 89r
    where             Nterms    8 + 2r
    P += S            Nterms    3 + 28r
    ------------------------------------------
    Total                       3 + 676Nterms + (15 + 119Nterms)r

It is straightforward to implement Boolean queries with the probe_document operation outlined above. As a simplification, each query term may be: (1) a binary AND operation, (2) a binary OR operation, (3) a NOT operation, or (4) a word. Times for Boolean queries will be slightly less than times for document scoring. Here is a complete Boolean query engine:

    query(B_signature, B_first, term)
    {
        arg0 = term->args[0];
        arg1 = term->args[1];
        switch (term->connective) {
        case AND:
            return query(B_signature, B_first, arg0) &&
                   query(B_signature, B_first, arg1);
        case OR:
            return query(B_signature, B_first, arg0) ||
                   query(B_signature, B_first, arg1);
        case NOT:
            return ! query(B_signature, B_first, arg0);
        case WORD:
            return probe_document(B_signature, B_first, term->word);
        }
    }

The timing characteristics of this routine will not be considered in detail; it should suffice to state that one call to probe_document is required for each word in the query, and that probe_document accounts for essentially all the time consumed by this routine.

18.5.4 An Optimization

The bulk of the time in the signature scoring algorithm is taken up by the probe_document operation. The bulk of the time for that operation, in turn, is taken up by the scan_with_or operation, which is performed once per query term. We should then seek to pull the scan operation outside the query-term loop. This may be done by (1) computing the score for each signature independently, then (2) summing the scores at the end. This is accomplished by the following routine:

    score_document(B_signature, B_first, terms)
    {
        P_score = 0;
        for (i = 0; i < N_TERMS; i++) {
            B_probe = probe_signature(B_signature, terms[i].word);
            where (B_probe)
                P_score += terms[i].weight;
        }
        P_score = scan_with_add(P_score, B_first);
        return P_score;
    }

This has the skeleton:

    P = S
    loop (N_TERMS) {
        B = probe_signature()
        where (B)
            P += S
    }
    P = scan_with_add()

The timing characteristics are as follows:

    Operation          Calls     Time per Call
    -------------------------------------------
    P = S              1         3 + 15r
    probe_signature    Nterms    33 + 33r
    where              Nterms    8 + 2r
    P += S             Nterms    3 + 28r
    scan_with_add      1         740 + 170r
    -------------------------------------------
    Total                        (743 + 44Nterms) + (185 + 63Nterms)r

Comparing the two scoring algorithms, we see:

    Basic Algorithm       (3 + 676Nterms) + (15 + 119Nterms)r
    Improved Algorithm    (743 + 44Nterms) + (185 + 63Nterms)r

The dominant term in the timing formula, Nterms*r, has been reduced from 119 to 63 so, in the limit, the new algorithm is 1.9 times faster.

The question arises, then, as to what this second algorithm is computing. If each query term occurs no more than once per document, the two algorithms compute the same result. If, however, a query term occurs in more than one signature per document, it will be counted double or even treble, and the score of that document will, as a consequence, be elevated. Properly controlled, this feature of the algorithm might, in fact, be beneficial in that it yields an approximation to document-term weighting. In any event, it is a simple matter to delete duplicate word occurrences before creating the signatures.

18.5.5 Combining Scoring and Ranking

The final step in executing a query is to rank the documents using one of the algorithms noted in the previous section. Those algorithms assumed, however, that every position contained a document score, so use of the previously explained algorithms requires some slight adaptation. The simplest such adaptation is to pad the scores out with -1. In addition, if Hutchinson's ranking algorithm is to be used, it will be necessary to force the system to view a parallel score variable at a high VP ratio as an array of scores at a VP ratio of 1; the details are beyond the scope of this discussion.
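The optimized scheme plus this padding can be sketched together in plain Python (the probe results and weights below are made up): score every signature locally, run one segmented add-scan, then pad all positions except each document's last with -1 for the ranker.

```python
def score_and_pad(term_hits, weights, b_first, b_last):
    """Per-signature scoring, one segmented add-scan, then -1 padding."""
    n = len(b_first)
    p_score = [0] * n
    for hits, w in zip(term_hits, weights):    # one probe per term, no scan here
        for pos in range(n):
            if hits[pos]:
                p_score[pos] += w
    out, acc = [], 0
    for pos in range(n):                       # single scan_with_add at the end
        if b_first[pos]:
            acc = 0
        acc += p_score[pos]
        out.append(acc)
    return [s if last else -1 for s, last in zip(out, b_last)]

B_first = [1, 0, 1, 0]                         # two documents, two signatures each
B_last  = [0, 1, 0, 1]
hits_t1 = [0, 1, 0, 0]                         # hypothetical hits for term 1, weight 2
hits_t2 = [1, 0, 1, 0]                         # hypothetical hits for term 2, weight 3
padded = score_and_pad([hits_t1, hits_t2], [2, 3], B_first, B_last)
```

The -1 sentinels never win a global-maximum step, so any of the ranking algorithms above can consume this padded score vector directly.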
18.5.5 Combining Scoring and Ranking

The final step in executing a query is to rank the documents using one of the algorithms noted in the previous section. Those algorithms assumed, however, that every position contained a document score, while the signature algorithm leaves us with only the last position of each document containing a score. Use of the previously explained algorithms thus requires some slight adaptation. The simplest such adaptation is to pad the scores out with -∞. In addition, if Hutchinson's ranking algorithm is to be used, it will be necessary to force the system to view a parallel score variable at a high VP ratio as an array of scores at a VP ratio of 1; the details are beyond the scope of this discussion.

Taking into account the VP ratio used in signature scoring, the ranking time will be:

[equation]

Substituting the standard values for Nterms, Nret, and Swords gives us a scoring time of:

[equation]

and a ranking time of:

[equation]

The times for various sizes of database, on a machine with 65,536 processors, are as follows:

D          Ndocs         Score      Rank       Total
1 GB       200 X 10^3    9 ms       28 ms      37 ms
10 GB      2 X 10^6      74 ms      75 ms      149 ms
100 GB     20 X 10^6     723 ms     545 ms     1268 ms
1000 GB    200 X 10^6    7215 ms    5236 ms    12451 ms

18.5.6 Extension to Secondary/Tertiary Storage

It is possible that a signature file will not fit in primary storage, either because it is not possible to configure a machine with sufficient memory or because the expense of doing so is unjustified. In such cases it is necessary that the signature file reside on either secondary or tertiary storage. Such a file can then be searched by repetitively (1) transferring signatures from secondary storage to memory, (2) using the above signature-based algorithms to score the documents, and (3) storing the scores in a parallel array. When the full database has been passed through memory, any of the above ranking algorithms may be invoked to find the best matches. There will, however, be the added expense of reading the signature file into primary memory.

If RIO is the I/O rate in megabytes per second, and c is the signature file compression factor (q.v. below), then the time to read a signature file through memory will be:

[equation]

For a fully configured CM-2, RIO = 200. The signature parameters we have assumed yield a compression factor c = 30 percent (q.v. below). This leads to the following I/O times:

D          1 GB     10 GB     100 GB     1000 GB
I/O Time   2 sec    15 sec    150 sec    1500 sec

Comparing the I/O time with the compute time, it is clear that this method is I/O bound. As a result, it is necessary to execute multiple queries in one batch in order to make good use of the compute hardware. The algorithms described above need to be modified somewhat, but the compute time should be unchanged. This is done by repeatedly (1) transferring signatures from secondary storage to memory, (2) calling the signature-based scoring routine once for each query, and (3) saving the scores produced for each query in a separate array. When all signatures have been read, the ranking algorithm is called once for each query. Given the above parameters, executing batches of 100 queries seems reasonable, yielding the following times:

D          I/O Time    Search Time (100 queries)    Total
1 GB       2 sec       4 sec                        6 sec
10 GB      15 sec      15 sec                       30 sec
100 GB     150 sec     127 sec                      277 sec
1000 GB    1500 sec    1245 sec                     2745 sec

This has not, however, proved an attractive search method.

18.5.7 Effects of Signature Parameters

We will now evaluate the effects of signature parameters on storage requirements and the number of false hits. It is guaranteed that, if a word is inserted into a signature, probing for it will return present. It is possible, however, for a probe to return present for a word that was never inserted. This is referred to variously as a false drop or a false hit. The probability of a false hit depends on the size of the signature, the number of hash codes, and the number of bits set in the table. The number of bits actually set depends, in turn, on the number of words inserted into the table. The following approximation has proved useful:

[equation]

A megabyte of text contains, on the average, Rdocs documents, each of which requires an average of [expression] signatures. Each signature, in turn, requires [expression] bytes of storage. Multiplying the two quantities yields the number of bytes of signature space required to represent 1 megabyte of input text. This gives us a compression factor of:

[equation]

(The compression factor is defined as the ratio of the signature file size to the full text.)

There is a trade-off between the false hit probability and the amount of space required for the signatures. As more words are put into each signature (i.e., as Swords increases), the total number of signatures decreases while the probability of a false hit increases. If we multiply the number of signatures per megabyte by Pfalse, we get the expected number of false hits per megabyte:

[equation]

We can now examine how varying Swords alters the false hit probability and the compression factor. Assuming Sweight = 10 and Sbits = 4096, we get:

Swords    Signatures/MB    Compression    Pfalse           False hits/GB
40        1540             77%            4.87 X 10^-11    7.50 X 10^-5
80        820              42%            3.09 X 10^-8     2.50 X 10^-2
120       580              30%            1.12 X 10^-6     6.48 X 10^-1
160       460              24%            1.25 X 10^-5     5.75 X 10^0
200       388              20%            7.41 X 10^-5     2.88 X 10^1
240       340              17%            2.94 X 10^-4     1.00 X 10^2
280       306              16%            8.88 X 10^-4     2.72 X 10^2
320       280              14%            2.20 X 10^-3     6.15 X 10^2

Since the computation required to probe a signature is constant regardless of the size of the signature, doubling the signature size will (ideally) halve the number of signatures and consequently halve the amount of computation. The degree to which computational load may be reduced by increasing signature size is, however, limited by its effect on storage requirements.

Signature representations may also be tuned by varying Sbits and Swords in concert. As long as Sbits = kSwords for some constant k, the false hit rate will remain approximately constant. For example, keeping Sbits = 34.133Swords and Sweight = 10 and varying Swords, we get the following values for Pfalse:

Swords    Sbits    Pfalse
80        2731     1.1163 X 10^-6
120       4096     1.1169 X 10^-6
160       5461     1.1172 X 10^-6
Keeping Sbits = 34.133Swords and Sweight = 10 and varying Swords, we get the following compression rates:

Swords    Sbits    c
60        2048     27%
120       4096     30%
240       8192     35%

Clearly, for a fixed k (hence, a fixed false hit rate), storage costs increase as Sbits increases, and it is not feasible to increase Sbits indefinitely. In any event, it appears that a signature size of 4096 bits is reasonable.

18.5.8 Discussion

The signature-based algorithms described above have a number of advantages and disadvantages. There are two main disadvantages. First, for single queries, the I/O time will overwhelm the query time. This limits the practical use of parallel signature files to relatively small databases which fit in memory, although, as the cost of random access memory continues to fall, the restriction that the database fit in primary memory may become less important. Second, signatures do not support general document-term weighting, as noted by Salton and Buckley (1988) and by Croft (1988), a problem that may produce results inferior to those available with full document-term weighting and normalization.

Parallel signature files do, however, have several strengths that make them worthy of consideration for some applications. First, constructing and updating a signature file is both fast and simple: to add a document, we simply generate new signatures and append them to the file. This makes them attractive for databases which are frequently modified. Second, the signature algorithms described above make very simple demands on the hardware: all local operations can be easily and efficiently implemented using bit-serial SIMD hardware, and the only nonlocal operation, scan_with_add, can be efficiently implemented with very simple interprocessor communication methods which scale to very large numbers of processors. Third, as pointed out by Stone (1987), signature representations work well with serial storage media such as tape. Given recent progress in the development of high-capacity, high-transfer rate, low-cost tape media, this ability to efficiently utilize serial media may become quite important.

18.6 PARALLEL INVERTED FILES

An inverted file is a data structure that, for every word in the source file, contains a list of the documents in which it occurs. For example, the following source file:

 _____________________________________________________________________
|                     |                     |                         |
| This is the initial | This is yet another | Still another document  |
| document            | document            | taking yet more space   |
|                     |                     | than the others         |
|_____________________|_____________________|_________________________|

has the following inverted index:

another     1 2
document    0 1 2
initial     0
is          0 1
more        2
others      2
space       2
still       2
taking      2
than        2
the         0 2
this        0 1
yet         1 2

Each element of an inverted index is called a posting, and minimally consists of a document identifier. Postings may contain additional information needed to support the search method being implemented; for example, if document-term weighting is used, each posting must contain a weight. In the event that a term occurs multiple times in a document, the implementer must decide whether to generate a single posting or multiple postings. For IR schemes based on document-term weighting, the former is preferred; for schemes based on proximity operations, the latter is most useful.

The two inverted file algorithms described in this chapter differ in (1) how they store and represent postings, and (2) how they process postings.

18.6.1 Data Structure

The parallel inverted file structure proposed by Stanfill, Thau, and Waltz (1989) is a straightforward adaptation of the conventional serial inverted file structure. A parallel inverted file is a parallel array of postings such that the postings for a given word occupy contiguous positions within a contiguous series of rows, plus an index structure indicating the start row, start position, end row, and end position of the block of postings for each word. For example, given the database and inverted file shown above, the following parallel inverted file would result:

Postings
----------
1 2 0 1
2 0 0 1
2 2 2 2
2 2 0 2
0 1 1 2

Index
------------------------------------------------------------------
Word        First Row    First Position    Last Row    Last Position
another     0            0                 0           1
document    0            2                 1           0
initial     1            1                 1           1
is          1            2                 1           3
more        2            0                 2           0
others      2            1                 2           1
space       2            2                 2           2
still       2            3                 2           3
taking      3            0                 3           0
than        3            1                 3           1
the         3            2                 3           3
this        4            0                 4           1
yet         4            2                 4           3

In order to estimate the performance of algorithms using this representation, it is necessary to know how many rows of postings need to be processed. The following discussion uses these symbols:

Pi        The number of postings for term Ti
Ri        The number of rows in which postings for Ti occur
R_BAR     The average number of rows per query term
(r, p)    A row-position pair

Assume the first posting for term Ti is stored starting at (r, p). The last posting for Ti will then be stored at

[equation]

and the number of rows occupied by Ti will be

[equation]

Assuming p is uniformly distributed between 0 and Nprocs - 1, the expected value of this expression is

[equation]

From our frequency distribution model we know Ti occurs f(Ti) times per megabyte, so Pi = |D| f(Ti). This gives us:

[equation]

Taking into account the random selection of query terms (the random variable Q), we get a formula for the average number of rows per query term:

[equation]

Also, from the distribution model, f(Q) = Z. This gives us:

[equation]
18.6.2 The Scoring Algorithm

The scoring algorithm for parallel inverted files involves using both left- and right-indexing to increment a score accumulator. We start by creating an array of score registers, such as is used by Hutchinson's ranking algorithm. Each document is assigned a row and a position within that row; for example, document i might be mapped to row i div Nprocs, position i mod Nprocs. Each posting is then modified so that, rather than containing a document identifier, it contains the row and position to which it will be sent. The send-with-add operation is then used to add a weight to the score accumulator. The algorithm is as follows:

score_term (P_scores, P_postings, term)
{
    for (row = term.start_row; row <= term.end_row; row++) {
        if (row == term.start_row)
            start_position = term.start_position;
        else
            start_position = 0;
        if (row == term.end_row)
            end_position = term.end_position;
        else
            end_position = N_PROCS - 1;
        where ((start_position <= P_position) && (P_position <= end_position)) {
            P_dest_row = P_postings[row].dest_row;
            P_dest_pos = P_postings[row].dest_pos;
            [P_dest_pos] P_scores [P_dest_row] += term.weight;
        }
    }
}

The inner loop of this algorithm will be executed, on the average, R_BAR times. This yields the following skeleton:

loop (R_BAR) {
    where ((S <= P) && (P <= S))    /* Also 1 B && B operation */
        [P] P [P] += S
}

This has the following timing characteristics:

[table]

Taking into account the value of R_BAR yields the following time per query term:

[equation]

Substituting our standard value for R_BAR, we get times for a 65,536 processor CM-2. Times for scoring 10 terms and for ranking are also included; finally, the total retrieval time for a 10-term query is shown:

|D|        Time      10 Terms    Rank       Total
1 GB       2 ms      25 ms       25 ms      50 ms
10 GB      3 ms      34 ms       41 ms      75 ms
100 GB     13 ms     131 ms      203 ms     334 ms
1000 GB    110 ms    1097 ms     1821 ms    2918 ms

18.6.3 Document Term Weighting

Up to now, the algorithms we have discussed support only binary document models, in which the only information encoded in the database is whether a given term appears in a document or not. The parallel inverted file structure can also support document term weighting, in which each posting incorporates a weighting factor; this weighting factor measures the strength of usage of a term within the document. The following variant on the query execution algorithm is then used:

score_weighted_term (P_scores, P_postings, term)
{
    for (row = term.start_row; row <= term.end_row; row++) {
        if (row == term.start_row)
            start_position = term.start_position;
        else
            start_position = 0;
        if (row == term.end_row)
            end_position = term.end_position;
        else
            end_position = N_PROCS - 1;
        where ((start_position <= P_position) && (P_position <= end_position)) {
            P_dest_row = P_postings[row].dest_row;
            P_dest_pos = P_postings[row].dest_pos;
            P_weight = term.weight * P_postings[row].weight;
            [P_dest_pos] P_scores [P_dest_row] += P_weight;
        }
    }
}

This algorithm requires only slightly more time than the unweighted version.
18.7 PARTITIONED POSTING FILES

18.8 SECONDARY STORAGE

One major advantage of inverted files is that it is possible to query them without loading the entire file into memory. The algorithms shown above have assumed that the sections of the file required to process a given query are already in memory. While a full discussion of the evolving field of I/O systems for parallel computing is beyond the scope of this paper, a brief presentation is in order. In the final analysis, most of what is known about I/O systems with large numbers of disks (e.g., mainframe computers) will probably hold true for parallel systems. This discussion is oriented towards the partitioned posting file representation.

18.8.1 Single-Transfer I/O on Disk Arrays

I/O systems for parallel computers are typically built from large arrays of simple disks. Consider, for example, a 64K processor Connection Machine with 8 disk arrays operating in single-transfer mode. The CM-2 supports a disk array called the Data Vault (a trademark of Thinking Machines Corporation) which contains 32 data disks plus 8 ECC disks. It may be thought of as a single disk drive with an average latency of 200 milliseconds and a transfer rate of 25 MB/second. Up to 8 Data Vaults may be simultaneously active, yielding a transfer rate of up to 200 MB/second. This access method achieves very high transfer rates, but does not yield many I/O's per second.

For this algorithm, the disk system is called on to simply read partitions into memory. The average query term requires NP partitions to be loaded. These partitions may be contiguously stored on disk, so the entire group of partitions may be transferred in a single operation. Assume we are using the partitioned posting file representation, that each posting requires 4 bytes of storage, plus our other standard assumptions. The storage required by each partition is then 4FNprocs. The time is then:

[equation]

Given a seek time of 200 milliseconds and a transfer rate of 200 MB/second, the following per-term times will result:

D          Seek      Transfer    Score
1 GB       200 ms    5 ms        1 ms
10 GB      200 ms    5 ms        1 ms
100 GB     200 ms    11 ms       2 ms
1000 GB    200 ms    65 ms       9 ms

With the seek time essentially independent of database size, this can be crippling for all but the very largest databases.
Under these circumstances, the system is severely seek-bound. Disk arrays, operating in single-transfer mode, do not provide a sufficiently large number of I/O's per second to match available processing speeds until database sizes approach a thousand gigabytes or more. As a result, parallel inverted file algorithms are restricted to databases which either fit in primary memory or are large enough that the high latency time is less of an issue.

18.8.2 Independent Disk Access

Fortunately, the disk arrays contain buried in them the possibility of solving the problem. Each Data Vault has 32 disks embedded in it; a system with 8 disk arrays thus has a total of 256 disks. It is possible to access these disks independently. Under this I/O model, each disk transfers a block of data into the memories of all processors. The latency is still 200 milliseconds, but 256 blocks of data will be transferred rather than 1. This has the capability of greatly reducing the impact of seek times on system performance. At this point in time, independent disk access methods for parallel computers are still in development; Stanfill and Thau (1991) have arrived at some preliminary results, but considerable work is required to determine their likely performance in the context of information retrieval.

18.9 SUMMARY

The basic algorithmic issues associated with implementing information retrieval systems on databases of up to 1000 GB may be considered solved at this point in time. The largest difficulty remaining in the implementation of parallel inverted file algorithms is the I/O system. Multi-transfer I/O systems have the potential to solve this problem, but are not yet available for data parallel computers. These problems are very likely to find solution in the next few years. It should be clear at this point that the engineering and algorithmic issues involved in building large-scale information retrieval systems are well on their way to solution and that, over the next decade, we can reasonably look forward to interactive access to text databases, no matter how large they may be.

REFERENCES

CROFT, W. B. (1988). Implementing Ranking Strategies Using Text Signatures. ACM Transactions on Office Information Systems, 6(1), 42-62.

HILLIS, D. (1985). The Connection Machine. Cambridge, MA: MIT Press.

HILLIS, D., & STEELE, G. (1986). Data Parallel Algorithms. Communications of the ACM, 29(12), 1170-1183.

HUTCHINSON, J. (1988). Personal Communications.

POGUE, C., & WILLETT, P. (1987). Use of Text Signatures for Document Retrieval in a Highly Parallel Environment. Parallel Computing, 4, 259-268.

SALTON, G., & BUCKLEY, C. (1988). Parallel Text Search Methods. Communications of the ACM, 31(2), 202-215.

STANFILL, C. (1988a). Parallel Computing for Information Retrieval: Recent Developments. Technical Report DR88-1. Cambridge, MA: Thinking Machines Corporation.

STANFILL, C. (1990a). Partitioned Posting Files: a Parallel Inverted File Structure for Information Retrieval. Paper presented at the International Conference on Research and Development in Information Retrieval, Brussels, Belgium.

STANFILL, C. (1990b). Information Retrieval Using Parallel Signature Files. IEEE Data Engineering Bulletin, 13(1), 33-40.

STANFILL, C., & KAHLE, B. (1986). Parallel Free-Text Search on the Connection Machine System. Communications of the ACM, 29(12), 1229-1239.

STANFILL, C., & THAU, R. (1991). Information Retrieval on the Connection Machine: 1 to 8192 Gigabytes. Information Processing and Management, 27(4), 285-310.

STANFILL, C., THAU, R., & WALTZ, D. (1989). A Parallel Indexed Algorithm for Information Retrieval. Paper presented at the International Conference on Research and Development in Information Retrieval, Cambridge, MA.

STONE, H. (1987). Parallel Querying of Large Databases: a Case Study. Computer, 20(10), 11-21.

Thinking Machines Corporation. (1987). Connection Machine Model CM-2 Technical Specifications. Cambridge, MA: Thinking Machines Corporation.

Thinking Machines Corporation. (1990). C* Programming Guide. Cambridge, MA: Thinking Machines Corporation.