Previous Section
 < Free Open Study > 
Next Section


Case Study

Building an Index

Problem Our publisher has asked us to produce an index for this book. The first step in this process is to decide which words should go into the index; the second step is to produce a list of the pages where each word occurs.

Instead of trying to choose words out of our heads (thin air), we decided to let the computer produce a list of all unique words used in the manuscript and their frequency of occurrence. We could then review the list and choose which words to include in the index.

Discussion Clearly, the main object in this problem is a word with associated frequency. Therefore, the first thing we must do is define a "word."

Looking back over the preceding paragraphs, what is a tentative definition of word in this context? How about "something between two blanks"? Or better yet, a "character string between two blanks"? That definition works for most of the words. However, all words before '.' and ',' would have the '.' and ',' attached. Also, words surrounded by quotes would cause a problem.

Does the following definition take care of the problem?

A word is a string of alphanumeric characters between markers where markers are whitespace and all punctuation marks.

Yes, it is a good working definition of the kind of word that would be a candidate for an index term. In fact, this definition would allow us to use the StrType defined in Chapter 2. The option ALPHA_NUM in GetString defines a string as containing only alphanumeric characters; any non-alphanumeric character stops the reading. If we skip all unallowed characters before getting the string, we should have exactly what we want.

This process ignores quotation marks, leaving only contractions as a problem. Let's examine a few and see if we can find a solution. The common contractions "let's," "couldn't," "can't," and "that's" all have only one letter after the single quote. The algorithm would return the characters up to the single quote as one word and the character between the single quote and the double quote as one word. What we really want to do is ignore the characters after the single quote. By saying that words must be at least three characters long to be considered for the index, we solve this problem. Ignoring words of fewer than three letters also removes from consideration such words as "a," "is," "to," "do," and "by" that do not belong in an index.

Brainstorming As usual, our first step is to list objects that might be useful in solving the problem. Scanning the problem statement, we identify the following nouns: publisher, index, text, word, list, pages, heads, computer, manuscript, frequency, and occurrence. Clearly, some of these nouns set the stage for the problem and are not part of the solution. Removing those nouns leave us with manuscript, word, list, and frequency.

Scenarios There is really only one scenario for this problem: Read a file (the manuscript), break it into words, process the words, and output the results. To process each word, we check its length. If it is three characters or longer, we check whether it is a word that we have processed before. If it is, we increment its frequency; if not, we add it to the list of words with a frequency of 1. The limit of three characters is rather arbitrary, however. Let's let the user enter the minimum number of characters.

Although frequency is a noun, it is a property of a word in this case. Let's combine word and frequency into a object. We need a container object (list) in which to store the items of . We can use any of the list ADTs we have written. To have the output file list the words in alphabetic order, we should use the Sorted List ADT. Here, then, are the CRC cards for and :

Click To expand
Click To expand

We are now ready to summarize our discussion in the main driver function.

Driver (Main)

Open input file
Open output file
Get file label
Print file label on output file
Get minimum word length
while more data
   string.GetStringFile(TRUE, ALPHA_NUM, input file)
   if string.Lengthls() >= minimum word length
        Initialize WordType object with string
        list.RetrieveItem(wordObject, found)
        if found
             Increment count of wordObject
        else
             list.lnsertltem(wordObject)
list.Print(output file)

Oops! Our design has a major flaw. Regardless of which list ADT we use, RetrieveItem returns a copy of the item in the list. Incrementing the returned item simply increments a copy of the item in the list. Therefore, all the frequencies would end up being 1. In fact, this problem really doesn't lend itself to using one of our list ADTs. The processing would be much more efficient if we write a single function that searches the list for a string and increments the count if it finds it and inserts a node with the string if it doesn't. When the search finds that the string is not there, it is already at the point where the node belongs.

This discussion brings up a very important point: Sometimes it is not appropriate to use an off-the-shelf container class. Using library classes-whether one provided by C++ or your own user-defined class-allows you to write more reliable software in a shorter amount of time. These classes are already tested and debugged. If they fit the needs of your problem, use them. If they do not, then write a special-purpose function to do the job. In this case, we need to write our own. The revised CRC card for the container object follows.

Click To expand

Here is a revised main function:

Open input file
Open output file
Get file label
Print file label on output file
Get minimum word length
while more data
   string.GetStringFile(TRUE, ALPHA_NUM, inFile)
   if string.Lengthls() >= minimum word length
        list.InsertOrIncrement(tree, string)
list.Print(output file)

Before going any further, we must decide on an implementation structure for ListType. No limit has been placed on the number of items in the list, so a linked implementation is appropriate. For each word, the list must be searched, either to insert a new word or to increment an already identified word's frequency. A tree-based list would be the most efficient because its search has O(logN) complexity.

In our original design, we made be a class with member functions to initialize itself, compare itself, and increment its frequency. We are using StrType for the words so we can compare two words using the relational operators. Because the container class is being designed especially for this problem, let's make WordType be a struct rather than a class and let the list be responsible for the processing.

InsertOrlncrement

if tree is NULL
     Get a new node for tree to point to
     Set word member of Info(tree) to string
     Set count member of Info(tree) to 1
     Set Left(tree) to NULL
     Set Right(tree) to NULL
else if word member of Info(tree) is equal to the string
     Increment count member of Info(tree)
else if string is less than the word member of Info(tree)
     InsertOrIncrement(Left(tree), string)
else
     InsertOrIncrement(Right(tree), string)

Print

if tree is not NULL
     Print(Left(tree), outFile)
     word member of Info(tree).PrintToFile(TRUE, outFile)
     outFile << "" << count
     Print(Right(tree), outFile)

We are now ready to code our algorithms.

#include <fstream>
#include "StrType.h"
#include <cstddef>
#include <iostream>
#include <string>

struct WordType
{
public:
  StrType word;
  int count;
};

struct TreeNode
{
  WordType info;
  TreeNode* left;
  TreeNode* right;
};

class ListType
{
public:
  ListType();
  void InsertOrIncrement(StrType string);
  void Print(std::ofstream&) const;
private:
  TreeNode* root;
};

int main()
{
  using namespace std;
  ListType list;
  string inFileName;
  string outFileName;
  string outputLabel;
  ifstream inFile;
  ofstream outFile;
  StrType string;
  int minimumLength;

  // Prompt for file names, read file names, and prepare files.
  cout << "Enter name of input file; press return." << end1;
  cin >> inFileName;
  inFile.open(inFileName.c_str());

  cout << "Enter name of output file; press return." << end1;
  cin >> outFileName;
  outFile.open(outFileName.c_str());

  cout << "Enter name of test run; press return." << end1;
  cin >> outputLabel;
  outFile << outputLabel << end1;

  cout << "Enter the minimum size word to be considered." << end1;
  cin >> minimumLength;

  string.GetStringFile(true, ALPHA_NUM, inFile);
  while (inFile)
  {
    if (string.LengthIs() >= minimumLength)
      list.InsertOrIncrement(string);
    string.GetStringFile(true, ALPHA_NUM, inFile);
  }

  list.Print(outFile);
  outFile.close();
  inFile.close();
  return 0;
}

ListType::ListType()
{
  root - NULL;
}

void Process(TreeNode*& tree, StrType string)
{
  if (tree == NULL)
  {
    tree = new TreeNode;
    tree->info.word = string;
    tree->info.count = 1;
    tree->left = NULL;
    tree->right = NULL;
  }
  else if (tree->info.word == string)
    tree->info.count++;
  else if (string < tree->info.word)
    Process(tree->left, string);
  else
    Process(tree->right, string);
}

void ListType::InsertOrIncrement(StrType string)
// Increments frequency count if the string is in the tree;
//  otherwise, inserts word with a frequency of 1.
}
  Process(root, string);
}
void Print(TreeNode* tree, std::ofstream& outFile)
{

  if (tree !- NULL)
  {
    Print(tree->left, outFile);
    tree->info.word.PrintToFile(true, outFile);
    outFile << " " << tree->info.count;
    Print(tree->right, outFile);
  }
}

void ListType::Print(std::ofstream& outFile) const
// Prints the words in the tree and their frequency counts.
(
  Print(root, outFile);
}

We leave the rest of the problem of creating the index as a programming assignment.

Testing As a test plan for this program, we can take a text file, manually calculate the words and frequencies, and run the program to check the answers. On the Web, the file Frequency.out contains the result of running the program on the file Words.cpp, using the file History.in as input data.



Previous Section
 < Free Open Study > 
Next Section
Converted from CHM to HTML with chm2web Pro 2.85 (unicode)