Intelligent Search Agents

INTELLIGENT SEARCH AGENTS

BY DARREN GREAVES

5. IMPLEMENTATION DETAILS

5.1. INTRODUCTION

In this section I will document the development of the program. As I am using the prototyping development method I have broken it into three major sections, one for each prototype. Each section is then further broken down.

5.2. PROTOTYPE ONE

5.2.1. INTRODUCTION

The plan for the initial prototype is to build a simple front end for the program and try to develop a basic functionality. The reasoning behind this is to test the feasibility of certain functions, and to create a basic design for the class structure I intend to use. Any useful code can always be re-used in later prototypes.

5.2.2. PROGRAM CREATION USING MSVC

The first step in creating a Windows program that uses MFC in MSVC is to use the AppWizard. The AppWizard is a series of steps that guide the user through creating a basic program to which more functionality can be added. AppWizard allows the creation of three different types of Windows programs. A dialog based program, a single document interface (SDI) program or a multi-document interface (MDI) program.
The SDI and MDI type programs are very similar. They are used to create resizable windows that are commonly used to hold documents of some kind but can be used for many different types of programs. The difference between the two types is that SDI can only hold one document at a time (example: Windows Notepad), and MDI can hold multiple documents (example: Microsoft Word). For creating programs that need a resizable window but don't need to hold a document of any kind the SDI type is commonly used.
There are some differences between the SDI/MDI types and the dialog but the main one related to my project is the fact that the dialog based programs are not resizable and the SDI/MDI based programs are. As I was unsure about what visual appearance my program was going to take I chose a dialog based program, these are generally easier to develop as they involve a simpler class hierarchy. Dialog based programs generally tend to be simple programs, examples of Windows programs that use the dialogs are the applications in the Windows Control Panel.
The basic class structure of the three different types of programs is as shown in figure 4. It can be seen that all three are derived from the CWnd class which is the base class for all visual windows in MFC, but the SDI/MDI programs have a more complex class structure as they interact with classes to handle the storage of documents and views of the document. The CDialog class hierarchy is ideal for creating simple programs as it doesn't have the extra baggage of the SDI/MDI classes.

5.2.3. CREATION OF USER INTERFACE

So, having created a basic dialog program the next step is to create the user interface for it. Figure 5 shows the interface I created for the initial prototype.

5.2.4. CREATION OF MEMBER FUNCTIONS

Having created the basic interface I then created the member functions and member variables to deal with the interface. MSVC provides a feature called ClassWizard to link functions and variables to the user interface elements. The basic functionality I needed at the start was to take input from the user on what they wished to search for and provide results of any matches. I left the filter handling and the different methods of matching to one side for now as I could add them later.
This to me is one of the good things about the prototyping model; the ability to add features and functionality to a program as required. This ability enabled me to rapidly create a basic program and thus test the feasibility of what I thought may be the main stumbling blocks of the program, without wasting time on the time-consuming and less important elements.
All the functions and variables created go straight into the CDialog derived class. This procedure only takes a few minutes using ClassWizard. Having done this the next step is to write some code to do something useful with the functions created. The initial functions created were to handle the user pressing the Search button. The function needed to check that the user had entered valid data and then do something useful with the data.

5.2.5. BUILD A CLASS STRUCTURE

The next stage of the program is to create a class structure to handle the other functions of the program. This would be done according to the design I had already mapped out. I needed to two create additional classes, one each for the Filter Manager and the Search Engine.
The Filter Manager would handle passing data to each filter and passing results back to the main CDialog derived class to be output to the screen.
The Search Engine would handle requesting pages from the intranet or Internet and searching each page for links to further pages. As a link to a page was found the page would be read in from the intranet/Internet and passed to the Filter Manager.
First I created the class for the Filter Manager. It is possible to use ClassWizard to create a new class but only if it is derived from an existing MFC class. As I had no need to derive from an MFC class, I decided to create the new class by hand. This is simply a matter of adding new files to the project and writing in the class definition. I created the Filter Manager class this way and then added member functions for the CDialog derived class to pass in the required data.

5.2.6. DEVELOPMENT OF SEARCH ENGINE

In this initial prototype I was not implementing any code for the filters themselves. This would be left for either the second or third prototype. For this prototype I wanted to concentrate on the Search Engine. Therefore, the Filter Manager would be a very simple class that just accepted the data from the user and passed it on to the main Search Engine. The main work for this initial prototype would be in the Search Engine class.

5.2.6.1. READ PAGES OFF WEB SERVER

I have detailed in 9.1.3 below how I had set up the Apache Web Server to run on my PC. I had a small web site on my hard disk consisting of about a hundred linked HTML pages. I set Apache to point at this site and used it to deliver pages off my test site to any program that requested them. This included web browsers and of course my search engine program. Therefore the next step in the prototype was to write the code to read in pages from the web server.

5.2.6.2. MFC INTERNET CLASSES

MFC provides a series of classes to read pages from a web server, these are shown in figure 6. The classes interact in the following manner.
Two member variables are created first, one of type CInternetSession and one of type pointer to CHttpConnection. The CInternetSession class has a member function that takes a server name and port number, and returns a pointer to a CHttpConnection object. This establishes a connection with the web server and any further interactions with the server can be done through the CHttpConnection member variable.
The CHttpConnection class has a member function to retrieve a page off a web server, this returns a pointer to a CHttpFile object. Because the file's text can be read into a string there is no need to make this object into a member variable, it can exist simply as a local variable. However, before the retrieval can be performed a very important step needs to be taken that can be easily overlooked.

5.2.6.3. GOOD MANNERS

The important step is to send some information about the client to the web server. This step is not required by either the client or the server, but it is important none the less, and the reason that is so is because it gives the webmaster (the person who administers the server) some information about the program that is using their web server. The information it gives in the case of my client is simply the name of the client and my email address. This allows a web-master to contact me should my search engine not be behaving correctly on their site. Examples of bad behaviour could be, retrieving pages too fast (this is unfair on other users), getting stuck in a black hole (retrieving the same series of pages over and over in an endless loop) or possibly the webmaster may not want my client searching his or her site. Sending this information is a simple step and only takes a minute yet is often overlooked by web robot and search engine designers.

5.2.6.4. READING THE FILE

The CHttpFile object doesn't actually implement a function to read the file into a string, it's read function is inherited from it's CFile parent class. Before the file can be read a buffer needs to be created to read the file's text into. The buffer needs to be at least as big as the file's size, and as the size of the file is not known it needs to be determined. The CFile class implements a GetSize function but although it can be called from within the CHttpFile class it does not report the correct size of the html file. I then checked through the member functions of the CHttpFile class to see if there is another way to determine the size of the file. It turned out there was a general purpose function called QueryInfo that can be used to request all sorts of information from the server regarding the html file. One of these pieces of information was the size of the file. I used this function and it worked perfectly.

5.2.7. MULTITHREADED CODE

Having successfully read a file in from the server the next step was to scan it for links to other pages. After a little thought on this I decided I would perform this search using a separate thread of execution. I have explained in 9.1.2 below the technical details of writing multithreaded code, now I shall explain how I implemented it.

5.2.7.1. THREADS IN MFC

In MFC code there are two types of threads supported and consequently two ways to create them. The two types are worker threads and user-interface threads. Worker threads are used for background processing with no interaction with any of the main window's controls. A further restriction of worker threads is that the thread function cannot be a member function. With all these restrictions there doesn't seem to be much reason to use them, except for the fact that they are very simple to create. All that is required to create a worker thread is to create a single function and pass the address of it to the create thread function. If the worker thread function needs access to the class from which it was called it is possible to pass in the address of the class' object using the 'this pointer'. This can then be cast back to the correct type from within the threads controlling function.
User-interface threads have full access to the main controls of any window the program uses. They are actually created as a separate class derived from a specific class called CWinThread. User-interface threads are more complex to set up for these reasons but in return offer a greater degree of flexibility to the user. The differences are summarised in table 1 below.

Table 1 - Comparison of MFC Threads

MFC Threads	Interact with MFC windows & controls	Thread function can be a member function	Thread exists in a separate class	Ease of creation
Worker	No	No	No	Simple
User-Interface	Yes	Yes	Yes	Complex

5.2.7.2. CHOICE OF IMPLEMENTATION

I decided to implement the thread as a worker thread. My reasons were simply that unless I needed a specific feature of the more complex user-interface thread I should use a worker thread. I decided I didn't need any of it's features so I used the worker thread model. As I needed to pass data back and forth between the two threads I passed a pointer to the class' object into the worker thread's controlling function. I then cast this to the correct type from within the worker thread so I could call functions in the class from the worker thread. This was a necessary step so I could pass results back to the Search Engine class. I then defined a function in the Search Engine class to accept the results in the form of a text string of every link found and a function for the worker thread to read in the text of the html file to be searched.

5.2.8. PARSING HTML

The way that I coded the worker thread to perform the searches for links was to search out what are known as 'href' tags in HTML. The 'href' works in the following manner:
Frames in HTML code means display the word 'Frames' and if the user clicks the mouse on it, open the file called frame.html. It can also be used to point to pages located on other servers, or to graphic or any other types of files. By searching for all the 'href' tags I was able to build a list of all the links from one page to other pages or files. Figure 7 above gives a graphic overview of how 'href' tags work.
As mentioned already, the links could also be to non-html files so after finding a link I checked it for it's type. HTML files can use any one of a variety of different file extensions, including html, htm, shtml, shtm or asp. So I needed to check for any of these file extensions. Additional to this, I had decided for this initial prototype to only handle searches for one web site at a time. This meant that I also needed to check that the link was pointing to a page that was part of the same site. Having validated a link it was then added to an array within the Search Engine class. This array consisted of a list of links to html files located on the same server.

5.2.9. EVALUATION OF PROTOTYPE ONE

To summarise then; the initial prototype was capable of reading in pages off an Internet or intranet site, scanning those pages for links to further pages, then subsequently reading in those pages and scanning for further links. The searching for words specified by the user was not yet implemented.
I then undertook an evaluation of the first prototype. The following issues raised were either problems or features yet to be implemented:
� Re-design of front-end so that filters were chosen from a drop-down list.
� Program hung (stopped responding) when searching specific sites.
� User interface was not updated while search took place.
� Search results needed to be posted while the search was ongoing.
� A sample filter yet to be implemented.
I decided I could continue to use this prototype as the basis for the second prototype. By this I mean I would not start a new program from scratch but merely modify the existing one.

5.3. PROTOTYPE TWO

5.3.1. INTRODUCTION

The plan for the second prototype was to remedy the problems identified in 5.2.9 above and to further develop the program by implementing a sample filter. I will deal with each of these problems in turn and them move on to other developments in prototype two.

5.3.2. PROBLEMS AND SOLUTIONS

5.3.2.1. PROGRAM HUNG WHEN SEARCHING SPECIFIC SITES

First, I need to say something about my development and testing strategy. To enable fast responses to my web searches and hence speed up development, I had downloaded a number of web sites which I stored on my hard disk and used for searching. I was unwilling to let my program search the web (use other people's web servers) until I had a degree of confidence in it's ability. This meant that I was entirely dependent on my program's ability to search the sites on my hard disk as a barometer of it's success.
When I demonstrated the program on my supervisor's machine we tested a different site and my program failed. At first, I was unable to determine why this was happening. To be able to determine what my program was doing wrong I needed to run it inside the debugger and see what was going on internally within my program. I also needed to test it on a site that was causing it to fail. As it worked fine on my test site this proved difficult. I set about finding more sites I could download in the hope that I could find one that gave me different test results.
When I found a site which caused my program to fail, I set a breakpoint using the debugger and stepped through the code until I located the portion of code where it failed. The problem turned out to be in the code that scans a page for links to other pages. It searches for the string 'href' which refers to a link to a further page then when it has found that it then checks for a following string which will either start with ' or ". However if it could not find any quotes following the 'href' it got stuck in an endless loop. This was simply resolved by an 'else' clause to allow the loop to be exited if no more quotes were found.
Status: Fixed

5.3.2.2. USER INTERFACE WAS NOT UPDATED WHILE SEARCH TOOK PLACE

This means that the screen is not redrawn as the main thread is fully occupied performing the search. This is caused by the fact that the user interface and the search engine were controlled by the same thread. This meant that whenever a search was taking place the main display would accept no inputs and that the user was unable to move or minimise the window while a search took place and was also unable to close the program while it was searching. This was clearly unacceptable, however the solution was not straightforward. It involved having the Search Engine controlled by a separate thread, and would have involved a substantial re-write of the code to implement it. I decided to put this on hold for now and see if any other issues raised themselves, so that when I did a rewrite I would have the potential to fix more than one problem.
Status: On hold (see 5.4.2.1 below)

5.3.2.3. SEARCH RESULTS NEEDED TO BE POSTED WHILE THE SEARCH WAS ONGOING

As I had not yet implemented a filter in the program I was simply returning a Boolean value of true for every page that was passed in for searching. The problem was,. I was only sending values to the list after the search was complete. This turned out to be a less than ideal scenario. However, the fix for it was fairly straightforward, the only difficult part being getting a pointer to the visual control that contained the list. Although MFC provides code to do this, it can be difficult to find the right function at times. MFC is a very large body of code and sometimes there are many similar functions to do similar tasks. Once I had found and coded the correct function it worked fine.
Status: Fixed

5.3.2.4. A SAMPLE FILTER NEEDED TO BE IMPLEMENTED

This was the biggest change within the prototype. When planning the project I had taken a look at the different reusable object technologies available. This comparison can be seen in 9.1.1 below. After consideration of the options I had decided to use ActiveX to implement the plug-in filters. As part of my research I had read through the tutorials and help files included with MSVC. However, when I began the implementation I found a problem. This was the fact that creation of ActiveX controls as implemented within MSVC were specifically used for visual controls or for inclusion within HTML pages. My filter had no need to have any visual interface, it just needed to accept some text and return a value based on the results of a search of that text.

5.3.2.4.1. ACTIVEX FILTERS

I was a little confused on the whole ActiveX subject as I was under the impression I could use it for this task. I decided to seek the advice of a person more knowledgeable on the subject. I went back to my placement company and spoke to one of the senior software developers there. Having explained the problem to him his advice was that ActiveX was a little too cumbersome to use.
He then suggested a simpler way of doing it, using Windows Dynamic Link Libraries (DLL's). These are a type of file that exports functions that other programs can call. As the name suggests they can be loaded dynamically at run-time. This would be ideal for my program as filters can be added after the final program is complete. I had done this sort of thing before and it only took me a few hours to implement a working model of it. This involved a DLL file that supported exported functions required by my program. I still had to implement the working code inside the DLL filter. Having got a working model using DLL files I decided to drop ActiveX from my project. This was a little disappointing as I had seen the chance to learn a little about ActiveX as one of the outcomes of my project but I could not afford to waste several weeks implementing it when there was a far simpler solution already in existence.

5.3.2.4.2. IMPLEMENTING THE FILTER

The next step was to implement the code inside the filter itself. I had decided to have the filter support three exported functions. One to tell the program the name of the filter, one to tell the program what file types the filter supported and one to perform the searching. I also decided to implement the filter in C and not C++. My reason for this was it was a lot simpler to code it in C and I did not need any C++ specific features. Implementation of the first two functions was very simple as the caller passed in a buffer and the filter filled the buffer with text. The most important function was the one to search the HTML page for the required text. This function would accept a string containing the HTML and a second string of the word to be searched for. It would return a number specifying the number of times the word was found within the text. A value of zero would mean the word was not located in the HTML. If the user wished to search for multiple words then the function would be called multiple times, once for each requested word.
I had offered an option to match on exact words only. That means for example, to ensure that the letters immediately before and after the word are not other letters, but are spaces or punctuation. This means that if searching for the text 'darr' and the HTML text contained the word 'darren', then an exact word match would fail but a non-exact would pass. This turned out to be quite difficult to implement but I managed it eventually. The difficulty came from deciding what would be classed as a punctuation or space. There are quite a number of characters that can be in this list. Having implemented and tested the filter I was happy with it's performance.

5.3.2.5. RE-DESIGN OF FRONT-END SO THAT FILTERS WERE CHOSEN FROM A DROP-DOWN LIST

This involved re-designing the front end so that a drop-down list was used and now that I had implemented a filter I needed to populate the list with the filters available. Figure 8 shows the new layout of the user interface. At present, there was only one filter available but I needed to use a drop-down list to give the user the option of choosing different filters when more became available. This meant that I had to determine what filters were available for use before the user could begin the search. All the code to handle the filters was implemented in the Filter Manager class so I provided a function to locate all the filters that could be called from the main dialog class upon initialisation.
The method for locating the filters was to look in the directory the program was running from and find all files with the 'DLL' file extension. Then I loaded each DLL and checked it to see if it exported all three required functions. If it did I read in the name and the file types supported and then added these two strings to two internal arrays. I also added the filename to an internal array. After the directory scan was complete, I then added the filter names to a drop-down list within the main dialog.
I set the main dialog to call a function whenever the selection within the dialog changed (this is quite easy to do using MFC). This function would change the text displayed in the file types dialog. It would load it in from the internal array and change it accordingly. As I had stored the DLL filenames in an array once a search took place the DLL could be located immediately and it's exported search function could be used. The DLL was only loaded once per search which is of course desirable in terms of efficiency.

5.3.3. EVALUATION OF PROTOTYPE TWO

I again undertook an evaluation of the prototype. The outcome of the evaluation saw the following list of points raised:
� User interface was not updated while search took place.
� Program still hanging on certain sites.
� A resizable window should be used.
With these points in mind I moved on to a new prototype.

5.4. PROTOTYPE THREE

5.4.1. INTRODUCTION

The plan for the third prototype was as before with the second, to develop the program and to remedy the problems raised. The two major issues affecting this prototype were to make the window resizable and to update the user interface while a search was taking place. I had mentioned at the start of this section that I had chosen to implement my program based on a class hierarchy that did not use resizable dialogs. For me to use a resizable window I would need to create a new project based on a different hierarchy and then add in my existing code. The second issue was to allow the main window to receive updates while a search was taking place. To do this I would need to rewrite it as a multithreaded program. I decided that it was worth the effort of a rewrite as I could fix two problems.

5.4.2. PROBLEMS AND SOLUTIONS

5.4.2.1. A MAJOR REWRITE REQUIRED

To do the rewrite I created a new project but this time based it on the Single Document Interface (see 5.2.2 above). This meant I could now use a resizable window to display the results. The next step was to implement the Filter Manager class as a separate thread. I had explained earlier about worker threads and user-interface threads and the differences between in 5.2.7.1 above. I decided to implement the Filter Manager class as a user-interface thread as it needed to run as a separate class and only a user-interface thread would allow this. ClassWizard allows the creation of a new class based on a user-interface thread class. The class name I created was based on the MFC class CWinThread. After this I created a new class for the Search Engine too, I decided also to base this on CWinThread also. This was for reasons of consistency mainly. I also based the new thread for scanning pages for links on CWinThread too. This meant that my program would consist of a maximum of four threads during a search. The next step was to copy in the existing code from the old project and adapt it where necessary to fit in with the new classes. This proved reasonably straightforward and I was able to perform a minor clean-up of the code as I adapted it. Figure 9 shows the look of the program after the changes.

5.4.2.2. MULTITHREADED IMPLEMENTATION PROBLEMS

The next step was to implement the code to spawn each thread from it's calling thread and the code to shut down a thread upon completion. This proved difficult to implement for the Filter Manager class. The reason being that I needed to fill the drop-down list box with the choice of filters available, but this needed to be performed by the Filter Manager thread, but then I didn't need this thread again until the user started a search. However, I was storing information on filters available within the Filter Manager class. This meant I needed to start the thread, get the filters available, suspend the thread, then restart it for a new search. But when the search was complete the Filter Manager thread terminated and the next search was expecting to just resume the thread and would fail. After trying a few different methods I decided not to store any information within the Filter Manager class, then I could stop and restart it as needed and no information would be lost. The information would be stored within internal arrays within the main dialog class. So I could then start up the thread to populate the list-box of filters available, then shut it down again, then start a new thread for the search, which would terminate upon completion. This proved to be the simplest and most reliable method of handling the threads.
Status: Fixed

5.4.2.3. PROGRAM STILL HANGING ON CERTAIN SITES

As I still had problems with my program when searching certain sites I decided to do two things to help locate the bugs in my code. The first was to allow my program to search some sites on the Internet, the second was to add some tracing code to determine what was happening within my program. I will deal with each of these in turn.

5.4.2.3.1. SEARCHING THE INTERNET

To allow my program to search the Internet as opposed to my local intranet, I needed to add a small delay to my program to slow it down while it was searching. This is because my program is capable of requesting many pages a minute, enough to cause a web server to spend most of it's time servicing my program and devoting very little time to other users of the site. This is a selfish use of the Internet and would not make me popular with the Webmaster of the site (I was already sending my mail address with every request in case my program was not well behaved). In addition to this, many requests could overload the web server and cause it to crash, this is a known problem with web robots that are badly designed. To counteract this problem I added a ten second delay between every request for a page. To ensure that this delay would not affect my local intranet searching I checked the IP address of the site being searched against the address of the machine running the program. If they matched then I didn't wait ten seconds as I knew I was searching only on my own computer. Having added the delay I felt happy to allow my program to search the Internet and set it to search South Bank University's web server. Performing these searches allowed my program to be tested against a wider variety of different sites and hopefully allow me to locate bugs in the code.

5.4.2.3.2. TRACING CODE

By adding tracing code to my program I was able to leave it to search the Internet and check later for any problems. The tracing code would create a text file and output any text that I felt would help me determine what my program was doing. The tracing code I used was a class I had developed for use in other projects. It was simple to implement and allowed me to build up a log file of sites visited. Using this trace file I was able to locate another bug related to parsing HTML files. The bug I found was that I was not handling frames correctly. Frames are a way for a HTML page to be split up into separate windows where each window is a separate file. The problem is, pages in frames are no longer referenced using HREF (see 5.2.8 above). They use a tag called SRC which works in the same way as HREF. So I needed to adapt my program to search for SRC as well as HREF tags. Although this was easy to implement it managed to create a new problem. The word 'src' is also commonly used as a directory in UNIX systems, my program was seeing src in some HTML files and thinking it was a tag when in fact it was a reference to a file in the src directory.
Ordinarily this would not be a problem as my program always checks the file types of anything the HTML parser picks up. However, if this src tag was near the end of a file and there were no more quotes (' or ") in the file the parser was going into an endless loop. The fix for this was simple, just add an else clause if there were no more quotes in the file. However, I would not have been able to find this bug without the tracing code in place. This shows it usefulness to the project.
Status: Fixed

5.4.3. COMPLETION OF IMPLEMENTATION

Having fixed the final known bugs in the project the implementation phase of the project was complete. Any more ideas on extending or adding features to the project would have to go in the future work section (see 6 below). In a sense, the time to know when to stop a project could be seen as one of the weaknesses of the prototyping paradigm. It is possible to continue adding features to a program and it will never really be complete. Because this project runs to a fixed deadline which cannot be changed, that problem is easily solved. The implementation has to cease even if the program is incomplete in this case.
A possible way around this problem with prototyping would be to agree a certain level of functionality between developer and client and stick to it strictly. Then, if new features are required, a new contract can be drawn up for more work to take place.