INTELLIGENT SEARCH AGENTS

BY DARREN GREAVES



6. FUTURE WORK
6.1. INTRODUCTION
6.2. ADDITIONAL FEATURES
6.2.1. Off-line database
6.2.2. Web Interface To Program
6.3. FEATURES THAT SHOULD HAVE BEEN IMPLEMENTED
6.3.1. Search Multiple Sites
6.3.2. Improved Separation of Different Parts of Program
6.3.3. ActiveX Filters
6.3.4. Known Bugs
6.3.4.1. Release Build Crashing
6.4. IDEAS FOR SEARCH FILTERS
6.4.1. Introduction
6.4.2. Synonym Filter
6.4.3. Graphic Filter

6. FUTURE WORK
6.1. INTRODUCTION
This section details the features I wanted to add to the project, or that occurred to me after development had ceased, but that could not be implemented for one reason or another. For each, I explain why I wanted to implement it and why I did not. I also include a small section on ideas for additional search filters.
6.2. ADDITIONAL FEATURES
6.2.1. OFF-LINE DATABASE
This was a suggestion from my project supervisor: store all the sites searched in a cache on the local disk. The motivation is an office environment, for example. Searches could take place at night using either cheap telephone access or an ISDN line that was otherwise idle; the sites searched could then be accessed the following day on a local intranet, giving everyone fast and free access to the information that had been searched for. The program could also easily generate an HTML index to all the files that were matched and publish it as part of the intranet.
To implement this feature I would first need to adapt the program to make a copy of each file that it matched, which would be simple to do. One thing to bear in mind is whether to also save graphic files that are referenced from the matched HTML pages; sometimes these graphics are relevant to the topic and sometimes not. Generating an HTML index file would also be a relatively simple job. My reason for not doing this was simply that I ran out of time.
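As a rough sketch of the index-generation step, the following could write an HTML index of the cached pages. The MatchedPage structure and its fields are my own inventions for illustration and are not part of the current program:

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical record of a page the Search Engine matched and cached.
    struct MatchedPage {
        std::string originalUrl;   // where the page came from
        std::string localFile;     // name of the cached copy on disk
        std::string title;         // title extracted by the HTML Parser
    };

    // Write a simple HTML index of all cached pages, suitable for
    // publishing on a local intranet.
    void WriteIndex(const std::vector<MatchedPage>& pages,
                    const std::string& indexFile)
    {
        std::ofstream out(indexFile.c_str());
        out << "<html><head><title>Search results</title></head><body>\n"
            << "<h1>Cached search results</h1>\n<ul>\n";
        for (std::size_t i = 0; i < pages.size(); ++i) {
            out << "<li><a href=\"" << pages[i].localFile << "\">"
                << pages[i].title << "</a> (originally "
                << pages[i].originalUrl << ")</li>\n";
        }
        out << "</ul>\n</body></html>\n";
    }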
6.2.2. WEB INTERFACE TO PROGRAM
This was an idea I had to let the program run on a server with an Internet connection, accepting search requests from anyone via a web browser interface. The advantages are that it becomes effectively multi-platform, since any computer can now make use of it, and that it can run on a machine with a fast Internet connection, so people with a slow dial-up connection would still be able to use it.
The changes required would be to build a web interface using an HTML form, and to add a Common Gateway Interface (CGI) script allowing the server to launch the program and pass the parameters to it on the command line. The program would also have to produce an HTML index file of its results. The address of this index could even be emailed to the user, so they would not have to wait online for the search to complete. This idea occurred to me after development of the program had ceased, so there was no opportunity to implement it.
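Had there been time, the CGI side could have been as small as the sketch below. It assumes the search program accepts the query and an output file name on its command line (both assumptions, since the current program does not do this), and a real version would also need to decode the form data properly:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // Minimal CGI program: the web server places the HTML form data in
    // the QUERY_STRING environment variable, and we pass it on to the
    // search program's (assumed) command-line interface.
    int main()
    {
        const char* query = std::getenv("QUERY_STRING");
        if (query == 0) query = "";

        // Assumed interface: search.exe <query> <output file>.
        std::string cmd = std::string("search.exe \"") + query
                        + "\" results.html";
        std::system(cmd.c_str());

        // Tell the browser where the results will appear.
        std::printf("Content-Type: text/html\r\n\r\n");
        std::printf("<html><body>Search started; results will appear at "
                    "<a href=\"results.html\">results.html</a>.</body></html>\n");
        return 0;
    }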
6.3. FEATURES THAT SHOULD HAVE BEEN IMPLEMENTED
6.3.1. SEARCH MULTIPLE SITES
At present, the program can only search pages on the site passed to it by the user. It was implemented this way to simplify the Search Engine: the current engine simply does not follow any links that refer to a different site. The changes necessary to allow the program to search multiple sites would be, first, to allow those links to be followed, and then to create a new connection for every new server searched. At present the connection is handled through a single member variable pointer; it would probably be necessary to replace this with an array or list of connections, together with a way of keeping track of which connection referred to which server. An alternative would be to spawn a new Search Engine thread for each different server, although a check would need to be kept on how many threads were spawned, as too many threads can slow a program down.
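One way to keep track of which connection referred to which server would be a map keyed on host name. A minimal sketch, with a stand-in for the program's existing connection class:

    #include <map>
    #include <string>

    // Stand-in for the program's existing connection class.
    struct Connection {
        std::string host;
        explicit Connection(const std::string& h) : host(h) {}
    };

    // Pool of open connections, one per server, keyed on host name.
    class ConnectionPool {
    public:
        // Return the connection for 'host', creating it on first use.
        Connection* Get(const std::string& host)
        {
            std::map<std::string, Connection*>::iterator it =
                m_connections.find(host);
            if (it != m_connections.end())
                return it->second;          // reuse existing connection
            Connection* c = new Connection(host);
            m_connections[host] = c;        // remember it for next time
            return c;
        }
    private:
        std::map<std::string, Connection*> m_connections;
    };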
I believe implementing this would be a major piece of work. But, as it stands, being able to search only one site is not very useful, so it would be a necessary step in improving the program. The other issue with searching different sites is that the program would never stop searching: each site references more sites, and each of those references more again, in an ever-increasing list. A solution would be to let the user impose a limit on matched pages, or simply a limit on the number of sites to search. Another option would be to let the user restrict the search to a single domain (e.g. .com or .ac.uk), although even that would take far too long to search fully. These are the sorts of things that need to be taken into consideration when looking at a problem like searching the Internet, because of its sheer size. My reason for not implementing this was simply that I did not have time.
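The domain restriction at least would be cheap to build, since it only needs a suffix test on the host name. A small illustrative helper (hypothetical, not part of the current program):

    #include <string>

    // Return true if the host falls within the domain the user chose,
    // e.g. WithinDomain("www.brighton.ac.uk", ".ac.uk") is true.
    bool WithinDomain(const std::string& host, const std::string& domain)
    {
        if (host.length() < domain.length())
            return false;
        return host.compare(host.length() - domain.length(),
                            domain.length(), domain) == 0;
    }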
6.3.2. IMPROVED SEPARATION OF DIFFERENT PARTS OF PROGRAM
This refers to an improved separation of each part of the program (Search Engine, Filter Manager, HTML Parser and front end) to allow it to be distributed over a network. For this to take place, a Windows component technology would have to be used; Microsoft's Distributed Component Object Model (DCOM) would be ideal. This would require a fairly major rewrite, although having already implemented the different sections as separate threads of execution within separate classes obviously helps.
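As a rough illustration of the shape such a component might take (the names are mine; a real DCOM interface would be declared in IDL and derive from IUnknown so it could be marshalled between machines):

    // Illustrative component interface for the Search Engine. A real
    // DCOM interface would be defined in IDL and derive from IUnknown.
    class ISearchEngine {
    public:
        virtual ~ISearchEngine() {}
        // Begin searching the given site with the given query.
        virtual void StartSearch(const char* site, const char* query) = 0;
        // Ask the engine to stop as soon as it conveniently can.
        virtual void Cancel() = 0;
        // True once the search has finished (or been cancelled).
        virtual bool IsFinished() const = 0;
    };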
The advantage would be the ability to scale the program up to handle searching of larger sites and multiple sites. The program could, perhaps, spawn a new instance of the Search Engine to search a different site (see 6.3.1 above) on a PC that was not being used at that particular time. This would be useful in an after-hours office environment, where many machines are left idle for hours, and would allow the program to search many more sites at a time (provided there was an Internet connection with enough bandwidth). The problem with this sort of approach is that algorithms for determining which machines are idle, and for partitioning jobs between them, are still not very mature; a study of this area could probably form a final-year project in itself. This idea occurred to me after development of the project had ceased, so there was no time to implement it.
6.3.3. ACTIVEX FILTERS
In section 5.3.2.4 above I put forward reasons for deciding not to use ActiveX filters for this project. However, as the project progressed I came to realise that there would be a way to implement them after all. My main reason for not using ActiveX was that it required a visual control, and I had stated that I had no need for one. At present, though, the input options that let the user enter text and choose between exact words only and all or any words (see 5.4.2.1 above for a screenshot) really ought to be filter-specific, yet they are implemented as part of the user interface code.
If I were to implement each filter as a visual ActiveX control, with these options included as part of the control, it would be more logical and would make it simpler to add new filters. For example, a filter to search for graphic files would need none of the current text-searching options, but would instead present a different set of options to the user. I feel that implementing the filters using ActiveX would have been the correct way to do it, but it would also have been a steeper learning curve; as it stands, the current matching options work fine for any text-based filter anyway. This idea occurred to me after I had implemented the filter system using DLLs (see 5.3.2.4.1 above), and there was no time to go back and redo it. There was still a steep learning curve with ActiveX, and I could not be sure I could get it to work. Within the scope of a small, time-limited project such as this one it is important to weigh an ideal implementation against a non-ideal one that already works; the working implementation will usually win.
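For comparison, the current DLL approach exposes a filter through a plain exported function. A sketch along these lines, where the name and signature are illustrative rather than the project's actual interface:

    #include <string.h>

    // filter.cpp - built as a DLL and dropped into the program
    // directory. The export below is illustrative only.
    extern "C" __declspec(dllexport)
    bool Match(const char* pageText, const char* searchTerms)
    {
        // A real filter would honour the exact/all/any word options;
        // this sketch just looks for the terms as a literal substring.
        return strstr(pageText, searchTerms) != 0;
    }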
6.3.4. KNOWN BUGS
This section details the known bugs in the system that were uncovered after development had ceased or that I was unable to fix.
6.3.4.1. RELEASE BUILD CRASHING
When developing software using MSVC it is common to work with what is called a debug build. This compiles the program with additional information that lets the debugger match the running program with its source code, making it possible to step through the program line by line (an essential strategy during development). The debug build also has other features that help a developer locate memory leaks and buffer overruns. A debug build is typically four times the size of a release build, so when a program is complete a release build is created which does not contain any of the debug code.
There are situations where the debug build works fine but the release build fails, and I was unfortunate in that my program suffered one of these problems. I was able to locate the source of the problem but unable to fix it. The failure occurred where the main program called the function exported by the filter: as soon as the function executed, the program simply ceased running. It did not crash or give an error; it simply acted as if it had been shut down. I spent many hours trying to determine the cause, without success, and so had to submit the debug build of my program instead, which I was not happy about, but I had no choice.
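I never did find the cause, but one well-known source of exactly this symptom is a mismatch between the function pointer type used on the calling side and the calling convention the DLL was actually built with. A sketch of the loading side with the convention stated explicitly (the Match export is the illustrative one from 6.3.3, not the project's real interface):

    #include <windows.h>

    // The typedef must match the filter function's real declaration
    // exactly, including the calling convention: a silent mismatch
    // here corrupts the stack and can kill a release build with no
    // error message at all.
    typedef bool (__cdecl *MatchFn)(const char*, const char*);

    MatchFn LoadFilter(const char* dllName)
    {
        HMODULE dll = LoadLibraryA(dllName);
        if (dll == NULL)
            return NULL;
        return (MatchFn)GetProcAddress(dll, "Match");
    }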
6.4. IDEAS FOR SEARCH FILTERS
6.4.1. INTRODUCTION
Within the scope of the project I have implemented only a single filter, which can be used for simple text searches. However, I have implemented a system for adding new filters simply by dropping the filter file into the program directory: the search program automatically recognises a new filter and begins using it. In this section I consider the types of filter that could be added to the system.
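A sketch of how that discovery step can work on Windows, scanning the program directory for DLLs and loading each one (illustrative; the program's actual loading code may differ in detail):

    #include <windows.h>
    #include <string>
    #include <vector>

    // Scan a directory for filter DLLs and load each one found.
    std::vector<HMODULE> LoadFilters(const std::string& dir)
    {
        std::vector<HMODULE> filters;
        WIN32_FIND_DATAA find;
        HANDLE h = FindFirstFileA((dir + "\\*.dll").c_str(), &find);
        if (h == INVALID_HANDLE_VALUE)
            return filters;                 // no filters present
        do {
            std::string path = dir + "\\" + find.cFileName;
            HMODULE dll = LoadLibraryA(path.c_str());
            if (dll != NULL)
                filters.push_back(dll);     // remember loaded filter
        } while (FindNextFileA(h, &find));
        FindClose(h);
        return filters;
    }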
6.4.2. SYNONYM FILTER
A synonym filter could match on text as the text filter does, but also match on similar words. For this to be implemented, a database of synonyms would need to be available. The Cognitive Science Laboratory at Princeton University (NJ) has created such a database, called WordNet [WOR WWW]. It is available for download, is free for use in any type of project, and provides a C-callable API for passing in a word and getting a list of similar words back. There is an online version too, but it is only searchable via the web. WordNet is relatively large, at 15MB to download and about 30MB once installed, so it would not be convenient for every user to keep a copy on their hard disk. It would be relatively simple, though, to provide a networked API to it that could be called over TCP/IP; a single copy could then exist on a network (or even the Internet) for any search agent client to use.
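A sketch of what the client side of such a networked API could look like, using Winsock; the protocol (send a word plus a newline, read back a comma-separated list of synonyms) and the server itself are inventions for illustration:

    #include <winsock2.h>
    #include <string.h>
    #include <string>
    #pragma comment(lib, "ws2_32.lib")

    // Ask a (hypothetical) synonym server for words similar to 'word'.
    std::string QuerySynonyms(const char* host, unsigned short port,
                              const std::string& word)
    {
        std::string result;
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
            return result;

        hostent* he = gethostbyname(host);
        SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
        if (he != 0 && s != INVALID_SOCKET) {
            sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            addr.sin_addr = *(in_addr*)he->h_addr;
            if (connect(s, (sockaddr*)&addr, sizeof(addr)) == 0) {
                std::string request = word + "\n";
                send(s, request.c_str(), (int)request.length(), 0);
                char buf[512];
                int n;
                while ((n = recv(s, buf, sizeof(buf), 0)) > 0)
                    result.append(buf, n);  // accumulate the reply
            }
        }
        if (s != INVALID_SOCKET)
            closesocket(s);
        WSACleanup();
        return result;  // e.g. "car,automobile,motorcar"
    }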
6.4.3. GRAPHIC FILTER
Graphic filtering and searching is a relatively new area, but some research has been published: Visual Information Retrieval [GUP 97] is a paper presenting an overview of what is being done in the field. A company called Virage [VIR WWW] has created a visual information retrieval engine that can perform searches of images; their web site provides a demonstration of its abilities, and it is quite effective. Although this engine is a commercial product, it shows what is possible in this relatively new area. For this to work with my Search Engine, ActiveX filters would need to be used (see 6.3.3 above) to allow the user to select the different range of options necessary for graphic searching.