INTELLIGENT SEARCH AGENTS

BY DARREN GREAVES



2. INTRODUCTION
2.1. OVERVIEW OF PROBLEM DOMAIN.
2.2. POSSIBLE SOLUTIONS
2.2.1. Web Browsing
2.2.2. Search Engines
2.2.3. Intelligent Search Agents
2.2.4. Categories Of Agents
2.2.4.1. Mobile Agents
2.2.4.2. Non-Mobile Agents
2.3. MY SOLUTION TO THE PROBLEM

2. INTRODUCTION
2.1. OVERVIEW OF PROBLEM DOMAIN
In the paper "Archiving the Internet" [KAH WWW], published in April 1996, Brewster Kahle (inventor of the Wide Area Information Service, WAIS, and founder of the Internet Archive) stated that current estimates of the size of the World Wide Web (WWW) were around 50 million pages (see Figure 1 - Estimated Size of WWW). Furthermore, this figure was estimated to be doubling every six months, which means that at the time of this report (May 1998) there could be as many as 800 million pages on the WWW. An even more surprising statistic was that the average life of a web page was only 75 days. Add to this the fact that the range of topics covered by different sites is almost as broad as the range of human interests on the planet.

So it can be seen that there is a vast resource of published information covering almost every conceivable subject. People can access this resource using just a computer, a modem and a phone line. The only problem is that there is no central repository of sites, no address book or directory listing. The problem facing a user, then, is how to sift through the massive resources available to find information relevant to their query. This project will look at the solutions available and implement a simple solution of its own.
The research component will survey the range of projects involving intelligent agents. I intend to draw some conclusions from this work and to give a picture of the current status of the intelligent agent paradigm.
2.2. POSSIBLE SOLUTIONS
2.2.1. WEB BROWSING
There are a number of solutions available to the user who wants to find information on a particular subject. One is simply to browse the web at random, clicking on any links that seem interesting and hoping to find something relevant. Although an enjoyable way to spend an evening, this is unlikely to yield much success; how much depends, of course, on the subject matter, since the more popular the subject the more likely you are to stumble across something. This method, then, is very ineffective for finding information, especially if you are in a hurry.
2.2.2. SEARCH ENGINES
Because of the size of the Internet, other solutions have evolved for finding information. The most common at the moment is to use one of the popular mainstream Internet Search Engines. These are fast, powerful servers connected to the Internet that deal with many thousands of queries per day. They hold a large database of sites, perform text string searches on user queries, and present the user with links to sites that match. The main problem with these sites is that the information in the database is often out of date: updates take place at the discretion of the people running the site, commonly on a six-weekly basis or much longer. Given that the estimated average life of a web page is only 75 days, it can be seen that these databases easily fall out of date. Other problems are that their pages tend to bombard the user with advertising images, they sometimes give incorrect results, and they almost always report pages that are out of context with what the user was originally searching for.
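The kind of text string matching such a database performs can be sketched in a few lines. The pages, URLs and function below are invented for illustration; real engines index their data far more efficiently than this linear scan.

```python
# Minimal sketch of a search engine's text-string matching over a
# stored database of pages. All page data here is invented.

pages = {
    "http://example.com/a": "trout and salmon fly-fishing tips",
    "http://example.com/b": "salmon recipes for the family",
    "http://example.com/c": "mountain biking trails",
}

def search(query, index):
    """Return the URLs of pages whose text contains every query word."""
    words = query.lower().split()
    return [url for url, text in index.items()
            if all(w in text.lower() for w in words)]

print(search("salmon", pages))
```

Note that a match is purely lexical: a page matches if the strings occur in it, regardless of context, which is exactly why engines report pages that are out of context with the user's intent.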
So, although it can be seen that Internet Search Engines represent a flawed solution to the problem of locating information on the Internet, they are at present the simplest way to perform such searches and hence the most popular.
2.2.3. INTELLIGENT SEARCH AGENTS
A newer and less common solution to the problem is what have come to be known as intelligent search agents (ISAs). The name itself needs a little clarification:
"intelligent" - So called because they are designed to be capable of not simply matching the exact words asked for, but of matching words with similar meanings, and ensuring words are in their correct context.
"search" - They are still performing the same fundamental task as regular search engines, that of retrieving pages from the Internet and parsing them for information.
"agents" - The concept of an agent is that of something acting on behalf of someone else. The intention behind this word is then that a user would give the agent a set of instructions and the agent would then carry them out on behalf of the user and report back the results at a later stage.
So, in an example usage for an ISA, a user would ask an agent to find information relating to a particular subject, fly-fishing for example, and the agent would then perform searches of live Internet web pages and prepare a concise summarised report of what information is available on the World Wide Web. This search and report would take about a day to produce, so the user would make the query, go about their business and collect the results at a later stage. To many people this sounds less desirable than an Internet Search Engine, which at least gives quick results; most people don't have a day to wait between searches. It is true that, for quick results, nothing will beat a regular Search Engine, but agent searching has other advantages. For example, a user could tell an agent to gather all the latest information on a range of subjects and present it every morning as a daily personalised news service. It could be presented exactly how the user wants it, and the agent could learn what the user did and didn't like and improve based upon this. As the agent paradigm grows we should see more varied uses for this new technology.
2.2.4. CATEGORIES OF AGENTS
Agents can be separated into two categories according to the basic way they work. This is a brief summary of each type.
2.2.4.1. MOBILE AGENTS
Mobile agents are small self-contained objects that can move from one server to another on the Internet while retaining their code and variable state. For the code to execute, the destination server needs to be running a daemon (a small background process) that is compatible with that object; an object can therefore only move to a site that is already configured to receive that type of mobile agent. Mobile agents probably represent the cutting edge of agent technology, but until many more web servers are capable of hosting them they will remain at the cutting edge.
2.2.4.2. NON-MOBILE AGENTS
Non-mobile agents could also be termed server-side agent technology: the agent code runs on a server connected to the Internet and simply reads web pages onto the server for processing. This type of agent is more prevalent than the mobile agent described in 2.2.4.1, probably because it is easier to develop, and also because it does not rely on other web servers running agent-related code. A common arrangement is for clients to run an agent program that passes the query to the server, which does all the processing and passes the results back to the client. Off-the-shelf agent software that works in this way can be purchased; Agentware from Autonomy [AUT WWW] is one such product.
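The client/server split described above can be sketched as follows. The function names and the canned page set are invented, and the network call between client and server is reduced to a direct function call purely for illustration:

```python
# Sketch of the non-mobile (server-side) arrangement: the client only
# forwards a query and displays results; all fetching and matching
# happens on the server. Page data is invented.

PAGES = {
    "http://example.com/fly": "fly-fishing for beginners",
    "http://example.com/sea": "deep sea fishing charters",
}

def server_handle(query):
    """Server side: read pages and return the URLs that match."""
    return [url for url, text in PAGES.items() if query in text]

def client_search(query):
    """Client side: pass the query on and present the results."""
    return server_handle(query)  # in practice this would be a network call

print(client_search("fishing"))
```

The point of the design is that the client stays thin: it needs no crawling or parsing code of its own, and no other web server needs to run agent-related software.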
2.3. MY SOLUTION TO THE PROBLEM
The implementation part of this project will be to design and build a non-mobile Internet/Intranet search engine. The main features of the program will be the following:
Perform searches of web-based content, either on the Internet or an Intranet.
Present the matched pages to the user in a ranked list.
Matched pages can be viewed in a web-browser by simply double-clicking an entry in the list.
Provide the ability to add plug-in filters to perform different types of searches.
The program will have one filter included that can perform text based queries.
Adding a filter will require no changes to the original code; the program will automatically detect and use any filters that exist in the program directory.
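The plug-in filter idea in the feature list could take roughly the following shape. This is a hypothetical sketch, not the program's actual design: the class names are invented, and subclass discovery stands in for scanning the program directory for filter modules.

```python
# Sketch of plug-in filters behind a common interface, discovered
# automatically so that adding a filter changes no core code.
# All names here are illustrative, not the project's real ones.

class Filter:
    """Base class; each plug-in filter decides if a page matches a query."""
    def matches(self, page_text: str, query: str) -> bool:
        raise NotImplementedError

class TextFilter(Filter):
    """The one filter the project includes: a plain text match."""
    def matches(self, page_text, query):
        return query.lower() in page_text.lower()

def discover_filters():
    """Instantiate every Filter subclass currently defined (a stand-in
    for detecting filter files in the program directory)."""
    return [cls() for cls in Filter.__subclasses__()]

filters = discover_filters()
print([type(f).__name__ for f in filters])  # prints ['TextFilter']
```

A synonym or graphic filter would then be just another subclass dropped in alongside the text filter, with the core program unchanged.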
The program will be primarily designed and tested on an Intranet, but tests on Internet sites will also be conducted. It will search through a series of pages from a base URL by traversing the links on each page and following them for more links. As each page is found it will be passed internally to a series of plug-in filters. The included filter will perform a simple text-based match of each page's contents against the text the user is searching for. The program will be designed to accept additional filters to perform different types of searches, such as a synonym or graphic filter. Example filters will be considered in the future work section (6.4 below).
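In outline, the planned crawl (follow links from a base URL, pass each page to a text filter) might look like the sketch below. The in-memory site and the fetch() function stand in for real HTTP requests, and the single hard-wired text match stands in for the plug-in filter chain.

```python
# Sketch of the planned crawl: breadth-first link traversal from a
# base URL, with each fetched page checked by a simple text filter.
# The SITE dictionary is an invented stand-in for live web pages.
from collections import deque
from html.parser import HTMLParser

SITE = {
    "/index.html": '<a href="/fish.html">fish</a><a href="/bikes.html">bikes</a>',
    "/fish.html":  'All about fly-fishing. <a href="/index.html">home</a>',
    "/bikes.html": 'Mountain bikes.',
}

def fetch(url):
    """Stand-in for an HTTP GET of the page at url."""
    return SITE[url]

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(base, query):
    """Return the URLs reachable from base whose text matches query."""
    seen, queue, hits = {base}, deque([base]), []
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if query.lower() in page.lower():   # the simple text filter
            hits.append(url)
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            if link not in seen:            # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return hits

print(crawl("/index.html", "fly-fishing"))  # prints ['/fish.html']
```

The seen set is what keeps the traversal from looping forever on sites whose pages link back to each other, as /fish.html links back to /index.html here.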