INTELLIGENT SEARCH AGENTS

BY DARREN GREAVES




9.1. APPENDIX A - TECHNICAL BACKGROUNDERS
9.1.1. ActiveX and other Component Technologies
9.1.1.1. Component Technology
9.1.1.2. OLE
9.1.1.3. COM
9.1.1.4. DCOM
9.1.1.5. ActiveX
9.1.1.5.1. ActiveX Controls
9.1.1.5.2. ActiveX Containers
9.1.1.5.3. Usage
9.1.2. Multithreaded Processing
9.1.3. Apache Web Server
9.2. APPENDIX B - REFERENCES & BIBLIOGRAPHY
9.2.1. World Wide Web references
9.2.2. Journal References
9.2.3. Bibliography
9.3. APPENDIX C - INDICES
9.3.1. Index of Images
9.3.2. Index of Tables
9.4. APPENDIX D - GLOSSARY

9. APPENDICES
9.1. APPENDIX A - TECHNICAL BACKGROUNDERS

9.1.1. ACTIVEX AND OTHER COMPONENT TECHNOLOGIES
When originally designing the system I decided to use a Component Technology to develop the plug-in filters. In this section I shall first explain what Component Technology is, then which systems are available for the environment I am using (Microsoft Windows and C++), and finally which specific features I will be using for my project.
9.1.1.1. COMPONENT TECHNOLOGY
First, what is a component? A component is an object that can be used to add functionality to an existing program. By object it is meant that it cannot be run by itself; it must be called from an external program. This program needs to be designed to work with components (component-enabled). A common use for components is to add functionality to Internet Web Browsers. In this case components (also known as plug-ins) can be downloaded from the Internet as required to add a new feature or the ability to read a new type of data. However, any program can be component-enabled, and can then work with external objects.
The different systems on offer for the Microsoft Windows platform are JavaBeans and the range of OLE (Object Linking and Embedding) based technologies from Microsoft. JavaBeans are not an option for this project, as all the coding so far is in C++ and it is not straightforward to get the main program to communicate with an object that is coded in Java. This leaves the range of technologies from Microsoft, which are as follows: OLE, COM, DCOM and ActiveX. Figure 10 shows how the technologies interact.

9.1.1.2. OLE
OLE is the base technology from which all the others originally stemmed. It used to contain the basic features of Component Technology, but Microsoft has since split these off as ActiveX, and OLE now relates purely to Compound Documents.
9.1.1.3. COM
COM (Component Object Model) is the underlying model that allows both ActiveX and OLE to work. It provides the basic functionality for objects to communicate by specifying a standard for them to use when communicating. All of the technologies mentioned here use COM implicitly.
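To make the idea of a communication standard more concrete, the following is a minimal, portable sketch of the kind of contract COM defines: every object offers reference counting and a way to ask for another interface. It is an illustration only and not real COM code. The names IUnknownLike, IGreeter, Greeter, CreateComponent and g_liveObjects are hypothetical; real COM identifies interfaces by GUID, reports results as HRESULT codes and is supplied by Windows rather than written by hand like this.

```cpp
#include <cassert>
#include <string>

int g_liveObjects = 0;  // hypothetical counter so object lifetime can be observed

// Simplified stand-in for COM's IUnknown: reference counting plus interface lookup.
struct IUnknownLike {
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    // Returns a counted pointer if the interface is supported, otherwise nullptr.
    virtual IUnknownLike* QueryInterface(const std::string& interfaceName) = 0;
protected:
    virtual ~IUnknownLike() = default;  // objects are destroyed through Release()
};

// Hypothetical interface that a client might ask a component for.
struct IGreeter : IUnknownLike {
    virtual std::string Greet() const = 0;
protected:
    ~IGreeter() override = default;
};

class Greeter final : public IGreeter {
public:
    Greeter() { ++g_liveObjects; }
    unsigned long AddRef() override { return ++refCount_; }
    unsigned long Release() override {
        assert(refCount_ > 0);
        const unsigned long remaining = --refCount_;
        if (remaining == 0) delete this;  // last reference gone
        return remaining;
    }
    IUnknownLike* QueryInterface(const std::string& interfaceName) override {
        if (interfaceName == "IUnknownLike" || interfaceName == "IGreeter") {
            AddRef();  // the caller now owns one more reference
            return this;
        }
        return nullptr;
    }
    std::string Greet() const override { return "hello from a component"; }
private:
    ~Greeter() override { --g_liveObjects; }
    unsigned long refCount_ = 1;  // the creator holds the first reference
};

// Hypothetical factory; the caller must eventually call Release().
IUnknownLike* CreateComponent() { return new Greeter(); }
```

A client would call CreateComponent(), ask for "IGreeter" through QueryInterface, use the returned pointer, and call Release() once for each reference it holds. The client never needs to know the concrete class behind the interface.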
9.1.1.4. DCOM
DCOM (Distributed Component Object Model) is an extension of COM that allows objects to communicate when they are located on different machines. The machines have to be connected by some kind of network, but this network could be anything from a small LAN (Local Area Network) to the Internet.
9.1.1.5. ACTIVEX
ActiveX is the technology that sits on top of COM and DCOM. It is the level at which components are created. It takes the functionality provided by COM and DCOM and presents it to the programmer as a set of APIs.
The main features of ActiveX are as follows:
9.1.1.5.1. ACTIVEX CONTROLS
These are the controls written to be used within ActiveX programs. Typical controls would be an enhanced date control or a spell-checker control. The main point is that a control can be reused within other programs.
9.1.1.5.2. ACTIVEX CONTAINERS
These are the programs that use ActiveX controls. They are designed from the start to host ActiveX controls. The program can make use of a control as long as it is aware of an agreed interface between the container and the control.
9.1.1.5.3. USAGE
The way I intend to use ActiveX is to develop each filter as an ActiveX control and use the main program as an ActiveX container. Each filter control will conform to an agreed standard interface and the container will then communicate with the filter using this interface. This way it will be possible to add newer filters at a later date. This will add value to the program and make it stand out from most other search engines.
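The container and filter arrangement described above can be sketched in plain C++ as follows. This is a simplified illustration under stated assumptions, not the project's ActiveX code: an abstract class stands in for the agreed interface, and the names IFilter, KeywordFilter, MaxLengthFilter and FilterContainer, as well as the idea that a filter accepts or rejects page text, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "agreed standard interface" that every filter implements.
class IFilter {
public:
    virtual ~IFilter() = default;
    virtual std::string Name() const = 0;
    // Returns true if the page text passes this filter.
    virtual bool Accept(const std::string& pageText) const = 0;
};

// Example filter: accepts pages that contain a keyword.
class KeywordFilter : public IFilter {
public:
    explicit KeywordFilter(std::string keyword) : keyword_(std::move(keyword)) {}
    std::string Name() const override { return "keyword:" + keyword_; }
    bool Accept(const std::string& pageText) const override {
        return pageText.find(keyword_) != std::string::npos;
    }
private:
    std::string keyword_;
};

// Example of a filter added at a later date; the container is unchanged.
class MaxLengthFilter : public IFilter {
public:
    explicit MaxLengthFilter(std::size_t maxLength) : maxLength_(maxLength) {}
    std::string Name() const override { return "max-length"; }
    bool Accept(const std::string& pageText) const override {
        return pageText.size() <= maxLength_;
    }
private:
    std::size_t maxLength_;
};

// Stand-in for the main program: it only ever talks to IFilter.
class FilterContainer {
public:
    void Add(std::unique_ptr<IFilter> filter) {
        assert(filter != nullptr);
        filters_.push_back(std::move(filter));
    }
    std::size_t Count() const { return filters_.size(); }
    // A page is accepted only if every installed filter accepts it.
    bool AcceptAll(const std::string& pageText) const {
        for (const auto& filter : filters_) {
            if (!filter->Accept(pageText)) return false;
        }
        return true;
    }
private:
    std::vector<std::unique_ptr<IFilter>> filters_;
};
```

Because FilterContainer depends only on IFilter, a new filter class can be written and installed without changing or recompiling the container logic, which is the property the ActiveX design is meant to provide across separately built components.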
9.1.2. MULTITHREADED PROCESSING
This section presents a brief overview of multithreaded programming.
A thread is a mini-process, or a path of execution through a process. A process can be made up of many threads or just a single thread. A process with a single thread is known as a single-threaded process; a process with multiple threads is known as a multithreaded process.
The relationship between a thread and its process is similar to that between a process and the operating system, except in one crucial area. Threads share, and have full access to, the same global memory within the process; separate processes have no such access to each other's memory. Each thread does, though, have its own private stack, so it has a private copy of its local variables; the shared access applies to data such as global variables and the member variables of shared objects. Because of this, synchronisation between threads is a key issue when they are accessing that shared memory. Fortunately, the MFC framework provides several classes to handle such synchronisation issues.
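The following is a minimal sketch of the synchronisation issue, assuming standard C++ threads rather than the MFC classes so that it stays portable. std::mutex plays the role that an MFC synchronisation class would play, and the function name CountWithThreads is hypothetical. Each thread has its own loop variable on its private stack, while the counter is shared and so must be locked.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Starts threadCount threads that each add 1 to a shared counter
// incrementsPerThread times, and returns the final value.
long CountWithThreads(int threadCount, int incrementsPerThread) {
    assert(threadCount >= 0 && incrementsPerThread >= 0);
    long sharedCounter = 0;   // one copy, visible to every thread
    std::mutex counterGuard;  // protects sharedCounter
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&sharedCounter, &counterGuard, incrementsPerThread] {
            // i lives on this thread's private stack, so it needs no locking.
            for (int i = 0; i < incrementsPerThread; ++i) {
                std::lock_guard<std::mutex> lock(counterGuard);
                ++sharedCounter;  // only one thread at a time reaches this line
            }
        });
    }
    for (auto& worker : workers) worker.join();  // wait for every thread to finish
    return sharedCounter;
}
```

Without the lock, two threads could read the same old value of the counter and both write back the same new value, so increments would be lost and the total would vary from run to run.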
In any event synchronisation is not an issue here, as the threads I am using don't access any shared memory. Running a process in a multithreaded manner provides a reasonable performance boost and is relatively simple to code. If I had not written any multithreaded code before I might have been tempted not to start now, as running the process single-threaded would not have been a major performance drain. However, having written multithreaded code on previous occasions, I have found it to be at least as simple in practice as it is in theory and I have no qualms about producing code in this way.
It is always important to weigh the gains of writing code in a certain way against the time and difficulty involved in producing it. On this occasion I have found that the benefits outweigh the effort involved.
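As an illustration of threads that need no synchronisation, the sketch below gives each thread its own input and its own output slot, so no two threads ever touch the same data. It is an assumption about the general pattern and not the project's code: the function names CountOccurrences and CountLinksPerPage, and the choice of counting "href=" in page text, are hypothetical, and standard C++ threads are used instead of MFC.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Counts non-overlapping occurrences of needle in text.
std::size_t CountOccurrences(const std::string& text, const std::string& needle) {
    assert(!needle.empty());
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Starts one thread per page. Thread i reads only pages[i] and writes only
// results[i], so the threads never access the same element and need no lock.
std::vector<std::size_t> CountLinksPerPage(const std::vector<std::string>& pages) {
    std::vector<std::size_t> results(pages.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        workers.emplace_back([&pages, &results, i] {
            results[i] = CountOccurrences(pages[i], "href=");
        });
    }
    for (auto& worker : workers) worker.join();  // results are complete after this
    return results;
}
```

The only coordination required is the final join, which guarantees that every result has been written before the caller reads it.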
9.1.3. APACHE WEB SERVER
It was a requirement of this project to be able to read web pages from an Intranet for processing. To fulfil this requirement a web server was needed.
The major requirements were that it was a free product and that it would run under Windows NT. At the outset of the project a free Web Server was available from Microsoft, but only for Windows 95. I had heard that an NT version would become available but I did not know how long that would take, and I was worried that I would not be able to get a server for the project.
Then I found out about the Apache Web Server, which had just been ported to the Windows platform. I downloaded it, and after setting it up it has worked well and has been an ideal addition to my project.
9.2. APPENDIX B - REFERENCES & BIBLIOGRAPHY
9.2.1. WORLD WIDE WEB REFERENCES
Reference  URL                                                                          Date Accessed
[KAH WWW]  http://www.archive.org/sciam_article.html                                    08/05/1998
[AUT WWW]  http://www.agentware.com                                                     15/05/1998
[HAR WWW]  http://harvest.transarc.com                                                  10/05/1998
[HA2 WWW]  http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/papers.html  16/05/1998
[GLI WWW]  http://glimpse.cs.arizona.edu                                                16/05/1998
[ROB WWW]  http://www.searchenginewatch.com/webmasters/spiderchart.html                 16/05/1998
[ARP WWW]  http://www-ksl.stanford.edu/knowledge-sharing/index.html                     14/05/1998
[KQM WWW]  http://www.cs.umbc.edu/kqml                                                  16/05/1998
[SHO WWW]  http://www-csli.stanford.edu/csli/9495reps/interface9495-shoham.html         16/05/1998
[ABE WWW]  http://www.networking.ibm.com/iag/iagsoft.htm                                16/05/1998
[AGL WWW]  http://www.trl.ibm.co.jp/aglets                                              16/05/1998
[MAE WWW]  http://pattie.www.media.mit.edu/people/pattie/SciAm-95.html                  16/05/1998
[WOR WWW]  http://www.cogsci.princeton.edu/~wn                                          14/05/1998
[VIR WWW]  http://www.virage.com                                                        14/05/1998

9.2.2. JOURNAL REFERENCES
Reference  Journal  Page No.
[MAES94]   P Maes, "Agents that reduce Work and Information Overload", CACM July 94/Vol.37, No.7, pp 31-40.  8
[GUP 97]   A Gupta and R Jain, "Visual Information Retrieval", CACM May 97/Vol.40, No.5, pp 71-79.  8

9.2.3. BIBLIOGRAPHY
This list refers to general literature I read that gave me background knowledge on this subject.
"Scalable Internet Resource Discovery - Research Problems and Approaches" Bowman, Danzig, Manber, Schwartz. CACM July 94/Vol.37, No.7
Study by the authors of Harvest
"The World Wide Web: Quagmire or Gold Mine?" Etzioni. CACM Nov 96/Vol.39, No.11
Considers effectiveness of data-mining on the Web
"Software Agents" Genesereth, Ketchpel. CACM July 94/Vol.37, No.7
Considers what needs to be done to make software agents viable
"The World-Wide Web" Berners-Lee, Cailliau, Luotonen, Nielsen, Secret. CACM July 94/Vol.37, No.7
Gives an overview of how the web works, useful for development of program
http://www.searchenginewatch.com
Presents a useful resource on the popular Search Engines
http://www.byte.com/art/9706/sec6/art4.htm
A review of Intranet Search Engines

9.3. APPENDIX C - INDICES
9.3.1. INDEX OF IMAGES
Figure 1 - Estimated Size of WWW
Figure 2 - Design One
Figure 3 - Design Two
Figure 4 - The MFC hierarchy
Figure 5 - Prototype One
Figure 6 - The MFC Internet Classes
Figure 7 - How HREF tags work
Figure 8 - Prototype Two
Figure 9 - Prototype Three
Figure 10 - ActiveX Technologies

9.3.2. INDEX OF TABLES
Table 1 - Comparison of MFC Threads
9.4. APPENDIX D - GLOSSARY
MFC
Microsoft Foundation Classes - A framework for developing Windows Applications

MSVC
Microsoft Visual C++ - The development environment I used for the program creation

URL
Uniform Resource Locator - The address of a resource, such as a web page, on the Internet

'this' pointer
A pointer to the current object of a class in C++. It is mostly used implicitly but is sometimes required explicitly, for example when passing the object as a parameter

Common Gateway Interface (CGI) scripts
A standard method by which a web server runs programs that exist on the server, typically in response to a request from an HTML page, and returns their output to the browser