THTTPSCAN

current version 2.2b    Sept 24, 2000

 

Description - Features

Registration

Disclaimer

Package Install

Methods and Functions

Start
Stop

Properties

Agent
ConcurrentDownloads
DephtSearchLevel
HttpPort
FileOfResult
LeavesFirst
Password
Referer
Retries
ReUseCache
SeekRobotsTxt
StartingUrl
StayOnSite
TimeOut
UserName
Working

Events

OnError
OnLog
OnLinkFound
OnPageReceived
OnUpdatedStats
OnWorking

 

 

HTTPSCAN PACKAGE INSTALL

1.  If an old THttpScan package is already installed, FIRST REMOVE IT:

- Components | Install Packages
- Click on "Michel Fornengo's Components"
- click "Remove"
- click "Yes"
- click "Ok".

2.  Install the current package :

- unzip the archive in a folder of your choice (e.g. c:\httpscan)

- copy the following files to delphi5\Imports (or delphi4\imports) :
httpscan.bpl
httpscan.dcp
httpscan.dcu
MF*.dcu (all the .dcu files beginning with MF)

- run Delphi

- select Component | Install packages

- press the "Add" button

- locate the httpscan.bpl file in the Imports directory and select it

- select Open

- select Ok

- check the mfornengo tab on the right of the component palette. The httpscan component should have been added.

3.  If you have created a project with a previous release of THttpScan :

To make Delphi recognize the new parameters in events, proceed as follows :

- cut and save the code of the THttpScan events that have new parameters (OnLinkFound and OnPageReceived)
- remove the THttpScan component from your project

- put a new THttpScan component in your project

- open the Object Inspector for the THttpScan component

- double click on the events whose parameters are unchanged. They will be reattached to their existing code.

- double click on the events with new parameters to create the empty procedures, then paste the saved code.

 

methods and functions

function Start : boolean    (1st syntax)
starts downloading and processing the URL set in the StartingUrl property, which must have been set beforehand.

function Start (StartingUrl_ : string) : boolean   (2nd syntax)
starts downloading and processing the URL set in the StartingUrl_ parameter passed to the function.

 

procedure Stop : stops all HttpScan processes currently running. Must be called before closing the Form. The Form can be closed once the OnWorking event has occurred with the value false, or once the Working property returns false.
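
The calls above might be wired into a form as in the following minimal sketch. Form1, HttpScan1, StartButton and the handler names are assumptions made for the illustration, and the sketch assumes that Start returns false when the scan could not be started, which is not stated in this document.

```pascal
// Sketch only: Form1, HttpScan1 and the handler names are hypothetical.
// ShowMessage comes from the Dialogs unit.

procedure TForm1.StartButtonClick(Sender: TObject);
begin
  // 2nd syntax: the URL is passed directly to Start.
  // Assumption: a False result means the scan was not started.
  if not HttpScan1.Start('http://www.hostname.foo/') then
    ShowMessage('The scan could not be started.');
end;

procedure TForm1.FormCloseQuery(Sender: TObject; var CanClose: Boolean);
begin
  // While downloads are running, ask THttpScan to stop and refuse to
  // close; the Form can be closed once Working returns False.
  if HttpScan1.Working then
  begin
    HttpScan1.Stop;
    CanClose := False;
  end;
end;
```

With this arrangement the user closes the Form a second time once the scan has stopped.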

 

properties

Agent : string = ' '
contains the name of the application or entity sending Http requests (e.g. : "YourApp").

ConcurrentDownloads : integer = 2
number of html page downloads running simultaneously. A value between 4 and 20 is a good range, depending on your ISP speed and your processor.

DephtSearchLevel : integer = 2
depth of the tree of followed pages, starting from the first url. In other words : "each time I find a link, I click on this link, n times". If the scan is kept on the host of the starting url by setting StayOnSite to true, a high value allows grabbing an entire web site.
Together with StayOnSite, this is the most important parameter.

HttpPort : integer = 80
http port of the starting url

FileOfResults : string = ' '
complete path of the file in which to store the results of the processing.

LeavesFirst : boolean = false
if set to true, the leaves of the html page tree are traversed before the branches.

Password : string = ' '
needed if the starting url is username/password protected.

Referer : string = ' '
the address (URL) of the document from which the URL in the request was obtained. If this parameter is left blank, no "referrer" is sent.

Retries : integer = 3
number of download retries when an http error occurs.

ReUseCache : boolean = false
if set to true, the local cache file is read before downloading pages.

SeekRobotsTxt : boolean = false
if set to true, THttpScan searches for robots.txt files at the root of the sites (http://www.hostname.foo/robots.txt). If the file is found, its content is returned by the OnPageReceived event.

StartingUrl : string = ' '
the url from which the scanning will be performed.
Must be set before calling the Start function if it is called without a Url parameter.

StayOnSite : boolean = false
if set to true, links to urls whose host name differs from that of the starting url are ignored. This allows browsing an entire web site with a high DephtSearchLevel value. If false, be careful : with a DephtSearchLevel greater than 2 on pages with a lot of links, you'll start to scan the whole internet !
Together with DephtSearchLevel, this is the most important parameter.

TimeOut : integer = 300
time (in seconds) allowed for the http thread to connect to an URL before the attempt is aborted. The thread tries to connect Retries times before the OnError event occurs.

UserName : string = ' '
needed if the starting url is username/password protected.

Working : boolean = false (read only)
indicates the state of httpscan : "waiting" or "working". Can be tested before closing the Form to prevent error messages if downloads are currently running. To be used with the Stop method. You can also use the OnWorking event.
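
The properties above can also be set in code before starting a scan. The following is a minimal sketch, not a recommended configuration : Form1, HttpScan1 and the handler name are assumptions, and the values are only illustrative.

```pascal
// Sketch only: Form1, HttpScan1 and the handler name are hypothetical.
procedure TForm1.ScanSiteButtonClick(Sender: TObject);
begin
  HttpScan1.Agent := 'YourApp';
  HttpScan1.ConcurrentDownloads := 4;
  HttpScan1.DephtSearchLevel := 3;   // follow links three levels deep
  HttpScan1.StayOnSite := True;      // ignore links to other hosts
  HttpScan1.Retries := 3;
  HttpScan1.TimeOut := 60;           // seconds
  HttpScan1.StartingUrl := 'http://www.hostname.foo/';
  HttpScan1.Start;                   // 1st syntax: uses StartingUrl
end;
```

Setting StayOnSite to true here keeps a DephtSearchLevel of 3 confined to the host of the starting url.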

 

events

OnError (Url: String; ErrorCode: Cardinal; ErrorMsg: String);
occurs when a "GET" request fails. Returns the url which failed, with the error code and the error message if available.

OnLinkFound (UrlFound, TypeLink, FromUrl, HostName, UrlPath, UrlPathWithFile, ExtraInfos: String; var WriteToFile: String);
This event occurs each time a link is found :
UrlFound : the full address of the link found
TypeLink : type of link (htm, jpg, mpg, cgi, php, etc...)
FromUrl : the referring url (.htm) from which the link comes
Hostname : the host name of the UrlFound address
UrlPath : the url path (without host name & without filename)
UrlPathWithFile : the url path (without host name but with filename)
ExtraInfos : the extra infos passed to the URL (e.g. ?param1=v)
WriteToFile : the line to be written to FileOfResult. See the comments on the WriteToFile parameter.
HrefOrSrc : returns 'S' if the link is an object loaded on the page (a thumb for example) and 'H' if the link is the destination URL.
CountArea : each area found receives a sequential number. When a Href or Src link is found, it receives the number of its area, so that the Href / Src link pairs can be associated.
FollowIfHtmlLink : if false, THttpScan doesn't continue searching in the direction of the current link.

OnLog (LogMessage : string);
returns a string which explains the internal analysis (for debugging purposes)

OnPageReceived (Hostname, Url, Head, Body : string);
occurs each time an html page is downloaded. Returns the headers and the body text of the page.
Hostname : hostname of the page received
Url : url of the text page received
Head : head of the http query request for the page received
Body : body of the text of the page received.

OnUpdatedStats (InQueue, Downloading, ToAnalyze, Done, Retries, Errors: Integer);
occurs each time something changes in the httpscan state. Returns the number of pages in queue (waiting for download), the number of pages currently downloading, the number of pages waiting to be analyzed, the number of pages analyzed (done), the number of retries, and the number of page downloads in error.
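
A handler for this event could, for instance, display the counters on the form. This is a minimal sketch : Form1, StatusLabel and the handler name are assumptions, and it assumes the event is raised in the main thread, which is not stated in this document.

```pascal
// Sketch only: Form1, StatusLabel and the handler name are hypothetical.
// Format comes from the SysUtils unit.
procedure TForm1.HttpScan1UpdatedStats(InQueue, Downloading, ToAnalyze,
  Done, Retries, Errors: Integer);
begin
  StatusLabel.Caption := Format(
    'queued: %d  downloading: %d  to analyze: %d  done: %d  retries: %d  errors: %d',
    [InQueue, Downloading, ToAnalyze, Done, Retries, Errors]);
end;
```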

OnWorking  (working_ : boolean);
occurs when httpscan passes from the state "waiting" to the state "working" and vice versa. Can be used to detect when HttpScan has finished its job. You can also use the Working property.
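
As an illustration, the event can keep the user interface in step with the scan state. In this minimal sketch, Form1, StartButton, StopButton and the handler name are assumptions, and the event is assumed to be raised in the main thread.

```pascal
// Sketch only: Form1, StartButton, StopButton and the handler name
// are hypothetical.
procedure TForm1.HttpScan1Working(working_: Boolean);
begin
  // Only one of the two buttons is usable at any time.
  StartButton.Enabled := not working_;
  StopButton.Enabled := working_;
  if working_ then
    Caption := 'Scanning...'
  else
    Caption := 'Waiting';
end;
```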

 

Comments on the WriteToFile parameter used in the OnLinkFound event :

WriteToFile contains the string to be written to the FileOfResults. If you leave it untouched, for each link found a line is written to the file like this  : "TypeLink";"UrlFound";"HostName".

WriteToFile is useful to write links to the FileOfResult file only for some kinds of links (e.g. "jpg"), or to choose the information written to the file. For example :

If you want to write your own data to the file, e.g. TypeLink, UrlFound and FromUrl, then add the following line in the event :
WriteToFile := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"';

If you want to skip the current link and write nothing to the file for it, simply add the following lines in the event :
if ...=... then begin
   WriteToFile := '';
end;
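
The two cases above can be combined in a small helper, shown here as a self-contained console program so that the line format can be checked without the component. ResultLine is a hypothetical helper, not part of THttpScan, and the sketch assumes that TypeLink is reported in lower case.

```pascal
program WriteToFileSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the line to store for a link,
// or an empty string to write nothing for it.
function ResultLine(const TypeLink, UrlFound, FromUrl: string): string;
begin
  if TypeLink = 'jpg' then
    Result := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"'
  else
    Result := '';
end;

begin
  // A jpg link is kept and formatted; an htm link yields an empty line.
  WriteLn(ResultLine('jpg', 'http://www.hostname.foo/a.jpg',
    'http://www.hostname.foo/index.htm'));
  WriteLn(ResultLine('htm', 'http://www.hostname.foo/b.htm',
    'http://www.hostname.foo/index.htm'));
end.
```

Inside the OnLinkFound event, the helper would be used as WriteToFile := ResultLine(TypeLink, UrlFound, FromUrl);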

 

DESCRIPTION - FEATURES - REGISTRATION

DESCRIPTION

With THTTPSCAN you access web sites as a collection of links to files and data, instead of as graphics and text.

THTTPSCAN recursively analyses HTML pages and extracts all the links found, with detailed information (document type, referer, host name,...). Links are followed through HTML pages in the neighborhood of the initial URL.

Events are generated for each link found and each page read. The "depth search level" and the "stay on site" parameters allow powerful searches and a full view of the files of a site.

THTTPSCAN saves you having to tangle with the Microsoft Wininet API functions and the internet address syntax analysis. Most common parameters can be simply set from the Object Inspector. It can be placed on any window and is only visible at design time.


USE

THTTPSCAN is a basic tool for creating, among others :

custom search engines : search without a browser. THTTPSCAN finds the pages and returns the contents and the linked files list.

multimedia finders : list the mp3, jpg and mpg files linked to a site or in the neighborhood of a site

download managers : THTTPSCAN gives you the whole list of the links.

site changes monitoring : create an automated tool to monitor when new links are added to a site, or when the content of the pages has been changed.

BENEFITS

Simple integration into a Delphi application.
Single registration per developer - no on-going licence fee.

FEATURES

asynchronous-nonblocking transactions

ability to keep searching on the initial site (stay on site)
doesn't return urls found on other sites if not wanted

depth search level from 1 (same page) to n
a high level with "stay on site" enabled grabs the whole site
a high level without "stay on site" enabled grabs all the links on all the pages reachable from the starting url, until the depth search level reaches the chosen value.

event generated on each link found (no polling necessary) with the following parameters :
link found
type of url (htm, jpg, mpg, cgi, etc...)
referring url (url the link found is coming from)
hostname
path
extra infos
href or src type
href/src area sequential number
possibility of not continuing search in certain directions

extract links from frames

event generated on each HTML page read
page address with hostname extracted
full query status
full page content

possibility to search for robots.txt files

proxy support with username/password through Control Panel | Internet Options parameters

SYSTEM REQUIREMENTS

Windows 95/98  /  Windows NT 3.5 or 4    /  Windows 2000
Delphi Version 4 or 5

 

Disclaimer

The author of this program accepts no responsibility for damages resulting from the use of this product and makes no warranty or representation, either expressed or implied, including but not limited to, any implied warranty of merchantability or fitness for a particular purpose.

These software packages are provided here "AS IS", and you, the user, assume all risks when using them.

 

REGISTRATION

Register THTTPSCAN and you will get the full source code. Registration costs :

US  $27
EURO  29
French Francs  190

 

You may register online at http://www.getsoftware.com.

THttpscan home page: http://www.mfornengo.ath.cx.