131 lines
5.8 KiB
Text
131 lines
5.8 KiB
Text
The TRAVERSAL code from old versions of Lynx has been upgraded by David
|
|
Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again, can be
|
|
implemented via a command line switch (-traversal) instead of via a
|
|
compilation symbol for creating a separate Lynx executable as in those
|
|
previous versions, and can be used in conjunction with a -crawl switch
|
|
to make Lynx a front end for a Web Crawler.
|
|
|
|
|
|
Usage:
|
|
|
|
lynx [-traversal] [-realm] [-crawl] ["startpage"]
|
|
|
|
|
|
Added switches are:
|
|
|
|
-traversal Follow all http links derived from startpage that are
|
|
on the same server as startpage. If startpage isn't
|
|
specified then the traversal begins with the default
|
|
startfile or WWW_HOME.
|
|
|
|
-realm Further restrict http links to ones in the same realm
|
|
(having a matching base URI) as the startpage (e.g.,
|
|
http://host/~user/ will restrict the traversal to that
|
|
user's public html tree).
|
|
|
|
-crawl With [-traversal] outputs each unique hypertext page
|
|
as an lnk###########.dat file in the format specified
|
|
below. With [-dump] outputs only the startpage, in
|
|
the same format, to stdout.
|
|
|
|
|
|
Note on startpage:
|
|
|
|
If a startpage is specified and contains any uppercase
|
|
characters, on VMS it should be enclosed in double-quotes.
|
|
The code that extracts the access and host fields from
|
|
startpage for comparisons with links to ensure they are
|
|
not on another server, and the comparisons with already
|
|
traversed links, are case sensitive, and the startpage
|
|
will go to all lowercase on VMS if no double-quotes are
|
|
supplied, such that it might be treated as a new link if
|
|
encountered with uppercase letters.
|
|
|
|
|
|
Files created and/or used with the -traversal switch, based on definitions
|
|
in userdefs.h:
|
|
|
|
TRAVERSE_FILE (traverse.dat):
|
|
Contains a list of all URLs that were traversed. Note
|
|
that if a URL appears in this file it will not be
|
|
traversed again (important if runs are started and
|
|
stopped). Placing an entry in this file BEFORE the
|
|
run will block traversal of that URL. Unlike reject.dat
|
|
a final * has no effect (see below). Note that Lynx
|
|
internal client-side image MAP URLs will be included in
|
|
this file (e.g., LYNXIMGMAP:http://server/foo.html#map1),
|
|
in addition to the "real" (external) http URLs.
|
|
|
|
TRAVERSE_FOUND_FILE (traverse2.dat):
|
|
Contains a list of all URLs that were traversed, in the
|
|
order encountered or re-encountered (but not re-travered)
|
|
during a traversal run, and the TITLEs of the documents
|
|
(separated from the URLs by TABs) A URL and TITLE may be
|
|
present in this list many times. To simplify the list,
|
|
on VMS use: sort/nodups traverse2.dat;1 ;2
|
|
Note that the URLs and TITLEs of the Lynx internal
|
|
client-side image MAP pseudo-documents will not be
|
|
included in this file, though "traversed", but only the
|
|
http URLs and TITLEs derived from the MAP's AREA tag
|
|
HREFs that were traversed.
|
|
|
|
TRAVERSE_REJECT_FILE (reject.dat):
|
|
Contains a list of URLs that have been rejected from the
|
|
traversal. Once a URL has been entered in this list, it
|
|
will not be traversed. URLs that end in a * will cause
|
|
rejection of all URLs that match up to the character before
|
|
the *. So for instance, to reject all htbin references on a
|
|
site put this line in the reject.dat file BEFORE starting
|
|
the run: http://www.site.wherever:8000/htbin*
|
|
|
|
TRAVERSE_ERRORS (traverse.errors):
|
|
A list of links that could not be accessed or had an
|
|
unknown status returned by the http server. If the
|
|
owner of the document containing the link is know via
|
|
a LINK REV="made" HREF="mailto:foo" in it and the
|
|
MAIL_SYSTEM_ERROR_LOGGING was set true in userdefs.h
|
|
or lynx.cfg (not recommended!!!), a message about the
|
|
problem will be mailed to the owner as well.
|
|
|
|
|
|
Files created during traversals if the -crawl switch is included with the
|
|
-traversal switch:
|
|
|
|
lnk########.dat Numbered output files containing the contents of traversed
|
|
hypertext documents in text format. All hypertext links
|
|
within the document have been stripped, and the URL and
|
|
TITLE of the document are recorded as the first two lines,
|
|
e.g., for the seqaxp.bio.caltech.edu home page the first
|
|
two lines will be:
|
|
|
|
THE_URL:http://seqaxp.bio.caltech.edu:8000/
|
|
THE_TITLE:SAF Web server home page
|
|
|
|
The VMSIndex software is being adapted to use this
|
|
information to extract the corresponding URL and TITLE
|
|
for use in indexing the lnk########.dat files, e.g.:
|
|
|
|
$ build_index -
|
|
/url=(text="THE_URL:") -
|
|
/topic=(text="THE_TITLE:",EXCLUDE) -
|
|
/output=INDEX_NAME -
|
|
lnk*.dat
|
|
|
|
A clever person should be able to figure out a way to
|
|
index the lnk########.dat files on Unix as well.
|
|
|
|
If you want the hypertext links in the document to be
|
|
numbered, include the -number_links switch. By default,
|
|
this will cause the list of References (URLs for the
|
|
numbered links) to be appended as well. If you want
|
|
numbered links but not the References list, include the
|
|
-nolist switch as well.
|
|
|
|
Note that any client-side image MAP pseudo documents
|
|
that were "traversed" will not have lnk########.dat
|
|
output files created for them, but output files will
|
|
be created for "real" documents that were traversed
|
|
based on the HREFs of the MAP's AREA tags.
|
|
|
|
This functionality is still under development. Feedback and suggestions
|
|
are welcome.
|