summaryrefslogtreecommitdiffstats
path: root/docs/CRAWL.announce
diff options
context:
space:
mode:
Diffstat (limited to 'docs/CRAWL.announce')
-rw-r--r--docs/CRAWL.announce131
1 files changed, 131 insertions, 0 deletions
diff --git a/docs/CRAWL.announce b/docs/CRAWL.announce
new file mode 100644
index 0000000..b38be12
--- /dev/null
+++ b/docs/CRAWL.announce
@@ -0,0 +1,131 @@
+The TRAVERSAL code from old versions of Lynx has been upgraded by David
+Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again, can be
+implemented via a command line switch (-traversal) instead of via a
+compilation symbol for creating a separate Lynx executable as in those
+previous versions, and can be used in conjunction with a -crawl switch
+to make Lynx a front end for a Web Crawler.
+
+
+Usage:
+
+ lynx [-traversal] [-realm] [-crawl] ["startpage"]
+
+
+Added switches are:
+
+ -traversal Follow all http links derived from startpage that are
+ on the same server as startpage. If startpage isn't
+ specified then the traversal begins with the default
+ startfile or WWW_HOME.
+
+ -realm Further restrict http links to ones in the same realm
+ (having a matching base URI) as the startpage (e.g.,
+ http://host/~user/ will restrict the traversal to that
+ user's public html tree).
+
+ -crawl With [-traversal] outputs each unique hypertext page
+ as an lnk###########.dat file in the format specified
+ below. With [-dump] outputs only the startpage, in
+ the same format, to stdout.
+
+
+Note on startpage:
+
+ If a startpage is specified and contains any uppercase
+ characters, on VMS it should be enclosed in double-quotes.
+ The code that extracts the access and host fields from
+ startpage for comparisons with links to ensure they are
+ not on another server, and the comparisons with already
+ traversed links, are case sensitive, and the startpage
+ will go to all lowercase on VMS if no double-quotes are
+ supplied, such that it might be treated as a new link if
+ encountered with uppercase letters.
+
+
+Files created and/or used with the -traversal switch, based on definitions
+in userdefs.h:
+
+TRAVERSE_FILE (traverse.dat):
+ Contains a list of all URLs that were traversed. Note
+ that if a URL appears in this file it will not be
+ traversed again (important if runs are started and
+ stopped). Placing an entry in this file BEFORE the
+ run will block traversal of that URL. Unlike reject.dat
+ a final * has no effect (see below). Note that Lynx
+ internal client-side image MAP URLs will be included in
+ this file (e.g., LYNXIMGMAP:http://server/foo.html#map1),
+ in addition to the "real" (external) http URLs.
+
+TRAVERSE_FOUND_FILE (traverse2.dat):
+ Contains a list of all URLs that were traversed, in the
+ order encountered or re-encountered (but not re-travered)
+ during a traversal run, and the TITLEs of the documents
+ (separated from the URLs by TABs) A URL and TITLE may be
+ present in this list many times. To simplify the list,
+ on VMS use: sort/nodups traverse2.dat;1 ;2
+ Note that the URLs and TITLEs of the Lynx internal
+ client-side image MAP pseudo-documents will not be
+ included in this file, though "traversed", but only the
+ http URLs and TITLEs derived from the MAP's AREA tag
+ HREFs that were traversed.
+
+TRAVERSE_REJECT_FILE (reject.dat):
+ Contains a list of URLs that have been rejected from the
+ traversal. Once a URL has been entered in this list, it
+ will not be traversed. URLs that end in a * will cause
+ rejection of all URLs that match up to the character before
+ the *. So for instance, to reject all htbin references on a
+ site put this line in the reject.dat file BEFORE starting
+ the run: http://www.site.wherever:8000/htbin*
+
+TRAVERSE_ERRORS (traverse.errors):
+ A list of links that could not be accessed or had an
+ unknown status returned by the http server. If the
+ owner of the document containing the link is know via
+ a LINK REV="made" HREF="mailto:foo" in it and the
+ MAIL_SYSTEM_ERROR_LOGGING was set true in userdefs.h
+ or lynx.cfg (not recommended!!!), a message about the
+ problem will be mailed to the owner as well.
+
+
+Files created during traversals if the -crawl switch is included with the
+-traversal switch:
+
+lnk########.dat Numbered output files containing the contents of traversed
+ hypertext documents in text format. All hypertext links
+ within the document have been stripped, and the URL and
+ TITLE of the document are recorded as the first two lines,
+ e.g., for the seqaxp.bio.caltech.edu home page the first
+ two lines will be:
+
+ THE_URL:http://seqaxp.bio.caltech.edu:8000/
+ THE_TITLE:SAF Web server home page
+
+ The VMSIndex software is being adapted to use this
+ information to extract the corresponding URL and TITLE
+ for use in indexing the lnk########.dat files, e.g.:
+
+ $ build_index -
+ /url=(text="THE_URL:") -
+ /topic=(text="THE_TITLE:",EXCLUDE) -
+ /output=INDEX_NAME -
+ lnk*.dat
+
+ A clever person should be able to figure out a way to
+ index the lnk########.dat files on Unix as well.
+
+ If you want the hypertext links in the document to be
+ numbered, include the -number_links switch. By default,
+ this will cause the list of References (URLs for the
+ numbered links) to be appended as well. If you want
+ numbered links but not the References list, include the
+ -nolist switch as well.
+
+ Note that any client-side image MAP pseudo documents
+ that were "traversed" will not have lnk########.dat
+ output files created for them, but output files will
+ be created for "real" documents that were traversed
+ based on the HREFs of the MAP's AREA tags.
+
+This functionality is still under development. Feedback and suggestions
+are welcome.