9. WWW and FTP

9.1 Available browsers

On most machines the default browser is Netscape Navigator or Mozilla. On many workstations, however, you can also choose Galeon (Gnome), Konqueror (KDE) or lynx (a text-only browser). There are no special issues with any of these browsers. Note only that when clicking on a mailto: URL each of them will start a different mail program (the internal mailer for Netscape/Mozilla, Evolution for Galeon, KMail for Konqueror), so you risk ending up with messed-up mail folders if you use too many of them.

The web without browsers

Sometimes it can be useful to download information from the web without having to interact with a browser: maybe you need to download a large number of related small pages, or you find that the web server is terribly slow, except from 3 to 5 AM... A few programs can help, downloading in the background for you and possibly saving the result.

GET
is the simplest possible client: it is invoked with one or more URLs on the command line and writes the downloaded content to stdout. It is not of much use for batch downloads, but can be used to track down problems with web servers. See also HEAD, which prints only the response headers.
Wget
(http://www.wget.org/) performs non-interactive downloads of files from the Web. It supports the HTTP, HTTPS and FTP protocols, as well as retrieval through HTTP proxies. It can parse downloaded files for links and recursively download all linked pages (be careful, since this can lead to unwanted massive downloads; see man wget for the recursive download options).
cURL
(http://curl.haxx.se/) is a client to get documents/files from, or send documents to, a server using any of the supported protocols (HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP or FILE). The command is designed to work without any kind of user interaction. curl offers a busload of useful tricks such as proxy support, user authentication, ftp upload, HTTP post, SSL (https:) connections, cookies, file transfer resume and more; extensive documentation is available in man curl. A few usage examples are given after this list.
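
Some typical invocations of these tools follow (www.example.org is only a placeholder host; check the respective man pages for the exact options available on your system):

  HEAD http://www.example.org/                     print only the response headers
  GET http://www.example.org/ > page.html          save the page body to a local file
  wget -r -l 2 http://www.example.org/docs/        recursive download, two levels deep
  curl -C - -O http://www.example.org/file.tar.gz  resume an interrupted download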

9.2 Setting up a web proxy

The use of a web proxy can improve both transfer speed and response time when accessing many popular web sites, while reducing the total traffic on SISSA's external link. This is achieved by keeping a local copy of frequently visited pages. The local copy is validated before being served to the client, i.e. compared against a set of refresh patterns to ensure the information is still up to date.

You should configure your browser to use the proxy with the automatic configuration script provided at http://proxy.sissa.it/cgi-bin/proxy.pac. The script will configure your browser to use the proxy farm, with load balancing and high availability in case of failure of one of the proxy boxes.
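
Command-line tools such as wget and curl do not read the automatic configuration script, but they do honor the http_proxy environment variable. A minimal setup might look like the following (the port number here is only an assumption; ask the system administrators for the correct value):

  export http_proxy=http://proxy.sissa.it:3128/    # port 3128 is an assumption
  wget http://www.example.org/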

Browser-specific information

If you encounter any problem, please fill in the error-reporting forms provided with the proxy error messages, giving as much information as you can. Your reports will help us track down existing problems and improve the quality of the service.

Proxy Q&A

Q:
I have 20 GB of free space on my workstation available for the browser cache; I never use another browser, nor any other workstation. Why should I set up my browser to use a proxy?
A:
There are at least three good reasons:
  1. the proxy cache is shared among all users: this means that you can benefit from downloads made by others at SISSA, and they can benefit from yours
  2. the proxy software is entirely devoted to caching and uses better algorithms than your browser to find out whether a cached page is still fresh or should be regarded as stale, so you will get a higher hit rate with a lower risk of being served stale content
  3. most browsers are tuned to work with small to mid-size caches (typically of the order of 10 MB) and tend to perform badly with very large caches: the time needed to search the local cache can be larger than the time needed to download the page itself; on older machines the browser can actually crash or even make the whole system unstable
Q:
I'm going to browse a SSL-enabled site, and I do not want reserved information to be cached. Should I disable the proxy in my browser?
A:
Encrypted connections are proxied but not cached - in fact caching encrypted information would be useless, since nobody but the intended recipient is able to decrypt it. However, some sites serve only the sensitive information over SSL, while other page components (e.g. navigation bars) are served unencrypted, and these can be usefully cached, so you can safely use the proxy.

9.3 File transfer

Web browsers can use the file transfer protocol for anonymous ftp, as can the wget and curl programs described above. They can also perform non-anonymous ftp, but the password is sent as plain text over the network, and sometimes it is even shown on screen! You should rather use sftp and scp, which use encrypted connections for both authentication and data transfer. While scp is non-interactive and lets you transfer files with a single command line, sftp is an interactive client with the same user interface as the plain old ftp client.

scp and sftp quick reference
scp path/file user@host:path/file copy local file to remote host
scp user@host:path/file path1/file1 copy remote file to local filesystem
scp u1@h1:path1/file1 u2@h2:path2/file2 copy between remote hosts
scp -r ... copy recursively entire directories
scp -C ... enable compression - this will speed up some transfers on slow lines
sftp user@host start the sftp client and login to specified host
sftp -C ... start the sftp client enabling compression


Within the sftp client all standard ftp commands are available; see man sftp or the online help by typing help at the sftp> prompt. Remember that wildcards are expanded by the (local) shell unless they are single-quoted: scp file*.tex user@host:/tmp works as expected, but you need to write scp user@host:'/home/user/file*.tex' /tmp to perform the opposite. Without the single quotes the shell would have expanded the file*.tex expression locally, probably resulting in an error or at least in something different from what you wanted.
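
For example, a short interactive sftp session might look like this (host and file names are only placeholders):

  $ sftp user@host.example.org
  sftp> cd papers
  sftp> get draft.tex      download a remote file
  sftp> put figure1.eps    upload a local file
  sftp> quit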

Piero Calucci 2004-11-05