Program Overview
Today's Schedule
- Begin at 8:30am
- 15 minute break at 10:00am
- Lunch at 11:30am
- Restart at 12:30pm
- 15 minute break at 1:45pm
- End at 3:30pm
Order of topics
- Program Overview
- What are proxies?
- Definitions
- Proxies for Bandwidth Conservation
- How Proxies Work
- Local PC Setup
- Proxies for Statistics
- Privacy Statements
- Proxies for Filtering
- Authentication systems
- Proxies for Remote Resource Access
- Free Proxy Servers
- Commercial Proxy Servers
- Future Developments
- Wrap-up / Evaluation
- Resources
Facility Logistics
- Program Logistics
- Questions
- Handouts
Slides which have been added since the handout was produced
Slides which have changed since the handout was produced
Orientation Questions
How many sites have deployed proxy servers now or are in the planning stages? For what reasons?
Orientation Questions
What drew you to this Regional Institute? What do you hope to get out of it at the end of the day?
What are proxies?
A proxy is a service that sits between web servers (or, more accurately, "origin web servers") and clients. This service receives requests from clients and makes requests to servers on behalf of the clients.
A caching proxy, in addition to the above, saves a copy of the HTML pages, graphics, and other resources as they pass through.
Traditional reasons to use caching proxies
- To reduce latency...
Latency is the delay between when a client makes a request and the entire response is received.
If clients on a LAN are using the services of a caching proxy server on the LAN, the caching proxy server can usually respond faster than the origin server to requests for objects it already holds.
- To reduce traffic...
Since copies of an HTML page or a graphic are stored locally, any request for that item after the first one can be served using the proxy's store of files.
- Clarification: Browsers also have caches of files. Proxy caches refer to an entirely separate service.
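The caching behavior described above can be sketched as a lookup table keyed by URL. This is only an illustration of the idea, not how any particular proxy product is implemented; `makeCachingProxy` and `fetchFromOrigin` are hypothetical names, and the "origin server" here is a plain function standing in for a real network request.

```javascript
// Minimal sketch of a caching proxy's core lookup (illustrative only).
// `fetchFromOrigin` is a hypothetical stand-in for a request to the origin server.
function makeCachingProxy(fetchFromOrigin) {
  const cache = new Map(); // URL -> stored response body
  const stats = { hits: 0, misses: 0 };

  function get(url) {
    if (cache.has(url)) {
      stats.hits += 1; // served locally: no traffic on the Internet link
      return cache.get(url);
    }
    stats.misses += 1; // first request: must go to the origin server
    const body = fetchFromOrigin(url);
    cache.set(url, body);
    return body;
  }

  return { get, stats };
}

// Usage: the second request for the same URL never reaches the origin.
let originRequests = 0;
const proxy = makeCachingProxy((url) => {
  originRequests += 1;
  return "<html>content of " + url + "</html>";
});
proxy.get("http://www.college.edu/site-map.html");
proxy.get("http://www.college.edu/site-map.html");
console.log(originRequests, proxy.stats);
```

A real caching proxy also has to decide when a stored copy is too old to reuse; this sketch deliberately ignores expiry.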
Library Scenarios
Bandwidth Conservation
- Your access to the Internet is slow, and you've determined that it is because the line is near capacity. You know a lot of people are using web sites, and you suspect they are using similar web sites. Do you need to purchase additional bandwidth from your Internet Service Provider?
- Your network connection to a branch site is slow and congested. Much of the access at the branch site is for resources hosted on your main network. Can you stretch the useful lifetime of that old line any further?
Statistics
- Budget time again, and it seems like your electronic resource line is taking up more and more money. How do you know what your patrons are using and if usage patterns are worth the cost of a resource?
Filtering
- You have a limited number of workstations but an unlimited number of people who want to play games, send e-mail, or chat. The library has formed a policy which says these activities are prohibited. Can you implement this policy?
- You want to dedicate some of your workstations to just library catalog, database and/or e-journal access. Can that be done without hovering over patrons?
Remote Resource Access
- You have invested all of this money in electronic resources, but database vendors offer you one of two ways to provide access: either by issuing usernames and passwords or by restricting access to a range of Internet addresses. Distributing passwords to all of your clients is awkward, yet they all want access from their homes, offices, or foreign countries. Can you get them access?
Definitions
In the beginning...
- User
- A person making a request.
- Resource
- A document, graphic, file, or other object stored or generated which can be addressed with a URL.
- Entity
- The content transferred between clients and servers, encompassing the metadata and the resource itself.
- URL
- Uniform Resource Locator -- the address of an entity on the network.
Roles
- Client
- Software that will retrieve an entity from a network server.
- Server
- Software that accepts a request for an entity and returns the entity to the client.
- Origin Server
- The server which holds the original copy of an entity.
- Request
- The formal process by which a client asks for an entity from a server.
- Response
- The formal process of a server returning an entity or information about an entity to a client.
Proxies
- Proxy
- A server which accepts entity requests from clients and retrieves that entity from the origin server or another proxy server.
- Transparent Proxy
- A proxy that passes requests and responses unmodified, except as required for proxy authentication.
- Non-transparent Proxy
- A proxy that somehow modifies the request or response to provide some added value to the client or user.
- Cache
- A collection of entities (resources and metadata) that can be used as responses to client requests.
- Caching Proxy
- A proxy server with a cache. Sometimes referred to as a "proxy cache" or simply a "cache"; the bare term "proxy" is also often used loosely as if it included the "caching" component.
Authentication/Authorization
- Authentication
- "The process where a network user establishes a right to an identity." (Lynch)
- Authorization
- "The process of determining whether an identity ... is permitted to perform some action." (Lynch)
- Access management
- "Systems that may make use of both authentication and authorization services in order to control use of a networked resource." (Lynch)
- Credentials
- The right of an authenticated, authorized user to perform a function.
- Firewall
- Hardware and/or software used to protect hosts on one network segment from accesses on another.
Proxies for Bandwidth Conservation
Set up proxy server
- Use a Transparent Proxy server
- In this case, we'll use Apache running under Linux
Building Apache
+--------------------------------------------------------+
| You now have successfully built and installed the      |
| Apache 1.3 HTTP server. To verify that Apache actually |
| works correctly you now should first check the         |
| (initially created or preserved) configuration files   |
|                                                        |
|   /usr/local/apache-proxy/conf/httpd.conf              |
|                                                        |
| and then you should be able to immediately fire up     |
| Apache the first time by running:                      |
|                                                        |
|   /usr/local/apache-proxy/bin/apachectl start          |
|                                                        |
| Thanks for using Apache.       The Apache Group        |
|                                http://www.apache.org/  |
+--------------------------------------------------------+
- Download http://httpd.apache.org/dist/apache_1.3.23.tar.gz
- gunzip apache_1.3.23.tar.gz
- tar xf apache_1.3.23.tar
- cd apache_1.3.23
- ./configure --with-layout=Apache --prefix=/usr/local/apache-proxy --enable-module=most --enable-shared=max
- make
- make install
Apache httpd.conf configuration
#
# Proxy Server directives. Uncomment the following lines to
# enable the proxy server:
#
<IfModule mod_proxy.c>
ProxyRequests On
Listen 4545
<Directory proxy:*>
Order deny,allow
Deny from all
Allow from .your_domain.com
</Directory>
#
# Enable/disable the handling of HTTP/1.1 "Via:" headers.
# ("Full" adds the server version; "Block" removes all outgoing Via: headers)
# Set to one of: Off | On | Full | Block
#
ProxyVia On
#
# To enable the cache as well, edit and uncomment the following lines:
# (no cacheing without CacheRoot)
#
CacheRoot "/usr/local/apache-proxy/proxy"
CacheSize 5
CacheGcInterval 4
CacheMaxExpire 24
CacheLastModifiedFactor 0.1
CacheDefaultExpire 1
#NoCache a_domain.com another_domain.edu joes.garage_sale.com
</IfModule>
# End of proxy directives.
Starting Apache
- From the root directory of the Apache installation, run: bin/apachectl start
Client configuration
- Each web browser has a specific place where you can enter Proxy settings
- Netscape 4.x/6.x and Mozilla: "Edit" -> "Preferences" -> "Advanced" -> "Proxies"
- Microsoft IE 5.x/6.x (Windows only): "Tools" -> "Internet Options" -> "Connections" -> "LAN Settings" -> "Use a Proxy Server"
- Selections for "Address" and "Port"
Demonstration
- To verify...
- http://www.ohiolink.edu/cgi-bin/whereami.pl
How Proxies Work
Networking overview
Data Link (or Network Interface) Layer
Where the bits meet the wire...
- Standards
- Ethernet, FDDI, ATM, Wireless, Dialup, ISDN, DSL
Internet Layer
The IP (Internet Protocol) of TCP/IP
- Addressing
- The uniquely defined IP address for each machine.
- Example
- 123.234.123.234
Transport Layer
The TCP (Transmission Control Protocol) of TCP/IP
- Addressing
- "Ports" for each service on a machine
- Example
- 80: http services
- Also found
- UDP: User Datagram Protocol
Application Layer
- Enough already! Let's get some work done...
- Programs that ask for and receive services from the network
- Web clients/servers, e-mail, FTP, RealAudio, etc.
Layers on top of layers
- Data Link Layer
- Internet Layer
- Transport Layer
- Application Layer
- On which layer would you find switches, hubs, and repeaters?
- On which layer would you find routers?
- On which layer would you find firewalls and gateways?
- On which layer would you find proxies?
Overview of HTTP/1.0 and HTTP/1.1 protocols
Formats of messages
- HTTP messages have the following structure
- Request/Response Line
- Zero or more headers
- A blank line
- Zero or one entities
- Request lines have the form: <Method> <Address> <Protocol-Version>
- Response lines have the form: <Protocol-Version> <Response-code> <Text-Message>
- Header lines have the form: <Header-name>: <Value>
- Lines end with Carriage-Return/Line-Feed combinations.
Examples of messages
GET /site-map.html HTTP/1.0
Host: www.college.edu

HTTP/1.0 200 OK
Date: Wed, 19 Apr 2000 16:37:29 GMT
Server: Apache/1.3.12 (Unix) PHP/3.0.16
Content-Type: text/html

<HTML>
<HEAD>
...
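The message structure listed above (start line, headers, blank line, optional entity) can be taken apart with a few string operations. The following is a minimal sketch for illustration, assuming well-formed input with CRLF line endings; `parseHttpMessage` is a hypothetical helper, not part of any library.

```javascript
// Split a raw HTTP message into its start line, headers, and entity (body).
// Assumes CRLF line endings; folded header lines are not handled.
function parseHttpMessage(raw) {
  const separator = raw.indexOf("\r\n\r\n"); // the blank line ends the headers
  const head = separator === -1 ? raw : raw.slice(0, separator);
  const entity = separator === -1 ? "" : raw.slice(separator + 4);

  const lines = head.split("\r\n");
  const startLine = lines[0];
  const headers = {};
  for (const line of lines.slice(1)) {
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip malformed header lines
    // Header names are case-insensitive, so store them in lower case.
    headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
  }
  return { startLine, headers, entity };
}

// Usage: parse the response from the example above.
const response = parseHttpMessage(
  "HTTP/1.0 200 OK\r\n" +
  "Date: Wed, 19 Apr 2000 16:37:29 GMT\r\n" +
  "Content-Type: text/html\r\n" +
  "\r\n" +
  "<HTML>..."
);
const [version, code, ...text] = response.startLine.split(" ");
console.log(version, Number(code), text.join(" "), response.headers["content-type"]);
```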
Listing of HTTP Response Codes
- 1xx: Informational
- Request received, continuing process
- 2xx: Success
- The action was successfully received, understood, and accepted
- 200 OK
- 3xx: Redirection
- Further action must be taken in order to complete the request
- 301 Moved Permanently
- 304 Not Modified
- 305 Use Proxy
- 4xx: Client Error
- The request contains bad syntax or cannot be fulfilled
- 401 Unauthorized
- 403 Forbidden
- 404 Not Found
- 407 Proxy Authentication Required
- 5xx: Server Error
- The server failed to fulfill an apparently valid request
Adapted from RFC2616
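Because the first digit of a response code identifies its class, a log-analysis script can group codes without listing each one. A small sketch (the function name `statusClass` is illustrative):

```javascript
// Map an HTTP status code to its class name using only the first digit.
function statusClass(code) {
  const classes = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
  };
  return classes[Math.floor(code / 100)] || "Unknown";
}

console.log(statusClass(304), "/", statusClass(407));
```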
Format of Proxy Request messages
- Similar to requests to the origin server, except that the full URL (protocol scheme and host/port, as well as the path) is included in the request line
- Specialized proxy-related headers can also be used.
- Request lines have the form: <Method> <Address> <Protocol-Version>
GET http://www.college.edu/site-map.html HTTP/1.0
Host: www.college.edu
Cache-control: no-cache

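The difference between a request sent directly to an origin server and one sent to a proxy comes down to the address on the request line. A brief sketch, assuming a JavaScript runtime that provides the standard `URL` class (for example Node.js); `requestLines` is a hypothetical helper:

```javascript
// Build the request line and Host header for a GET request.
// When the request goes to a proxy, the full URL is placed on the request line;
// when it goes to the origin server, only the path (and query) is sent.
function requestLines(url, viaProxy) {
  const u = new URL(url);
  const target = viaProxy ? u.href : u.pathname + u.search;
  return ["GET " + target + " HTTP/1.0", "Host: " + u.host];
}

console.log(requestLines("http://www.college.edu/site-map.html", false).join("\n"));
console.log(requestLines("http://www.college.edu/site-map.html", true).join("\n"));
```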
Cache headers
General
Headers that can be used in either HTTP Requests or Responses.
- Date
- Date and time the message was created. Format of the field must be as described in RFC1123 (e.g. "Tue, 15 Nov 1994 08:12:31 GMT").
- Pragma
- Used to pass special directives in request and response messages. One commonly used Pragma header is "no-cache", but this is being phased out in favor of the "Cache-control" header.
- Via
- Indicates the chain of proxy servers used to forward the message. Proxy servers must specify the protocol/version and the hostname or pseudonym of the server.
Entity
Headers which describe a resource (can be used in either HTTP Requests or Responses).
- ETag
- The "Entity Tag" for the resource. Entity tags are unique identifiers for a specific version of a resource, and can be used with the "If-Match" and "If-None-Match" request headers to determine when a resource changes.
- Expires
- Specifies the date and time the entity expires. Cached copies of an entity should not be used after this time without revalidation.
- Last-modified
- Specifies the last modification date and time of the resource on the origin server.
Request
Headers used only in HTTP Requests.
- Host
- The Internet (DNS) host name and port number from the URL of the resource being requested.
- Authorization
- Used to send the user's credentials to the origin server. Typically a client will resend a request with an "Authorization" header when it receives a "401 Unauthorized" status code and a "WWW-Authenticate" response header in response to a URL request.
- Proxy-Authorization
- Used to send the user's credentials to the chain of proxy servers servicing the request.
- If-modified-since
- Used in a request to make it conditional: if the requested resource has not been modified since the time specified in this field, the resource will not be returned from the server; instead, a 304 "Not modified" response will be returned without any message-body.
- If-match
- In combination with the "ETag" entity header, the server will return a 412 "Precondition failed" if the ETag of the entity being requested is different from the ETag of the entity on the server.
- If-none-match
- The inverse of the "If-match" header operation. If the ETag of the entity being requested matches the ETag of the entity on the server, the server returns a 304 "Not Modified" status.
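The conditional request headers above can be read as a small decision procedure on the server side. The sketch below is a simplification for illustration: it compares a single ETag by exact string match and ignores the "*" wildcard, ETag lists, and weak validators. The function name and the `resource` object shape are assumptions.

```javascript
// Decide which status code a server would return for a conditional GET.
// resource: { etag: string, lastModified: Date }
// headers:  request headers with lower-cased names
function evaluateConditional(resource, headers) {
  if ("if-match" in headers && headers["if-match"] !== resource.etag) {
    return 412; // Precondition Failed: client's ETag differs from the server's
  }
  if ("if-none-match" in headers) {
    // Matching ETag means the client's copy is still current.
    return headers["if-none-match"] === resource.etag ? 304 : 200;
  }
  if ("if-modified-since" in headers) {
    const since = new Date(headers["if-modified-since"]);
    if (!Number.isNaN(since.getTime()) && resource.lastModified <= since) {
      return 304; // Not Modified: no message body is sent
    }
  }
  return 200; // send the full entity
}

const page = { etag: '"v1"', lastModified: new Date("Wed, 19 Apr 2000 16:37:29 GMT") };
console.log(evaluateConditional(page, { "if-none-match": '"v1"' }));
console.log(evaluateConditional(page, { "if-modified-since": "Tue, 18 Apr 2000 00:00:00 GMT" }));
```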
Response
Headers used only in HTTP Responses.
- WWW-Authenticate
- Returned by the server when credentials are required to retrieve a resource, but the credentials have not been supplied or are invalid. The header contains parameters instructing the client how to respond with a correctly formatted "Authorization" header.
- Proxy-authenticate
- Returned by a proxy when credentials are required to use the services of the proxy.
- Warning
- Used by an origin server or a proxy to send more detailed warning messages back to the client.
Cache control
The "Cache-control" general header was introduced in HTTP/1.1 to consolidate and further refine the
caching policies and requests in client caches and proxy caches. The "Cache-control" header defines
different directives depending on whether it is used in a request or response context.
Cache control request directives
- no-cache
- Requests an end-to-end revalidation -- the origin server should be reached through the chain of proxies to determine whether the entity is up-to-date.
- no-store
- An intermediate proxy must not store any part of the request or response on non-volatile (e.g. disk) media.
- max-age=seconds
- The client specifies a maximum age of the entity (in seconds) that it will accept out of a proxy's cache before the origin server must be contacted.
- max-stale
- The client specifies that it is willing to accept an entity that the cache has determined is past its freshness lifetime.
- max-stale=seconds
- As above, but the client specifies a maximum number of seconds beyond a freshness lifetime.
- min-fresh=seconds
- The client requires that the entity in the proxy cache must have at least the specified number of seconds left before the freshness lifetime expires.
- only-if-cached
- Requests that the proxy return the entity only if it can be served from the proxy's cache.
Cache control response directives
- public
- The server specifies that the response is cacheable in any cache (client or proxy cache).
- private
- The response is intended for the specific client only and cannot be cached by any shared caches.
- no-cache
- The response may be stored, but a cache must not reuse it to satisfy a later request without first revalidating it with the origin server.
- no-store
- The response cannot be stored on any non-volatile media. This usually means that the entity can only be stored in memory and never to disk, where it is susceptible to compromise.
- no-transform
- Intermediate proxy servers must not perform any transformations on the entity.
- max-age=seconds
- The origin server specifies a freshness lifetime for the entity, overriding lifetime values determined by the proxy caches.
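To make the response directives concrete, here is a rough sketch of how a shared cache might read a "Cache-control" response header. It is a simplification under stated assumptions: only the directives listed above are considered, field-name arguments to "private" and "no-cache" are ignored, and all function names are hypothetical.

```javascript
// Parse a Cache-control header value into a { directive: value } object.
// Directives without an argument are stored as true.
function parseCacheControl(value) {
  const directives = {};
  for (const part of value.split(",")) {
    const [name, arg] = part.trim().split("=");
    if (!name) continue;
    directives[name.toLowerCase()] = arg === undefined ? true : arg.replace(/"/g, "");
  }
  return directives;
}

// May a shared (proxy) cache keep a copy of this response at all?
function sharedCacheMayStore(directives) {
  return !directives["no-store"] && !directives["private"];
}

// May a stored copy of the given age be served without contacting the origin?
function isFresh(directives, ageSeconds) {
  if (directives["no-cache"]) return false; // revalidate before every reuse
  const maxAge = Number(directives["max-age"]);
  return Number.isFinite(maxAge) && ageSeconds < maxAge;
}

const cc = parseCacheControl("public, max-age=3600");
console.log(cc, sharedCacheMayStore(cc), isFresh(cc, 120));
```

When no "max-age" is present, a real cache falls back to "Expires" or a heuristic; this sketch simply treats such a response as not fresh.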
Demonstrations
A look at the headers and cacheability parameters of three sites: www.ala.org, www.whitehouse.gov, and www.cnn.com.
Display HTTP Headers
View of HTTP headers using the services of http://www.web-caching.com/showheaders.html.
Cacheability Query
Overview of cacheability of HTML and referenced resources using the services of http://www.ircache.net/cgi-bin/cacheability.py.
Local PC Setup
Manual Configuration
- Nearly all browsers include a proxy client function
- From the earlier example, we saw that each web browser has a specific place where you can enter Proxy settings
- Netscape 4.x/6.x and Mozilla: "Edit" -> "Preferences" -> "Advanced" -> "Proxies"
- Microsoft IE 5.x/6.x (Windows only): "Tools" -> "Internet Options" -> "Connections" -> "LAN Settings" -> "Use a Proxy Server"
- Lynx users may specify a proxy in an environment variable or configuration file
Proxy Auto-Configuration Files
- Proxy information is downloaded from a server when the browser starts
- Includes considerable flexibility in instructing the browser to use various proxy servers or a direct Internet connection.
- Introduced with Navigator 2.x
- Now supported in Navigator and Internet Explorer
Format
Programming
- The FindProxyForURL function receives two parameters for each URL requested: "url" and "host"
- url
- The complete URL being requested.
- host
- The hostname extracted from the URL. This is the exact same host listed in the URL. The port number is not included (it can be extracted from the URL if needed).
- The FindProxyForURL function must return a string in one of three formats:
- DIRECT
- The request should go directly to the origin server.
- PROXY host:port
- The request should go to the specified proxy.
- SOCKS host:port
- The specified SOCKS server should be used.
- More than one method may be used; separate different methods by semicolons
Helpful Pre-defined Functions
- isPlainHostName(host)
- True if and only if there is no domain name in the hostname (no dots).
- host
- the hostname from the URL (excluding port number).
- shExpMatch(str, shexp)
- Returns true if the string matches the specified shell expression.
- str
- is any string to compare (e.g. the URL, or the hostname).
- shexp
- is a shell expression to compare against.
Adapted from Navigator Proxy Auto-Configure File Format
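The browser supplies these helper functions when it runs a PAC file. To try out a PAC script outside a browser, rough stand-ins can be written in ordinary JavaScript. The versions below are sketches based only on the descriptions above, not the browsers' actual implementations; `parseProxyList` is a hypothetical helper for splitting a semicolon-separated return value.

```javascript
// Stand-in for the PAC helper: true when the hostname contains no dots.
function isPlainHostName(host) {
  return host.indexOf(".") === -1;
}

// Stand-in for the PAC helper: match a string against a shell expression.
// Only the "*" and "?" wildcards are translated; everything else is literal.
function shExpMatch(str, shexp) {
  const pattern = shexp
    .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regular-expression metacharacters
    .replace(/\*/g, ".*")                 // * matches any run of characters
    .replace(/\?/g, ".");                 // ? matches exactly one character
  return new RegExp("^" + pattern + "$").test(str);
}

// Split a FindProxyForURL result such as "PROXY a:1; DIRECT" into its methods.
function parseProxyList(result) {
  return result.split(";").map((s) => s.trim()).filter((s) => s.length > 0);
}

console.log(isPlainHostName("www"), shExpMatch("firstsearch.oclc.org", "*oclc.org"));
console.log(parseProxyList("PROXY proxy1.college.edu:4545; DIRECT"));
```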
Example -- All clients through one proxy server
function FindProxyForURL(url, host) {
    // If the host requested on the URL line is not a FQDN
    // (e.g., it is 'www'), then don't proxy.
    if (isPlainHostName(host)) {
        return "DIRECT";
    }
    // Otherwise, send through proxy
    return "PROXY proxy.college.edu:4545";
}
Example -- Some clients through one proxy server for all requests
function FindProxyForURL(url, host) {
    // If the host requested on the URL line is not a FQDN
    // (e.g., it is 'www'), then don't proxy.
    if (isPlainHostName(host)) {
        return "DIRECT";
    }
    // Make OPAC stations go through Proxy Server
    var opacStations = [
        "10.243.20.242", "10.243.21.210", "10.243.21.241", "10.243.22.13",
        "10.243.22.19", "10.243.22.35", "10.243.22.41", "10.243.22.182"
    ];
    var me = myIpAddress();  // look up this workstation's address once
    for (var i = 0; i < opacStations.length; i++) {
        if (me == opacStations[i]) {
            return "PROXY proxy.college.edu:4545";
        }
    }
    // Everyone else can go directly to the origin server
    return "DIRECT";
}
Example -- All clients through two proxy servers for some requests
function FindProxyForURL(url, host) {
    // If the host requested on the URL line is not a FQDN (e.g., it is 'www'),
    // then don't proxy.
    if (isPlainHostName(host)) {
        return "DIRECT";
    }
    // Now do the list of IP-restricted services; they go through the proxy.
    // Note: "*eb.com" matches any host ending in "eb.com" (including, say,
    // "web.com"); use "*.eb.com" to match only subdomains of eb.com.
    if (shExpMatch(host, "*eb.com") || shExpMatch(host, "*oclc.org")) {
        return "PROXY proxy1.college.edu:4545; PROXY proxy2.college.edu:4545";
    }
    // Otherwise, go directly to the origin server
    return "DIRECT";
}
Automatic Detection
- In Internet Explorer 5.x for Windows, Microsoft introduced "Web Proxy Auto-Detect" (WPAD)
- To set up
- Create a DNS entry (either an "A" record or a "CNAME" record) for "wpad.college.edu" which points to a web server.
- On that web server, create a file called "wpad.dat" on the root level that contains the Proxy Autoconfig script.
- Set browsers to "Automatically detect settings" in the proxy setup screen.
- On startup, the browser will attempt to fetch http://wpad.college.edu/wpad.dat to learn settings
- Submitted to the Internet Engineering Task Force (IETF) for consideration as an RFC
Proxies for Statistics
Set up proxy server
- Once again, a transparent proxy server is all that is required.
- Point the clients you would like to monitor at the proxy server.
Statistics programs
- A wide variety of statistics programs are available
- A link to a list is included in the Resources section
- My favorite is Analog, a highly configurable, common program for creating web statistics
Demonstration
4.54.39.182 - - [01/Aug/2000:00:17:10 -0500] "GET http://rave.ohiolink.edu/databases/login/abig HTTP/1.0" 302 296
4.54.39.182 - - [01/Aug/2000:00:17:11 -0500] "GET http://olc7.ohiolink.edu/cgi-bin/login/abig HTTP/1.0" 302 257
4.54.39.182 - - [01/Aug/2000:00:17:13 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=login&p_lang=english&p_d=abig HTTP/1.0" 302 247
4.54.39.182 - - [01/Aug/2000:00:17:19 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=search&state=3v0758.1.1 HTTP/1.0" 200 7164
4.54.39.182 - - [01/Aug/2000:00:17:20 -0500] "GET http://olc7.ohiolink.edu/style/dw.css HTTP/1.0" 304 -
4.54.39.182 - - [01/Aug/2000:00:17:40 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=brwsidx&state=3v0758.1.1&p_IdxPara=JNBR&p_IdxTerm= HTTP/1.0" 200 2349
4.54.39.182 - - [01/Aug/2000:00:17:47 -0500] "GET http://olc7.ohiolink.edu/cgi-bin/submit-brws?f=brwsidx&state=3v0758.2.1&p_L=8&p_IdxTerm=The+Economist HTTP/1.0" 302 286
4.54.39.182 - - [01/Aug/2000:00:18:41 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=brwsidx&state=3v0758.2.1&p_L=8&p_IdxTerm=The-Economist HTTP/1.0" 200 5323
4.54.39.182 - - [01/Aug/2000:00:19:14 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=logout&state=3v0758.3.1&p_goto=cdb HTTP/1.0" 302 224
Privacy Statements
What is collected?
4.54.39.182 - - [01/Aug/2000:00:17:19 -0500] "GET http://olc7.ohiolink.edu/bin/gate.exe?f=search&state=3v0758.1.1 HTTP/1.0" 200 7164
- A proxy server will typically use the Common Log Format:
- host
- The fully-qualified domain name of the client, or its IP number if the name is not available or DNS resolution is turned off.
- ident
- The identity information reported by the client (if running identd).
- authuser
- If the request was for a password protected document, then this is the user id used in the request.
- date
- The date and time of the request, in the following format: [day/month/year:hour:minute:second zone]
- request
- The request line from the client, enclosed in double quotes.
- status
- The three digit status code returned to the client.
- bytes
- The number of bytes in the object returned to the client, not including any headers.
- Other information, such as origin server information, response time, and browser strings, may also be logged.
Portions adapted from Apache Module mod_log_common documentation.
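The Common Log Format fields listed above can be pulled apart with one regular expression. This is a minimal sketch that assumes well-formed lines with no embedded double quotes in the request; `parseLogLine` is a hypothetical helper and not part of any statistics package.

```javascript
// host ident authuser [date] "request" status bytes
const CLF = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$/;

// Parse one Common Log Format line; returns null if the line does not match.
function parseLogLine(line) {
  const m = CLF.exec(line.trim());
  if (!m) return null;
  const [method, url, protocol] = m[5].split(" ");
  return {
    host: m[1],
    ident: m[2],
    authuser: m[3],
    date: m[4],
    request: m[5],
    method,
    url,
    protocol,
    status: Number(m[6]),
    bytes: m[7] === "-" ? 0 : Number(m[7]), // "-" means no body was returned
  };
}

const entry = parseLogLine(
  '4.54.39.182 - - [01/Aug/2000:00:17:19 -0500] ' +
  '"GET http://olc7.ohiolink.edu/bin/gate.exe?f=search&state=3v0758.1.1 HTTP/1.0" 200 7164'
);
console.log(entry.host, entry.status, entry.bytes, entry.url);
```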
Sensitivity of information
- Since all requests for a particular user can go through the proxy server, the proxy server's logs provide a much more complete view of patron activity than any single web server log
- The key is to distill log files to remove any identifying information and aggregate server requests to remove patterns.
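One way to distill a log, sketched below, is to keep only a count of requests per origin host and discard the client address, timestamps, and query strings entirely. This is an illustration of the idea rather than a complete anonymization procedure; the function name and the shape of the input records are assumptions, and the `URL` class is assumed to be available (as in Node.js).

```javascript
// Reduce log entries to request counts per origin host.
// Client addresses, timestamps, paths, and query strings are not carried over.
function summarizeByOriginHost(entries) {
  const counts = {};
  for (const entry of entries) {
    let host;
    try {
      host = new URL(entry.url).hostname;
    } catch (e) {
      continue; // skip entries whose URL cannot be parsed
    }
    counts[host] = (counts[host] || 0) + 1;
  }
  return counts;
}

const summary = summarizeByOriginHost([
  { client: "4.54.39.182", url: "http://rave.ohiolink.edu/databases/login/abig" },
  { client: "4.54.39.182", url: "http://olc7.ohiolink.edu/cgi-bin/login/abig" },
  { client: "4.54.39.182", url: "http://olc7.ohiolink.edu/bin/gate.exe?f=search&state=3v0758.1.1" },
]);
console.log(summary);
```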
Forming a privacy statement
- Library policy may already specify policy for log files.
- Trust*E recommendations:
- What personally identifiable information of yours or third party personally identifiable information is collected from you through the web site
- The organization collecting the information
- How the information is used
- With whom the information may be shared
- What choices are available to you regarding collection, use and distribution of the information
- The kind of security procedures that are in place to protect against the loss, misuse or alteration of information under [NAME OF INSTITUTION] control
- How you can correct any inaccuracies in the information.
Disposition of Log Files
- Adapted Trust*E section on log files:
Log Files
We use IP addresses to analyze trends, administer the site, and gather broad demographic information for aggregate use. IP addresses are not linked to personally identifiable information. Your login ID/barcode may be linked to specific log entries, but this identification is removed before statistics are generated.
Use of logs to identify problems and service abuses
- Log files can show if a user's credentials are being used by more than one person
- Likely a violation of the vendor's database agreement and an unauthorized use of your resources
- Things to look for:
- Near simultaneous use of a set of credentials for a wide variety of databases
- Wide geographic use of a set of credentials over a short period of time
- "Excessive" use of one set of credentials
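The first warning sign above, near-simultaneous use of one set of credentials from different places, can be checked mechanically. The sketch below flags any user whose credentials appear from two different client addresses within a chosen time window. The record shape, the function name, and the five-minute window are assumptions for illustration; real logs would first need the user ID and timestamp extracted.

```javascript
// entries: [{ user, ip, time }] where time is milliseconds since the epoch.
// Returns the users seen from two different addresses within windowMs.
function findSharedCredentials(entries, windowMs) {
  const byUser = new Map();
  for (const e of entries) {
    if (!byUser.has(e.user)) byUser.set(e.user, []);
    byUser.get(e.user).push(e);
  }
  const flagged = [];
  for (const [user, list] of byUser) {
    list.sort((a, b) => a.time - b.time);
    // In time order, any two close entries with different addresses imply a
    // consecutive pair that is at least as close, so adjacent checks suffice.
    for (let i = 1; i < list.length; i++) {
      if (list[i].ip !== list[i - 1].ip && list[i].time - list[i - 1].time <= windowMs) {
        flagged.push(user);
        break;
      }
    }
  }
  return flagged;
}

const suspicious = findSharedCredentials(
  [
    { user: "alice", ip: "10.0.0.1", time: 0 },
    { user: "alice", ip: "192.0.2.7", time: 60 * 1000 },
    { user: "bob", ip: "10.0.0.2", time: 0 },
    { user: "bob", ip: "10.0.0.2", time: 30 * 1000 },
  ],
  5 * 60 * 1000
);
console.log(suspicious);
```

A flagged account is only a starting point for investigation: a patron moving between a lab workstation and a laptop would trigger the same pattern.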
Proxies for Filtering
- What this section is...
- ...a discussion of technology solutions to fulfill policy requests
- What this section is not...
- ...a debate about the pros and cons of Internet filtering
- ...a review of strictly content-based filters for library workstations
Fake a proxy server to prevent access to all but authorized sites
- Workstations that are to be used only to access particular sites.
1. Invalid Proxy Server
- Put in an invalid proxy server and then "exclude" the lists of sites for which you want to allow access. (Andrew Mutch of The Library Network [Michigan])
- Users receive a browser message that the host name could not be found; here the invalid proxy "host name" was set to "This host is limited to..." so that the error doubles as a notice
2. Fake Proxy Server
- #1 is good, but the error message is vague and confusing.
- Run a fake proxy server on a specific port on a UNIX box which simply displays an HTML page.
- Create a HTTP-response-in-a-file (/usr/local/sorry.cat-html in this example):
HTTP/1.0 200 Ok
Content-type: text/html

<HTML>
<HEAD><TITLE>Can't go there</TITLE></HEAD>
<BODY><P>Sorry -- you can't get there from this workstation.</P></BODY>
</HTML>
- Add a line to your services file: fakeproxy 8080/tcp
- Add a line to your inetd.conf file: fakeproxy stream tcp nowait httpusr /bin/cat cat /usr/local/sorry.cat-html
- Restart your inetd server with a HUP signal.
3. Use a PAC file
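The slide gives no detail for this option, so the following is only one possible reading: a PAC file that lets the workstation reach a short list of permitted hosts directly and sends every other request to the fake proxy from option 2. The host names `catalog.college.edu` and `fakeproxy.college.edu` are hypothetical; port 8080 matches the services entry above. Plain string comparison is used instead of the browser's helper functions so the script can also be tried outside a browser.

```javascript
function FindProxyForURL(url, host) {
  // Hypothetical list of the sites this workstation is permitted to reach.
  var allowed = ["catalog.college.edu", "firstsearch.oclc.org"];
  var h = host.toLowerCase();
  for (var i = 0; i < allowed.length; i++) {
    if (h === allowed[i]) {
      return "DIRECT";
    }
  }
  // Everything else goes to the fake proxy, which answers with the "sorry" page.
  return "PROXY fakeproxy.college.edu:8080";
}

console.log(FindProxyForURL("http://catalog.college.edu/", "catalog.college.edu"));
console.log(FindProxyForURL("http://www.example.com/", "www.example.com"));
```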
Gaming / Web-based E-mail, Chat / etc.
- Dan Lester's site of Chat, Web Email, and Game Playing Sites
The following sites provide services that some libraries wish to block. These are not blocked on content, but on the type of service they provide. Blocking these sites enables libraries to keep their public web stations available for research and information retrieval. The intention is to block only the parts of sites that provide the forbidden services.
Setting up Apache to block requests to certain sites
ProxyBlock
- Syntax
- ProxyBlock <word/host/domain list>
- Compatibility
- ProxyBlock is only available in Apache 1.2 and later.
The ProxyBlock directive specifies a list of words, hosts and/or domains, separated by spaces. HTTP, HTTPS, and FTP document requests to sites whose names contain matched words, hosts or domains are blocked by the proxy server. The proxy module will also attempt to determine IP addresses of list items which may be hostnames during startup, and cache them for match test as well. Example:
ProxyBlock joes-garage.com some-host.co.uk rocky.wotsamattau.edu
'rocky.wotsamattau.edu' would also be matched if referenced by IP address.
Adapted from Apache Module mod_proxy documentation.
Other forms of filtering
- Filtering based on Content-type headers; browser string headers
- Prevent certain file types from being downloaded or stop specified web browsers from functioning.
- Removing Cookie headers, ad graphics
- Block personalization functions and web advertising. Protect privacy by removing identifying information from requests.
- Virus protection on web requests
- Scan incoming files for viruses to prevent them from being downloaded to local machines.
- Character code translation
- Translate web pages on the fly into different character sets.
Authentication systems
How web clients authenticate to servers
- The most popular form is called "Basic Authentication"
- The client makes a normal request for a page. The server determines that authentication is required for that page.
- The server returns a WWW-Authenticate header, and the browser displays a login box with the realm string supplied by the server.
WWW-Authenticate: Basic realm="WallyWorld"
- The browser accepts the login and password from the user, creates a string in the form "<login>:<password>", encodes it with Base-64, and sends that in an Authorization header back to the server with the same URL request.
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
- The server decodes the Base-64 string, separates the login and password, and checks the credentials.
- Problem? Easy to decode! Base-64 is an encoding, not encryption, so anyone who can see the header can recover the password.
- Other forms exist, but are not widely supported.
- Recommendation?
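The encoding step described above is easy to reproduce, and just as easy to reverse, which is why Basic Authentication by itself offers no confidentiality. The sketch below assumes Node.js, whose built-in `Buffer` handles Base-64; the function names are illustrative.

```javascript
// Build the value of an Authorization header for Basic Authentication.
function basicAuthHeader(login, password) {
  return "Basic " + Buffer.from(login + ":" + password, "utf8").toString("base64");
}

// Reverse the process, as the server (or an eavesdropper) would.
function decodeBasicAuth(headerValue) {
  const encoded = headerValue.replace(/^Basic\s+/i, "");
  const decoded = Buffer.from(encoded, "base64").toString("utf8");
  const colon = decoded.indexOf(":"); // split at the first colon only
  return { login: decoded.slice(0, colon), password: decoded.slice(colon + 1) };
}

console.log(basicAuthHeader("Aladdin", "open sesame"));
console.log(decodeBasicAuth("Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="));
```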
Sources for authentication
- Integrated Library Systems
- Barcode recognition
- Flat-file username/password
- Login test against POP/IMAP servers
- LDAP directory service
- Kerberos, Netware NDS, Microsoft Networking
What can Apache handle?
Search the Apache Module Registry for "authentication":
http://modules.apache.org/search?search=Authentication&query=true
- NT Domain
- LDAP
- Kerberos
- Radius
- TACACS+
- Various databases
- "External Authentication"
Proxies for Remote Resource Access
Basic Theory
- Vendors use IP address limitations; Site puts a proxy server in the range of valid IP addresses
- Just because you can do it doesn't mean it is legal
- Bandwidth considerations: traffic may pass through your Internet connection twice!
Is this a "reverse proxy"?
No. The term "Reverse Proxy" is used to describe the functions of a proxy server that sits between the Internet and the origin server.
- Mimics the functions of the origin server to such an extent that clients don't know they are not talking to the origin server.
- Used to distribute requests, provide a fail-safe service, or protect internal servers.
Authentication step
- This is the most difficult step. How do we allow only authorized users to access the proxy server, and consequently the remote databases?
Transparent versus non-transparent proxy servers
- Transparent Proxy
- A proxy that passes requests and responses unmodified, except as required for proxy authentication.
- Non-transparent Proxy
- A proxy that somehow modifies the request or response to provide some added value to the client or user.
- Rewriting Proxies
- A special form of the Non-transparent Proxy server which examines the URLs in HTML documents passing through the proxy, and rewrites them to point back to the proxy server
- "http://firstsearch.oclc.org/dbname=WorldCat;graphics=low;FSIP" becomes "http://proxy.college.edu/firstsearch/dbname=WorldCat;graphics=low;FSIP" or "http://proxy.college.edu:2049/dbname=WorldCat;graphics=low;FSIP" or "http://80-firstsearch.oclc.org.proxy.college.edu/dbname=WorldCat;graphics=low;FSIP"
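The three rewritten forms above differ only in where the origin host is encoded: in the path, in the port number, or in the host name. A sketch of each, assuming the standard `URL` class is available (as in Node.js). The function names, the "firstsearch" path prefix, and the port assignment 2049 are taken from or modeled on the example and are not any product's actual code.

```javascript
const PROXY_HOST = "proxy.college.edu";

// Form 1: the origin host is represented by a path prefix on the proxy.
function rewriteByPath(url, prefix) {
  const u = new URL(url);
  return "http://" + PROXY_HOST + "/" + prefix + u.pathname + u.search;
}

// Form 2: the origin host is represented by a dedicated port on the proxy.
function rewriteByPort(url, proxyPort) {
  const u = new URL(url);
  return "http://" + PROXY_HOST + ":" + proxyPort + u.pathname + u.search;
}

// Form 3: the origin port and host are embedded in the proxy's host name.
function rewriteByHostname(url) {
  const u = new URL(url);
  const port = u.port || (u.protocol === "https:" ? "443" : "80");
  return "http://" + port + "-" + u.hostname + "." + PROXY_HOST + u.pathname + u.search;
}

const original = "http://firstsearch.oclc.org/dbname=WorldCat;graphics=low;FSIP";
console.log(rewriteByPath(original, "firstsearch"));
console.log(rewriteByPort(original, 2049));
console.log(rewriteByHostname(original));
```

A rewriting proxy applies a function like one of these to every link it finds in a page, which is why malformed HTML or URLs assembled by JavaScript can slip through unrewritten.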
Advantages, Disadvantages
- Transparent Proxy servers...
- ...are less computing intensive because they do not examine the content of each HTML page.
- ...are easier to program than Rewriting Proxies.
- ...require users to reconfigure their browsers (education problem).
- ...may not work with some corporate or commercial Internet Service Providers.
- Rewriting Proxy servers...
- ...require no changes to the user's browser and work with browsers on firewalled networks.
- ...are sensitive to "incorrect" HTML.
- ...may not work with sites using sophisticated JavaScripts.
EZproxy Demonstration
- mkdir /usr/local/ezproxy
- cd /usr/local/ezproxy
- Download http://www.usefulutilities.com/ezproxy/ezproxy.bin
- mv ezproxy.bin ezproxy
- chmod 755 ezproxy
- ./ezproxy -m
- ./ezproxy -c
- ./ezproxy
- Point web browser to http://proxy.college.edu:2048/
Login/Password: testuser/testpass
What's happening here?
URLs are being rewritten on the page to point back through the EZproxy server.
http://www.altavista.com/ maps to
http://concerto.law.uconn.edu:2050/
http://doc.altavista.com/help/search/search_help.shtml maps to
http://concerto.law.uconn.edu:2052/help/search/search_help.shtml
http://shopping.altavista.com/ maps to
http://concerto.law.uconn.edu:2054/
http://tools.altavista.com/ maps to
http://concerto.law.uconn.edu:2055/
Mappings are stored in the "ezproxy.hst" file.
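A minimal sketch of the port-based mapping, assuming the table is held in memory as a dictionary. The layout of the "ezproxy.hst" file itself is not shown on the slide, so the dictionary and the function names here are illustrative only.

```python
from urllib.parse import urlsplit

PROXY = "concerto.law.uconn.edu"

# Host-to-port assignments copied from the slide above.
HOST_TO_PORT = {
    "www.altavista.com": 2050,
    "doc.altavista.com": 2052,
    "shopping.altavista.com": 2054,
    "tools.altavista.com": 2055,
}
# Reverse table: which origin host does a request arriving on a port belong to?
PORT_TO_HOST = {port: host for host, port in HOST_TO_PORT.items()}


def to_proxy_url(url: str) -> str:
    """Rewrite an origin URL so that it points back through the proxy."""
    parts = urlsplit(url)
    port = HOST_TO_PORT.get(parts.hostname)
    if port is None:
        return url  # host is not in the table: leave the link alone
    suffix = parts.path + (f"?{parts.query}" if parts.query else "")
    return f"http://{PROXY}:{port}{suffix}"


def to_origin_url(port: int, path: str) -> str:
    """Map a request received on a proxy port back to the origin server."""
    return f"http://{PORT_TO_HOST[port]}{path}"


print(to_proxy_url("http://doc.altavista.com/help/search/search_help.shtml"))
print(to_origin_url(2054, "/"))
```

Each new origin host needs its own port, which is why this scheme uses more server resources and runs into firewalls that block non-standard ports.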
EZproxy's new scheme
Last year, a version of EZproxy was released with a new scheme for rewriting URLs. Using wildcard DNS entries and the "Host:" header, URLs can be rewritten as follows:
http://www.altavista.com/ maps to
http://80-www.altavista.com.ezproxy.law.uconn.edu/
http://doc.altavista.com/help/search/search_help.shtml maps to
http://80-doc.altavista.com.ezproxy.law.uconn.edu/help/search/search_help.shtml
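With this scheme the proxy can recover the origin server from the "Host:" header alone, because the wildcard DNS entry sends every name under the proxy's domain to the same machine. A rough sketch of that decoding step, assuming the "port-host" naming shown above (the function name is hypothetical):

```python
from typing import Tuple

PROXY_SUFFIX = ".ezproxy.law.uconn.edu"  # proxy domain from the slide


def origin_from_host_header(host: str) -> Tuple[str, int]:
    """Recover (origin host, origin port) from a rewritten Host: header."""
    host = host.lower()
    if not host.endswith(PROXY_SUFFIX):
        raise ValueError(f"not a proxied host name: {host}")
    label = host[: -len(PROXY_SUFFIX)]         # e.g. "80-www.altavista.com"
    port_text, _, origin = label.partition("-")  # split at the first hyphen only
    if not port_text.isdigit() or not origin:
        raise ValueError(f"unexpected host name format: {host}")
    return origin, int(port_text)


print(origin_from_host_header("80-www.altavista.com.ezproxy.law.uconn.edu"))
```

Because only the first hyphen is treated as the separator, origin host names that contain hyphens of their own still decode correctly.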
How is this better?
- Eliminates non-standard ports; allows EZproxy services to be used through restrictive corporate firewalls
- Reduces resource requirements for EZproxy server (fewer ports)
What do you give up?
- Can no longer run EZproxy and a standard web server on the same machine
Adding a new site
- Edit ezproxy.cfg to add:
T Database Title
U http://url.to.database/search/
D domains.used.by.vendor.com
D another.domain.com
- Construct the URL on your web page: http://proxy.college.edu:2048/login?url=http://somedb.com/search
T LegalTRAC from Gale
U http://infotrac.galegroup.com/itweb/nellco_main
D galegroup.com
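If a library lists many databases, the "login?url=" links can be generated rather than typed by hand. A small sketch, assuming the proxy address from the demonstration and a hypothetical list of (title, URL) pairs:

```python
from html import escape

PROXY_LOGIN = "http://proxy.college.edu:2048/login"  # address from the slide

# Hypothetical database list; titles and URLs would come from ezproxy.cfg.
DATABASES = [
    ("LegalTRAC from Gale", "http://infotrac.galegroup.com/itweb/nellco_main"),
]


def starting_point_url(database_url: str) -> str:
    """Build the link that sends the user through the proxy login."""
    return f"{PROXY_LOGIN}?url={database_url}"


def html_link(title: str, database_url: str) -> str:
    """Return an HTML anchor for the library's database page."""
    href = escape(starting_point_url(database_url), quote=True)
    return f'<a href="{href}">{escape(title)}</a>'


for title, url in DATABASES:
    print(html_link(title, url))
```

The database URL is appended exactly as written, matching the form shown on the slide.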
Adding authentication
- By barcode pattern
- Add line to ezproxy.usr:
28888#######::
- By text file
- Add line to ezproxy.usr:
::file=myusers.txt
- By IMAP server login
- Add line to ezproxy.usr:
::imap=imapserver.college.edu
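The barcode pattern can be pictured as a simple template match. The sketch below assumes that each "#" stands for exactly one digit and that every other character must match literally; the slide does not spell out the pattern rules, so treat this as an illustration rather than a description of EZproxy's own matching.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a barcode template into a regular expression.

    Assumption: '#' means "any single digit"; all other characters are literal.
    """
    pieces = [r"\d" if ch == "#" else re.escape(ch) for ch in pattern]
    return re.compile("".join(pieces))


BARCODE = pattern_to_regex("28888#######")  # pattern from the slide


def barcode_ok(barcode: str) -> bool:
    """True when the whole barcode fits the template."""
    return BARCODE.fullmatch(barcode) is not None


print(barcode_ok("288881234567"))  # fits the template
print(barcode_ok("299991234567"))  # wrong prefix
```

A pattern check only confirms that a number looks like a library barcode; it does not confirm that the barcode belongs to a current patron.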
Alternatives to Proxy Servers for remote resource access
- Put up an authenticated web page with passwords from your database vendors.
- Use another form of authentication with the vendor other than IP address:
- Referrer URL
- Vendor-provided script
- Virtual Private Networks (VPNs) / Point-to-Point Tunnelling Protocol (PPTP) / Layer-2 Tunnelling Protocol (L2TP)
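As a rough picture of the "Referrer URL" option: the vendor accepts a request only when the browser reports that the user arrived from a page the library has registered. How any given vendor implements this is not covered here; the header dictionary, the registered page address, and the function name below are all assumptions for illustration.

```python
from typing import Iterable, Mapping

# Hypothetical address of the library's password-protected database page.
REGISTERED_PAGES = ("http://www.college.edu/library/databases/",)


def referer_allowed(headers: Mapping[str, str],
                    registered: Iterable[str] = REGISTERED_PAGES) -> bool:
    """Vendor-side check: did the request come from a registered page?"""
    referer = headers.get("Referer", "")  # HTTP spells the header "Referer"
    return any(referer.startswith(prefix) for prefix in registered)


print(referer_allowed({"Referer": "http://www.college.edu/library/databases/index.html"}))
print(referer_allowed({}))
```

The Referer header is supplied by the client, so it can be forged or omitted; this method is only as strong as the protection on the referring page.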
Interception Proxies
- Network devices, or software running in network devices, that operate at the Internet and/or Transport layer.
- Look at the network traffic for web transactions
- Rather than routing the transaction to the final destination, requests are sent to a proxy server
Why do we care?
- The good...
- Requires no changes to client browser or the construction of special URLs.
- ...the bad...
- According to network purists, an interception proxy violates the fundamental concept of the "invisible" network.
- ...and the ugly.
- The installation of interception proxies breaks IP address recognition for access to remote databases.
How do they work?
Back to the discussion of network layers:
| Application Layer | |
| Transport Layer | Port |
| Internet Layer | IP address |
| Data Link (or Network Interface) Layer | Ethernet Address |
Remember where proxy servers were located? Interception Proxies operate at a different location!
Follow the network path: normal web transaction
- Request leaves client machine destined towards origin server.
- Network routers and switches move the transaction to the origin server.
- Received by the origin server: the IP address of the request is that of the client machine.
Follow the network path: proxy server transaction
- Request leaves client machine destined towards proxy server (as directed by the browser configuration).
- Network routers and switches move the transaction to the proxy server.
- Received by the proxy server; request leaves proxy destined towards the origin server.
- Network routers and switches move the transaction to the origin server.
- Received by the origin server: the IP address of the request is that of the proxy server.
Follow the network path: normal web transaction with an interception proxy
- Request leaves client machine destined towards origin server (no changes to the browser configuration).
- Network routers and switches move the transaction towards the origin server, but one of the routers detects that the request is an HTTP transaction. Using a proprietary protocol, that router passes the request to the interception proxy.
- Received by the interception proxy server; request leaves proxy destined towards the origin server.
- Network routers and switches move the transaction to the origin server.
- Received by the origin server: the IP address of the request is that of the interception proxy server.
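The three network paths differ in one detail that matters for licensed databases: the address the origin server sees. The toy model below makes that concrete. All addresses are hypothetical documentation addresses, and the vendor's licence check is reduced to a simple set lookup.

```python
from typing import Optional

CLIENT_IP = "192.0.2.10"       # hypothetical library workstation
PROXY_IP = "192.0.2.200"       # hypothetical proxy the browser is configured to use
INTERCEPT_IP = "198.51.100.5"  # hypothetical interception proxy (for example, at the ISP)

# Addresses the library has registered with the database vendor.
LICENSED_ADDRESSES = {CLIENT_IP, PROXY_IP}


def address_seen_by_origin(client_ip: str,
                           configured_proxy: Optional[str] = None,
                           interception_proxy: Optional[str] = None) -> str:
    """Return the source address of the request that reaches the origin server."""
    source = client_ip
    if configured_proxy is not None:
        source = configured_proxy    # the proxy makes the request on the client's behalf
    if interception_proxy is not None:
        source = interception_proxy  # a router diverted the HTTP request on the way out
    return source


def vendor_grants_access(source_ip: str) -> bool:
    """IP address recognition as the vendor would apply it."""
    return source_ip in LICENSED_ADDRESSES


for label, seen in [
    ("normal", address_seen_by_origin(CLIENT_IP)),
    ("proxy", address_seen_by_origin(CLIENT_IP, configured_proxy=PROXY_IP)),
    ("interception", address_seen_by_origin(CLIENT_IP, interception_proxy=INTERCEPT_IP)),
]:
    print(label, seen, vendor_grants_access(seen))
```

In the third case the vendor sees an address the library never registered, which is how an interception proxy breaks IP address recognition.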
What to do?
- Ask your network staff if they are intending to install an interception proxy. Remind them of the effect of installing an interception proxy.
- Ask your network staff to ask your ISP if they have intentions of installing an interception proxy.
- Prepare a list of IP addresses for services which will need to be "excluded" from the interception proxy function.
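An exclusion list is easiest to hand to network staff as a set of addresses or address ranges. The sketch below shows the kind of check a device would apply to each outgoing request, using Python's standard ipaddress module. The addresses are placeholders, and real interception devices have their own configuration syntax.

```python
import ipaddress

# Hypothetical exclusion list: one vendor network and one single host.
EXCLUDED = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def bypasses_interception(destination_ip: str) -> bool:
    """True when traffic to this destination should skip the interception proxy."""
    address = ipaddress.ip_address(destination_ip)
    return any(address in network for network in EXCLUDED)


print(bypasses_interception("203.0.113.45"))
print(bypasses_interception("198.51.100.18"))
```

Vendors sometimes serve one database from several addresses, so the list is worth checking against each vendor's documentation.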
Free Proxy Servers
For each package, we'll look at:
- Availability
- Proxy Type
- Platforms
- Pricing
- Proxy Characteristics
- Bandwidth Conservation
- Statistics
- Filtering
- Remote Resource Access
- Comments
Apache
"This module implements a proxy/cache for Apache. It implements proxying capability for FTP, CONNECT (for SSL), HTTP/0.9, and HTTP/1.0. The module can be configured to connect to other proxy modules for these and other protocols."
http://www.apache.org/docs/mod/mod_proxy.html
Availability
- Proxy Type
- Transparent Proxy, Non-transparent Proxy, Rewriting Proxy (with code development)
- Platforms
- Available pre-compiled for a wide variety of UNIX, Windows, Macintosh, and other operating systems
- Pricing
- Freely available -- no usage restrictions. Commercial support available.
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes.
- Filtering
- Yes, but limited to hosts/ip-addresses. (See ProxyBlock in documentation.)
- Apache's mod_proxy module is extendable with mod_perl to modify the outgoing request (for example, stripping off headers in order to create an anonymizing proxy) or to modify the returned page.
- Remote Resource Access
- Yes. Authentication via flat-file and UNIX database files.
- Authorization extendable using Apache APIs.
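Host-based filtering of the kind described above amounts to comparing the requested host name against a list of blocked terms. The sketch below is loosely modelled on that idea and is not Apache's implementation; the blocked terms and the function name are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical block list: a full domain and a bare word.
BLOCKED_TERMS = ("games.example", "chat")


def is_blocked(url: str) -> bool:
    """True when the requested host name contains any blocked term."""
    host = (urlsplit(url).hostname or "").lower()
    return any(term in host for term in BLOCKED_TERMS)


print(is_blocked("http://www.games.example/play"))
print(is_blocked("http://www.loc.gov/"))
```

Matching on bare words is blunt: in this sketch "chat" would also block an unrelated host whose name merely contains those letters, which is one reason host lists need regular review.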
Comments
- May require knowledge of the chosen server platform in order to configure and support.
- Best used if your site is already running the Apache web server.
Squid
"Squid is a high-performance proxy caching server for web clients, supporting FTP, gopher, and HTTP data objects."
http://www.squid-cache.org/Doc/FAQ/FAQ-1.html#ss1.1
Availability
- Proxy Type
- Transparent Proxy, Non-transparent Proxy, Rewriting Proxy (with code development)
- Platforms
- Available for a wide variety of UNIX platforms. Must be compiled.
- "Recent Versions of Squid will compile and run on Windows/NT with the GNU-Win32 package. However, Squid does not yet perform well on Windows/NT."
- Pricing
- Freely available (GNU General Public License). Commercial support available.
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes. More than you could ever want.
- Filtering
- Yes. Very flexible.
- Remote Resource Access
- Yes.
Comments
- Very powerful and complex software. It may be overkill for most situations.
- Requires knowledge of UNIX environment and the process of compiling and installing programs under UNIX.
Libproxy
"Libproxy is a simple rewriting pass-through proxy system designed especially for libraries."
From the README file
Availability
- Proxy Type
- Rewriting Proxy
- Platforms
- UNIX. Based on Apache, Perl, Mod_Perl, and other open source tools.
- Pricing
- Free. Available through author: Richard Goerwitz (richard@goerwitz.com)
Proxy Characteristics
- Bandwidth Conservation
- No, but can be connected to a caching proxy server.
- Statistics
- Yes.
- Filtering
- No, but can be connected to a proxy server with filters.
- Remote Resource Access
- Yes.
Comments
- Used at Brown University and elsewhere.
- Full source included; can be modified to fit the local environment.
- Requires knowledge of UNIX environment and the process of compiling and installing programs under UNIX. Knowledge of Perl and Apache will be helpful.
Delegate
"DeleGate is a multi-purpose application level gateway, or a proxy server which runs on multiple platforms. DeleGate mediates communication of various protocols, applying cache and conversion for mediated data, controlling access from clients and routing toward servers. It translates protocols between clients and servers, merging several servers into a single server view with aliasing and filtering."
http://wall.etl.go.jp/delegate/
Availability
- Proxy Type
- Transparent Proxy, Non-transparent proxy
- Platforms
- Unix, Windows and OS/2
- Pricing
- Freely available -- no usage restrictions.
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes.
- Filtering
- Yes. Includes a specialized (proprietary) programmable language to filter requests, responses, and entities.
- Remote Resource Access
- Unclear. Includes a feature called "Proxy by URL Redirection" which may form the basis of a rewriting proxy.
Comments
- Much more than a web proxy. Also proxies "FTP, Telnet, NNTP, SMTP, POP, IMAP, LPR, LDAP, ICP, DNS, SSL, Socks, and more."
- Active development (new version released this week) and active user mailing list.
- Documentation lags behind released version.
Commercial Proxy Servers
Microsoft Internet Security and Acceleration (ISA) Server
"The enterprise firewall and Web cache server."
http://www.microsoft.com/isaserver/
Availability
- Proxy Type
- Transparent Proxy, Non-transparent Proxy, Rewriting Proxy (with code development)
- Platforms
- Windows 2000
- Pricing
- US $1,499
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes.
- Filtering
- Supports filtering by domain names
- Supports 3rd-party plug-ins with categorized lists of sites to be blocked
- Remote Resource Access
- No.
Comments
- Formerly the Microsoft Proxy Server
- More than a proxy server:
ISA Server includes an extensible, multilayer enterprise firewall featuring security with packet-, circuit-, and application-level traffic screening, stateful inspection, broad application support, integrated virtual private networking (VPN), system hardening, integrated intrusion detection, smart application filters, transparency for all clients, advanced authentication, secure server publishing, and more.
iPlanet Proxy Server
"The iPlanet Web Proxy Server is a powerful system for caching and filtering Web content and boosting network performance."
http://www.iplanet.com/products/iplanet_proxy/home_2_1_1ae.html
Availability
- Proxy Type
- Transparent Proxy, Non-transparent Proxy, Rewriting Proxy (with code development)
- Platforms
- HP-UX, AIX, Solaris, Windows NT, Windows 2000 Server, Windows 2000 AS
- Pricing
- Unknown
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes.
- Filtering
- Yes (URLs, content, content types, and outgoing header filters)
- Supports 3rd-party plug-ins with categorized lists of sites to be blocked.
- Remote Resource Access
- Yes. Supports LDAP-based authentication.
WinProxy
"WinProxy provides everything you need to simultaneously connect all your computers to the Internet through just one simple connection with your existing service provider."
http://www.winproxy.com/
Availability
- Proxy Type
- Transparent Proxy, plus Network Address Translation (NAT)
- Platforms
- Windows 95/98, or NT (3.51 or higher)
- Pricing
- $799.95 (unlimited user)
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes.
- Filtering
- Yes. Either "Site blacklisting" or "Site whitelisting"
- Supports 3rd-party plug-ins with categorized lists of sites to be blocked.
- Remote Resource Access
- No.
Comments
- Much more than a web proxy server. Also acts as a firewall and a gateway machine for an internal LAN.
EZproxy
"EZproxy provides the easiest way for libraries to extend web-based licensed databases to their remote users."
http://www.usefulutilities.com/ezproxy
Availability
- Proxy Type
- Rewriting Proxy
- Platforms
- Linux, Windows NT
- Custom compiling for other UNIX platforms available
- Pricing
- US $495 per server (plus sales tax in Arizona)
Proxy Characteristics
- Bandwidth Conservation
- No. No caching built in. EZproxy can be chained to another proxy server.
- Statistics
- Yes. Statistics stored in Common Logfile Format without usernames
- Filtering
- No.
- Remote Resource Access
- Yes. Authentication via IP address, text file of usernames/passwords, FTP/IMAP/POP login, or an extensible API based on HTTP requests. Latest version also includes authentication by LDAP, Radius, and INNOPAC Patron-API as well as the ability to limit database access to specific groups of users.
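Because the statistics are in Common Logfile Format, they can be read by standard log analyzers or by a short script. A minimal parsing sketch follows; the sample log lines are invented for the example, and the user field is "-" because usernames are not recorded.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# host ident user [time] "request" status size
CLF_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented sample entries, not real log data.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2001:13:55:36 -0400] "GET /login?url=http://somedb.com/search HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/2001:13:55:40 -0400] "GET /search HTTP/1.0" 200 512',
    '192.0.2.77 - - [10/Oct/2001:14:02:11 -0400] "GET /search HTTP/1.0" 304 -',
]


def parse_clf(line: str) -> Optional[dict]:
    """Parse one log line into a dictionary, or return None if it does not match."""
    match = CLF_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def requests_per_host(lines: Iterable[str]) -> Counter:
    """Count requests by client address, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_clf(line)
        if entry is not None:
            counts[entry["host"]] += 1
    return counts


print(requests_per_host(SAMPLE_LOG))
```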
Obvia
"Remote Database Access (RDA) Service: The Complete Solution for all your Remote Authentication Needs"
http://www.obvia.com/
Availability
- Proxy Type
- Rewriting Proxy
- Platforms
- Obvia-hosted, Windows/NT
- Pricing
- Varies
Proxy Characteristics
- Bandwidth Conservation
- No.
- Statistics
- Yes. Comprehensive usage statistics module.
- Filtering
- No.
- Remote Resource Access
- Yes. Authentication via flat-file, ILS (DRA or INNOPAC), Kerberos, LDAP, POP3, Netware, Windows NT, or custom interface.
Comments
- Offers turn-key service from one of their data centers (the ultimate in bandwidth-saving!).
WebManager from Sagebrush Corp
"WebManagerTM is award-winning Internet content management software. With WebManager, students will learn more and finish research faster because they'll encounter fewer distractions and delays. WebManager's caching speeds Internet access up to ten times faster - increasing the number of students you can effectively serve, without increasing the number of computers required!"
http://www.sagebrushcorp.com/tech/webmanager.cfm
Availability
- Proxy Type
- Non-transparent proxy
- Platforms
- Windows NT, Solaris, Linux
- Pricing
- Unknown
Proxy Characteristics
- Bandwidth Conservation
- Yes.
- Statistics
- Yes. Comprehensive, detailed reports.
- Filtering
- Yes. "Allow lists" and "Deny lists"
- Remote Resource Access
- No.
Remote Patron Authentication from epixtech
"As libraries increasingly use Web-based content subscription services, they need to authenticate remote patrons who use the Web to access online resources. Remote Patron Authentication (RPA) from epixtech enables libraries to authenticate patrons outside a library facility before providing them access to restricted resources."
http://www.epixtech.com/product/rpa.htm
Availability
- Proxy Type
- Not a proxy
- Platforms
- "Core application: Web server with CGI 1.1 support"
- "Reporting component: Intel-based Windows NT server, ODBC capable SQL database management system"
- Pricing
- Unknown
Proxy Characteristics
- Bandwidth Conservation
- No.
- Statistics
- Yes (summarized).
- Filtering
- No.
- Remote Resource Access
- Yes. Authentication via 3M SIP1/SIP2 or ILS patron authentication.
Comments
- Client interacts directly with database vendor after authentication
- Provides access to vendor databases via three authentication methods:
- Referring URL
- URL-Embedded Username and Password
- Database Vendor provided Script
- Clients view the database list in a framed or non-framed environment. JavaScript is required.
Web Access Management (WAM) from Innovative Interfaces
Availability
- Proxy Type
- Transparent Proxy, Rewriting Proxy
- Platforms
- Requires INNOPAC ILS software
- Pricing
- Varies
Proxy Characteristics
- Bandwidth Conservation
- No.
- Statistics
- Yes (summarized).
- Filtering
- No.
- Remote Resource Access
- Yes. Authentication by INNOPAC Patron Validation.
Comments
- Product started as a rewriting proxy server.
- Added a non-rewriting proxy server with a PAC file.
- Next version will integrate EZproxy (become a rewriting proxy server again).
Resources
- Presentation web site
- http://www.PandC.org/proxy/
Lists of Links
- Authentication and Authorization list of links
- http://library.smc.edu/rpa.htm
- Access Log Analyzers
- http://www.uu.se/Software/Analyzers/Access-analyzers.html
- Apache Authentication Modules
- http://modules.apache.org/search?search=Authentication&query=true
- Google's List of Proxy Resources
- http://directory.google.com/Top/Computers/Software/Internet/Servers/Proxy/
- Yahoo List of Proxy Resources
- http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Servers/Proxies/
Specific Documents
- "Library Web Proxy Use Survey Results." Information Technology and Libraries 20, no. 4 (2001): 172-178.
- http://www.lita.org/ital/ital2004.html#anchor167252
- Hypertext Transfer Protocol -- HTTP/1.1 (RFC 2616)
- http://www.rfc-editor.org/rfc/rfc2616.txt
- Internet Web Replication and Caching Taxonomy
- http://www.rfc-editor.org/rfc/rfc3040.txt
- Caching Tutorial for Web Authors and Webmasters
- http://www.wdvl.com/Internet/Cache/
- Pass-Through Proxying as a Solution to the Off-Campus Web-Access Problem (Brown University)
- http://www.brown.edu/Facilities/CIS/Network_Services/libproxy/
- Navigator Proxy Auto-Configure File Format
- http://www.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html
- Blacklist of sites that provide email, chat, and game-playing
- http://www.riverofdata.com/tools/blacklist.htm
- How to Lock-In IP Addresses on Netscape Navigator
- http://northville.lib.mi.us/tech/lockin.htm
- How to Lock-In Web Addresses on Internet Explorer 5
- http://tech.tln.lib.mi.us/lockinie.htm
- The Web Proxy Auto-Discovery Protocol
- http://www.web-cache.com/Writings/Internet-Drafts/draft-ietf-wrec-wpad-01.txt (expired Internet-Draft)
- Luotonen, Ari, Web Proxy Servers (Prentice Hall, 1998).
- http://www.amazon.com/exec/obidos/ASIN/0136806120/
- CNI White Paper on Authentication and Access Management Issues in Cross-organizational Use of Networked Information Resources
- http://www.cni.org/projects/authentication/authentication-wp.html
- TRUSTe Privacy Resource Guide
- http://www.truste.org/bus/pub_resourceguide.html
Proxy-related tools
- Display HTTP Headers
- http://www.web-caching.com/showheaders.html
- Cacheability Query
- http://www.ircache.net/cgi-bin/cacheability.py
User Documentation
- Proxy Auto-Configuration setup (selected sites)
- Central Michigan University: http://ocls.cmich.edu/remoteindex.htm
- Northwestern University: http://www.library.nwu.edu/help/proxy/
- Purdue University: http://www.lib.purdue.edu/proxy/
Free Proxy Servers
- Apache
- http://httpd.apache.org/
- Squid
- http://www.squid-cache.org/
- DeleGate
- http://www.delegate.org/delegate/
Commercial Proxy Servers
- Microsoft ISA Server
- http://www.microsoft.com/isaserver/
- iPlanet Proxy Server
- http://www.iplanet.com/products/iplanet_proxy/home_proxy.html
- WinProxy
- http://www.winproxy.com/
- EZproxy from Useful Utilities
- http://www.usefulutilities.com/ezproxy/
- Remote Database Access from Obvia
- http://www.obvia.com/
- WebManager from Sagebrush Corp
- http://www.sagebrushcorp.com/tech/webmanager.cfm
- Remote Patron Authentication from epixtech
- http://www.epixtech.com/products/rpa.asp
- Web Access Management (WAM) from Innovative Interfaces
- http://www.iii.com/html/products/p_map.shtml
Future Developments
Project Shibboleth
Shibboleth, a project of Internet2's Middleware Architecture Committee for Education, is investigating technology to support inter-institutional authentication and authorization for access to Web pages. The intent is to support, as much as possible, the heterogeneous security systems in use on campuses today, rather than mandating use of particular schemes like Kerberos or X.509-based PKI.
Adapted from http://middleware.internet2.edu/shibboleth/
Shibboleth \Shib"bo*leth\, n. [Heb. shibb[=o]leth an ear of corn, or a stream, a flood.]
- A word which was made the criterion by which to distinguish the Ephraimites from the Gileadites. The Ephraimites, not being able to pronounce sh, called the word sibboleth. See Judges xii.
- Hence, the criterion, test, or watchword of a party; a party cry or pet phrase.
Adapted from http://middleware.internet2.edu/shibboleth/why-shibboleth.html and Webster's Revised Unabridged Dictionary (1913).
Project Goals
- "...a standards based vendor independent web access control infrastructure that can operate across institutional boundaries."
- "This project seeks to define [...] standards for the secure exchange of trusted interoperable information which could be used in authorization decisions."
- "The goal is to develop and promulgate an architecture, which can then be used in a multi-vendor, open source, standards based environment."
What does it mean for us?
- Information providers and network infrastructure groups come to an agreement (a protocol) on how to exchange authentication, authorization, and demographic information.
- The user is in control over how much information is released about himself or herself on a provider-by-provider basis.
- Eliminates the use of proxy servers for remote resource access.
- Enhances statistics to include what demographic group is using what resources.
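The idea of user-controlled release on a provider-by-provider basis can be pictured as a filter applied before any information leaves the campus. The sketch below is purely conceptual and does not represent the Shibboleth protocol, which is still being designed; the attribute names, provider names, and policy table are all invented.

```python
# Hypothetical attributes the campus holds about one user.
USER_ATTRIBUTES = {
    "affiliation": "student",
    "department": "Law",
    "name": "Pat Example",
}

# Hypothetical per-provider choices made by that user.
RELEASE_POLICY = {
    "vendor-a.example": {"affiliation"},
    "vendor-b.example": {"affiliation", "department"},
}


def attributes_for(provider: str) -> dict:
    """Return only the attributes the user agreed to release to this provider."""
    allowed = RELEASE_POLICY.get(provider, set())  # unknown providers get nothing
    return {key: value for key, value in USER_ATTRIBUTES.items() if key in allowed}


print(attributes_for("vendor-a.example"))
print(attributes_for("vendor-b.example"))
```

A provider that receives only "affiliation" and "department" can still report usage by demographic group without ever learning who the individual user is.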
What is the status of the project?
- An Internet2 working group is designing the protocol.
- A call for participants was released a year ago, and the "Club Shib" participants have been selected.
Wrap-up / Evaluation