legitimate responses. Thus, both of these adversaries necessarily
create an observable artifact: the victim, if it waits sufficiently
long, will receive both the attacker’s packet and the legitimate
reply. (We employed a similar form of this anomaly to detect
TCP reset injection attacks [7].) Only an in-path adversary,
capable of blocking and modifying packets, can prevent the
legitimate reply from reaching the victim.
Although in-path approaches have more power, on-path ap-
proaches have several advantages, making their use appealing
for attackers. Censorship tools commonly use on-path rather
than in-path techniques to ease deployment and to make the
system failure- and load-tolerant, as the censorship system
can then operate on a traffic mirror rather than the live
traffic.1 Similarly, on-path WiFi packet injection works without
modifying drivers, but suppressing legitimate replies requires
hardware-specific access to the low-level air interface to detect
and squelch a broadcast in flight.
B. Vulnerability of current implementations
Systems that implement the DNS standard [2], [8] are
vulnerable to on-path spoofing, despite the presence of the
later legitimate reply, because the resolver attempts to “get
the answer as quickly as possible” [2]. Upon receiving a reply,
the resolver checks the ID field in the header and then will
“verify that the question section corresponds to the information
currently desired” [8]. Clearly, these steps do not provide
sufficient diligence, as the design goal of quickly returning
an answer causes the resolver to return the attacker’s value.
DNSSEC adds cryptographic authentication to prevent the
acceptance of invalid DNS replies [9], [10], [3]. Although
attackers cannot redirect victims using spoofed replies, they
can still mount denial-of-service attacks, which will often
suffice to satisfy a censor’s objective. The denial of service occurs because
the resolver will attempt to process the attacker’s packet,
determine that the DNSSEC signature is absent or invalid,
and immediately return “Bogus”, depriving the client of the
ability to connect to the host corresponding to the name.
Because of this behavior, DNSSEC does not suffice as a
replacement for a mechanism such as Hold-On: resolvers need
to maintain an open port for a period of time in order to
attempt to validate all responses received for a query, not just
the first.
C. Other related work
DNS has a long history of poisoning attacks [4], [5], [11],
[12]. Besides those mentioned above, several previous efforts
counter DNS poisoning attacks by increasing the difficulty of
blindly injecting DNS answers [13], [14], [6], [15]. These
efforts focus on deterring off-path injection by increasing the
information entropy required to match a valid DNS reply.
Our work, however, addresses the threat from attackers that
can observe queries, which allows them to circumvent these
previous defenses.
1 TCP traffic control tools also have used this vantage point. For example,
Comcast deployed Sandvine’s Policy Traffic Switch devices to disrupt BitTorrent
traffic in an on-path configuration [7], even though the devices themselves
supported in-path operation.
Poisoning attacks based on on-path injection are not limited
to DNS. Malicious injections, such as TCP RSTs and ICMP
unreachable messages, have been used in both individual
attacks [7] and ISP-scale censorship [16], [17]. Similar to
DNS poisoning, traffic sent from the remote peer of the
legitimate communication will still arrive at the victim after
these malicious injections. Therefore, the use of Hold-On
mechanisms similar to those explored here will likely have
applicability to deter these malicious injections as well.
III. HOLD-ON AND DILIGENT VALIDATION
As a consequence of the inability of on-path attackers
to suppress legitimate replies, we investigate the benefits of
stub resolvers or forwarders waiting for a “Hold-On” period
to allow subsequent legitimate replies to arrive. Part of this
procedure entails validating replies with more diligence when
a resolver receives two or more replies for the same question.
This improvement effectively protects against DNS injections
in the case of non-disruptive attacks, where the attacker lacks
the ability to block either the resolver’s request or the
authority’s response.
A. Assumptions
We predicate our approach on the following assumptions,
which we view as reasonable based on our experience with
censorship activity that employs on-path injectors:
(1) The user under attack or censorship is able to access
a trustworthy recursive resolver outside of the attacked or
censored networks, such as Google Public DNS [18] or
OpenDNS [19], services that users already commonly employ. In particular, in
the censorship case, we assume that the censor does not
block access to this resolver, which we argue is a plausible
assumption given the large number (158,364 in January 2012)
of known open resolvers [20].
(2) The attacker/censor injects fake responses according to
a blacklist rather than a whitelist. That is, the user knows
some non-sensitive domain names that can be used to measure
normal (free of attacker interference) communication be-
tween the client (stub resolver) and the DNS server (recursive
resolver).
(3) The attacker injects fake replies as quickly as possible
in order to ensure that the replies arrive earlier than the
legitimate ones. Hence, the injection mechanism will transmit
immediately upon seeing the client’s request. The mechanism
cannot wait for the arrival of the legitimate reply from the
server, because doing so risks the injected reply arriving after
the legitimate one and failing to take effect.
(4) The attacker cannot construct a properly signed
DNSSEC response.
Based on these assumptions, the stub resolver can estimate
when it expects legitimate replies to arrive, in order to discern
between injected replies and correct ones.
[Fig. 1 (diagram): after issuing a DNS query, the stub resolver starts a 15-second Hold-On timer and expects a reply with TTL 20 after roughly 200 ms. An injected reply with TTL 231 arrives after 10 ms and another with TTL 8 after 50 ms; both mismatch the expected TTL and are ignored. The legitimate reply with TTL 20 arrives after 250 ms, matches, is accepted, and stops the Hold-On timer.]
Fig. 1. Hold-On while waiting for a legitimate DNS reply.
B. Hold-On and Validation
The stub resolver or forwarder needs to first learn the
expected RTT and hop-count distance (in terms of expected
TTL) associated with communication involving its remote
recursive resolver, which it does using active measurement.
(Recall that we presume the remote resolver lies outside of
the censored network.) Upon start-up, the resolver issues a
series of non-sensitive queries to measure the initial RTT and
TTL seen on arriving replies for entries cached at the remote
resolver by repeatedly querying for the same name. During this
period, the resolver maintains an open port for an additional
period to validate that an on-path adversary has not tampered
with these initial measurements by injecting replies. During
normal operation, the stub resolver also continually updates
these values based on passive measurements of its ongoing
traffic.
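A minimal sketch of this start-up measurement in Python, assuming a hypothetical probe() helper that sends one query for a non-sensitive, cacheable name to the remote recursive resolver and returns the observed (RTT in seconds, IP TTL) pair; the name, sample count, and helper are illustrative assumptions, not the authors' code.

from typing import Callable, List, Set, Tuple

def measure_baseline(probe: Callable[[str], Tuple[float, int]],
                     name: str = "www.example.com",
                     samples: int = 10) -> Tuple[float, Set[int]]:
    """Estimate the expected RTT and the set of expected IP TTLs.

    The first sample is discarded because the remote resolver may have had
    to recurse rather than answer from its cache (cf. Sec. V-A); the RTT
    estimate is the minimum of the remaining samples, and the TTL set
    normally collapses to a single value.
    """
    observations: List[Tuple[float, int]] = [probe(name) for _ in range(samples)]
    observations = observations[1:]                       # drop the cache-warming query
    expected_rtt = min(rtt for rtt, _ in observations)    # minimum observed RTT
    expected_ttls = {ttl for _, ttl in observations}      # expected IP TTL value(s)
    return expected_rtt, expected_ttls

During normal operation, the same update can be fed from passive observations of ongoing traffic rather than active probes.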
Given estimates of the legitimate RTT and TTL, the resolver
works as shown in Figure 1:
(1) After issuing a DNS query, the resolver starts its Hold-
On timer. A natural setting for the timer would be 15 seconds,
as this reflects the default timeout value for both the BIND
resolver [21, p. 108] and Microsoft Windows [22]. Naturally,
in most cases the resolver will return much sooner, unless the
remote resolver is unreachable.
(2) When the resolver expects a DNSSEC-protected re-
sponse, for each reply it performs a local signature validation.
It returns to the client the first fully validated reply. If it
finds all replies as either Insecure, Bogus, or Indeterminate [3,
p. 20], and the Hold-On timer expires, the resolver returns a
DNSSEC error.
(3) Without DNSSEC, upon receiving a reply before the
Hold-On timer expires, the resolver performs two additional
validations:
• Timing. Does the reply arrive too early? The test we
use here is for replies that arrive sooner than half of
the expected (measurement-derived) RTT. We note that
the resolver could also determine this threshold more
precisely by measuring known injections in the resolver’s
actual environment, generating queries for censored
names to non-existent resolvers.
• TTL. Does the TTL field in the IP header have the
expected value(s)? We assume that the route between the
remote DNS server and the client is stable over at least short
periods (such as 5 minutes), so we can obtain and update
the expected TTLs through periodic measurement.
Upon observing either of the above mismatches, the resolver
ignores the response and continues to wait. If on the other hand
a reply arrives before the Hold-On time expires and validates
based on the above tests, the resolver accepts the new reply
and returns it to the client.
If the stub resolver receives no valid reply before the Hold-
On timer expires, it returns the latest non-validating reply it
observed. Doing so means that in the presence of significantly
changed network conditions, users experience delay, but not
inadvertent blocking of their access.
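A minimal sketch of this Hold-On loop in Python, assuming a queue of incoming reply objects and a validate() predicate implementing the timing and TTL tests above; DNSSEC handling and the retry schedule of Section V are omitted, and all names are illustrative.

import queue
import time
from typing import Callable, Optional

def hold_on(replies: "queue.Queue",
            validate: Callable[[object], bool],
            hold_on_seconds: float = 15.0) -> Optional[object]:
    """Return the first reply that validates; at timer expiry, fall back to
    the latest non-validating reply seen (or None if nothing arrived)."""
    deadline = time.monotonic() + hold_on_seconds
    last_reply = None
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return last_reply                  # Hold-On timer expired: fall back
        try:
            reply = replies.get(timeout=remaining)
        except queue.Empty:
            return last_reply                  # nothing more arrived in time
        if validate(reply):
            return reply                       # stop the Hold-On timer early
        last_reply = reply                     # mismatch: ignore and keep waiting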
In most cases, the resolver will not wait until the Hold-
On timer times out; it will stop waiting upon receipt of a
legitimate response. Thus, generally this approach will not
cause extra delay, except in the case that network conditions
have changed such that legitimate replies now return sooner
and without DNSSEC protection.
IV. FEASIBILITY ASSESSMENT
To assess the viability of our approach, we investigate the
phenomenon of observing multiple replies for a single DNS
query in both a censored network and a non-censored network.
In the latter, we look at whether normal DNS traffic generates
such replies; that is, whether Hold-On and validation could
cause significant false positives. In the censored network,
we assess how different the injected replies appear from the
legitimate ones, which indicates whether the approach could
suffer from significant false negatives.
A. Observation in an uncensored network
We can view use of the Hold-On approach as a form of
anomaly detector, looking for a condition that represents an
attack. Although it is clear that a packet-injection based DNS
attack must create an anomaly where the client receives two
distinct replies, we must ensure that normal DNS traffic does
not generate these anomalies: in some cases, if it proves
impossible to heuristically distinguish an attacker’s packet
from a legitimate reply lacking a DNSSEC signature, there may
be no effective resolution beyond simply noting the attack and
returning no valid answer. If the resolver simply ignores
replies it cannot validate (and returns the last such, if no valid
replies are received), then such anomalies arising in legitimate
traffic will not in fact cause any problems. If, however, the
resolver flags such replies as reflecting an attack, then these
false positives will incur a degree of collateral damage.
We developed a Bro [23] IDS policy script to directly
detect anomalous secondary DNS replies. This script operates
by tracking all DNS requests and matching replies, checking
any subsequent reply that arrives within a 1-minute timeout2
to determine whether the number of records in the reply
and the contents of each are unchanged. We validated that
this script accurately detects attack packets using traces of
injected packets we captured by sending DNS query requests
that transited a network that uses DNS-based packet-injection
censorship.
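The Bro script itself is not reproduced here; the following Python sketch captures the same detection logic under simplifying assumptions: the trace has already been parsed into per-reply tuples of (query ID, question, answer records, timestamp), demultiplexed per client/resolver pair.

from typing import Dict, Iterable, List, Tuple

WINDOW = 60.0   # seconds: examine any later reply arriving within the 1-minute timeout

Reply = Tuple[int, str, Tuple[str, ...], float]   # (qid, question, answers, timestamp)

def find_anomalous_duplicates(replies: Iterable[Reply]) -> List[Tuple[Reply, Reply]]:
    """Flag pairs of replies to the same (query ID, question) whose record sets differ."""
    first_seen: Dict[Tuple[int, str], Reply] = {}
    anomalies: List[Tuple[Reply, Reply]] = []
    for reply in replies:
        qid, question, answers, ts = reply
        key = (qid, question)
        if key not in first_seen:
            first_seen[key] = reply
            continue
        prev = first_seen[key]
        # A later reply within the window whose record count or contents changed
        if ts - prev[3] <= WINDOW and set(answers) != set(prev[2]):
            anomalies.append((prev, reply))
    return anomalies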
We ran this script against 6 days of normal DNS traffic
captured at ICSI’s border, consisting of 11,700,000 DNS
requests.3 During this period we observed no DNS anomalies
that would create a false positive, only deliberate testing
intended to trigger a DNS censorship system.
Running on a 1.5 hour trace gathered in August 2011
at the UC Berkeley campus border (a total of 15.2M DNS
transactions,4 both inbound and outbound), we observed two
benign authorities that triggered the basic anomaly detector.
The first server, an authority server for the BBC, returned two
distinct replies for the same query for several names. Although
distinct in value, both values were within the same /24 subnet.
The second, an authority for businessinsider.com, returned
two values for the same query. The first reply was a CNAME
to an external domain with the root authority information
included in the reply, while the second was a SERVFAIL
containing the same CNAME but no authority or additional
fields, triggering the alert.
We also observed both multiple incidents of DNS censorship
(caused by local users configured to use resolvers in a censored
country) and a few false positives due to script bugs that would
not disrupt a Hold-On resolver.
B. Observation in a censored network
To assess potential false negatives, we trigger a DNS censor-
ship system to inject DNS replies with sensitive domain names
(such as twitter.com). We generated these measurements from
within the censored network, communicating with destinations
outside the censored network. To differentiate the legitimate
from the injected replies, we first query a non-existent DNS
server outside the censored network with sensitive names, and
we receive only injected replies. We then query an open DNS
server with non-sensitive names (such as www.mit.edu), by
which we receive only legitimate replies.
2 We chose a longer timeout to be conservative in this analysis, attempting
to detect potential anomalies that would not affect a resolver using Hold-On.
3 We excluded lookups issued by an ICSI measurement tool.
4 Excluding a known high-volume DNS crawler used for research.
Fig. 2. Comparison of arrival times for legitimate packets (green cross) and
packets injected by the censor (red plus).
Fig. 3. Comparison of TTLs for legitimate packets (green cross) and packets
injected by the censor (red plus).
With this method, we collected a data trace including
≈ 100,000 queries and corresponding replies over 9 days.
Figures 2 and 3 show comparisons of the RTTs and TTLs observed
for legitimate DNS packets and for packets injected by the DNS
censor. Distinguishing the legitimate packets from the injected
ones appears straightforward. Most injected packets arrive much earlier than
legitimate ones because the injector and the client reside within
the same ISP, while the DNS server resides in another country.
We found that the IP TTL values of the legitimate DNS
responses were quite stable over a period of 9 days (either 44
or 42), but the TTL value of the injected packets varied in the
range of [0–255], presumably to avoid simple filtering.
In another 10-hour trace, we selected one pair of (RTT, TTL)
every 5 minutes, and used this pair as the expected RTT and TTL
to validate other packets in the following time window. In
our experiment, we varied the TTL and RTT thresholds
to evaluate the false positive rate and false negative rate,
as shown in Table I. For example, if we set the TTL threshold
to 1 (that is, the reply is valid only if TTL ∈
[expected TTL − 1, expected TTL + 1]) and set the RTT threshold
to 0.5 · expected RTT (that is, the reply is valid only
if it does not arrive 0.5 · expected RTT earlier than expected),
then the approach generates no false positives or
negatives.
[Fig. 4 (diagram): stub resolver (end user) → DNS forwarder → DNS proxy → recursive resolver, with an attacker (injector) on the path.]
Fig. 4. Environment of DNS proxy
TTL threshold   RTT threshold   FP (%)   FN (%)
0–2             0.5             0        0
3               0.5             0        0.01
4               0.5             0        0.06
5               0.5             0        0.07
6               0.5             0        0.10
7               0.5             0        0.11
2               0.1             5.96     0
2               0.2             1.53     0
2               0.3–0.8         0        0
2               0.9             0        0.31
TABLE I
FALSE POSITIVE (FP) AND FALSE NEGATIVE (FN) RATES
CORRESPONDING TO DIFFERENT THRESHOLDS FOR IP TTL AND RTT
DIFFERENCES.
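A sketch of how such a threshold sweep can be computed from a labeled trace; the data layout (lists of (RTT, IP TTL) samples labeled legitimate or injected, plus a single expected TTL per window) is our assumption, not the authors' code.

from typing import Iterable, Tuple

def accepted(rtt: float, ip_ttl: int, expected_rtt: float, expected_ttl: int,
             ttl_threshold: int, rtt_threshold: float) -> bool:
    """Return True if a reply passes both the TTL and the early-arrival tests."""
    ttl_ok = abs(ip_ttl - expected_ttl) <= ttl_threshold
    rtt_ok = (expected_rtt - rtt) < rtt_threshold * expected_rtt   # not "too early"
    return ttl_ok and rtt_ok

def fp_fn_rates(legit: Iterable[Tuple[float, int]], injected: Iterable[Tuple[float, int]],
                expected_rtt: float, expected_ttl: int,
                ttl_threshold: int, rtt_threshold: float) -> Tuple[float, float]:
    """FP: legitimate replies rejected; FN: injected replies accepted."""
    legit, injected = list(legit), list(injected)
    fp = sum(not accepted(r, t, expected_rtt, expected_ttl, ttl_threshold, rtt_threshold)
             for r, t in legit) / len(legit)
    fn = sum(accepted(r, t, expected_rtt, expected_ttl, ttl_threshold, rtt_threshold)
             for r, t in injected) / len(injected)
    return fp, fn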
V. IMPLEMENTATION AND EVALUATION
We implemented a DNS proxy to explore how Hold-On
works in practice. The proxy operates as a DNS forwarder that
aims to protect against DNS injection by on-path adversaries,
as illustrated in Figure 4.
A. Design and implementation of a DNS proxy
To estimate the expected RTT and TTL to/from the remote
recursive resolver, the proxy issues requests upon start-up
for non-sensitive names.5 To estimate the RTT, the resolver
queries the same name multiple times, selecting the minimum
of RTT observed. The resolver excludes the first query, be-
cause it might include additional time consumed by the server
to resolve the name recursively, rather than answering from its
cache. The expected TTL(s) should typically remain constant,
but could vary due to routing changes.6 We assume that the set
of expected TTLs does not vary within a measurement period (see
below).
5 It could instead simply monitor initial queries for duplicate replies, and
formulate its estimates from those that engender only a single reply. Doing so
would also help with combating injection from attackers who have different
goals than censorship.
6 A potentially pathological case would be replies that vary across a set
of arriving TTL values due to the use of per-flow load-balancing that causes
different replies to take different routes.
Algorithm 1 Hold-On and Validation for DNS Proxy
Timeout ← 5
while GetDNSRequestFromClient(request) do
  retry ← 1; gotAnyReply ← false
  repeat
    ForwardRequestToResolver(Resolver, request)
    StartHoldOnTimer(retry · Timeout)
    while NOT Timeout and GetDNSReply(replyPkt) do
      gotAnyReply ← true   {from server or injector}
      if ValidateDNSSEC_OK(replyPkt) then
        SendDNSReplyToClient(replyPkt.msg)
        StopHoldOnTimer()
        return
      else if ValidateTTL_OK(replyPkt.ipTTL) and ValidateRTT_OK(replyPkt.RTT) then
        SendDNSReplyToClient(replyPkt.msg)
        StopHoldOnTimer()
        return
      else
        DropAndLog(replyPkt)
      end if
    end while
    retry ← retry + 1
  until retry == 3
  if gotAnyReply then
    {No valid reply; return the latest non-validating reply}
    SendDNSReplyToClient(replyPkt.msg)
  end if
end while
In our current implementation, the set contains only one
value. During normal operation, a separate thread repeats
this measurement (see § IV-B) periodically (such as every
5 minutes) and updates the expected RTT and TTL values
to adapt to potential changes in network conditions.
Algorithm 1 details how the proxy processes DNS
requests and replies. When the proxy receives a DNS request
from its client (end user or DNS forwarder), it forwards the
request to the remote recursive resolver and starts the Hold-On
timer. We set the initial value of the timer to 5 seconds; if no
legitimate reply has arrived when the timer expires, we reset the timer to
10 s for the second try, and similarly to 15 s for the third try.
If the proxy receives a DNS reply (from either the remote re-
cursive resolver or an injector), it validates both the TTL and the RTT
against the expected values (the expected TTLs could include
multiple values because of multiple paths to the resolver).
If the request is DNSSEC-enabled, the corresponding reply
should also be validated using its DNSSEC records (not yet imple-
mented in our prototype). For DNSSEC-disabled requests,
ValidateDNSSEC_OK always returns false. ValidateRTT_OK
and ValidateTTL_OK return true if, respectively:
expected RTT − replyPkt.RTT < 0.5 · expected RTT
replyPkt.ipTTL ∈ expectedTTLs
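These two checks transcribe directly into Python (a sketch under the same naming, not the prototype's code):

from typing import Set

def validate_rtt_ok(reply_rtt: float, expected_rtt: float) -> bool:
    # Reject replies that arrive "too early": more than half the expected RTT sooner.
    return (expected_rtt - reply_rtt) < 0.5 * expected_rtt

def validate_ttl_ok(reply_ip_ttl: int, expected_ttls: Set[int]) -> bool:
    # The expected set can hold several TTLs if multiple paths reach the resolver.
    return reply_ip_ttl in expected_ttls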
Fig. 5. For cached names, comparison of query times with Hold-On enabled
(green square) and Hold-On disabled (red cross). Most of the points in the
two sets overlap, so only a green line is visible.
B. Evaluation
To evaluate the delay increased by Hold-On, we use two
sets of domain names for cached or uncached queries. For
the first set, we used a fixed prefix (“www”) with different
levels of domains appended (“.” or “null”, “com”, “org”, “net”,
“cn”, “jp”, “de”, “google.com”, “mit.edu”). We queried each
name once to let the recursive resolver cache
the result, and then measured the query time for subsequent
queries. As shown in Figure 5, the time needed for each query
approximates the RTT.
We generated the second set randomly, using a nonsense prefix,
again with different levels of domains appended. We query each
name only once.
Because the resolver must fully resolve the name for
each query, the query time includes RTT (from the proxy
to the recursive resolver) as well as the subsequent time
consumed by the resolver (from recursive resolver to au-
thoritative name servers). As shown in Figure 6, the time
varies considerably across different levels of domains. For exam-
ple, it is much faster to resolve “nonsense12847.google.com”
than “nonsense2132323.de” because the resolver (in our case,
Google Public DNS) has a low-latency path to a google.com
authority, but a higher latency path to the .de authority.
As shown in Figures 5 and 6, in both cases (cached and
uncached queries) Hold-On and validation do not introduce any
apparent delay.
VI. DISCUSSION
The current Hold-On implementation operates as a stub
resolver to a known-uncensored remote recursive resolver,
which enables accurate initial measurement of RTT and TTL
to enable subsequent detection of unexpected replies. Without
DNSSEC, it relies on attack packets exhibiting significant
differences in IP TTL or RTT in order to distinguish them
from legitimate replies.
Fig. 6. For uncached names, comparison of query times with Hold-On enabled
(green square) and Hold-On disabled (red cross).
[Fig. 7 (diagram): the injector observes the stub resolver’s query to the recursive resolver and times its injected reply to coincide with the legitimate reply.]
Fig. 7. The attacker carefully crafts the packets’ TTLs and injects them with
the expected timing.
If the attacker carefully crafts the attack packets’ TTLs, and
only injects them just before the likely arrival of the legitimate
reply (see Figure 7), our Hold-On technique without DNSSEC
will fail to distinguish between the legitimate reply and the
attack packet.7 However, the resolver can still detect that an at-
tack has likely occurred (due to the differing responses). If the
resolver cannot correctly determine the legitimate response, it
should conservatively not return any answer—though doing so
still enables attackers to impose denial-of-service. Hold-On
can provide a robust defense against on-path injection only
when combined with DNSSEC.
7 The adversary must take the risk that the injected packets arrive later than
the legitimate ones.
Extending our implementation to recursive resolvers will
take some additions. However, most recursive resolvers al-
ready maintain estimated RTTs to different authority servers
in order to select between nameservers. This timing data,
combined with tracking of TTLs, can enable detection (if not
protection) of injected packets.
In addition, as developed so far, our particular Hold-On
mechanism does not suffice to protect against on-path adver-
saries operating in environments that users only occasionally
visit, such as public WiFi hotspots; we lack knowledge of
“non-sensitive” domains to look up in order to obtain estimates
of RTT and TTL in the absence of injection. In such
environments, we can in principle instead look for symptoms
of injection (not only for DNS, but potentially for DHCP, ARP,
and TCP), alert the user in some fashion, and terminate any
associated connections. Clearly, developing a robust, workable
approach along these lines will take significant investigation.
Finally, a recursive resolver should also continue to listen
for replies even after it processes and forwards an answer,
in order to detect attacks. If the resolver detects an attack,
it could revert to a more paranoid mode (performing active
measurements of RTTs and TTLs for all new resolvers to
detect subsequent attacks). In addition, it can flush all cache
entries associated with the attack if it is unable to distinguish
between the legitimate reply and the attack. (If it can tell that
the later reply is the legitimate one, the resolver can use it to
replace the cache entry.) Doing so limits the damage from the
injected packet to the improperly returned question. It also
means that the user, upon noticing a failure or redirection
for a potentially censored site, can simply attempt to reload
the site, this time receiving the valid reply for their answer,
circumventing the censorship. Finally, the extended recursive
resolver can also itself issue an additional response (with the
second reply’s contents) in response to the client’s query, so
that any client or stub that also detects mismatched replies will
be aware of the attack.
VII. CONCLUSION
Packet injection by on-path adversaries poses a particular
threat against DNS. One particular use of such injection
comes from those building DNS-based censorship tools. Even
DNSSEC-validating implementations may be vulnerable, as
injected packets can cause denial-of-service if the resolver fails
to wait for the legitimate reply before returning “Bogus”.
Such adversaries cannot however block the receipt of the
legitimate DNS reply in addition to their injection. We show
that this artifact—differing replies to the same question—
appears to occur only very rarely outside of an actual attack.
Thus, stub resolvers can use this artifact’s presence to detect when packet
injection occurs.
In addition, unless censors take additional steps, imperfec-
tions in their packet injection tools can allow resolvers to not
only detect injection attacks, but also to potentially distinguish
between legitimate replies and injected answers based on
artifacts in the IP header’s TTL field and in the round-trip time
when receiving replies. Accordingly, we propose that resolvers
should utilize a Hold-On period, waiting for additional replies.
If a reply arrives too early, with an unexpected IP TTL,
or fails to validate (if DNSSEC validation occurs), the stub
waits for the potential arrival of a subsequent legitimate reply
before proceeding. For censors who take steps to match the
TTL and RTT of their injections with those expected for
legitimate replies, Hold-On still allows detection that injection
has occurred; and, for DNSSEC-enabled resolution, prevents
the censor from imposing denial-of-service on obtaining the
correct reply.
Finally, we developed a DNS forwarder that implements
Hold-On, and demonstrated that this forwarder is effective
at distinguishing legitimate replies from those injected by a
widely deployed network censorship system. Our evaluation
found that use of such forwarding imposes minimal additional
latency on undisrupted DNS queries, and thus Hold-On has
promise for enabling more robust DNS resolution in the face
of on-path censorship.
VIII. ACKNOWLEDGMENTS
This work was sponsored in part by NSF grants 0831780,
0905631, and CNS-1015835. Special thanks to Robin Sommer
for assistance with the UC Berkeley network traces.
REFERENCES
[1] G. Lowe, P. Winters, and M. L. Marcus, “The Great DNS Wall of China,”
Dec. 2007. http://cs.nyu.edu/~pcw216/work/nds/final.pdf.
[2] P. Mockapetris, “Domain Names—Concepts and Facilities, RFC 1034.”
http://www.ietf.org/rfc/rfc1034.txt.
[3] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose,
“Protocol Modifications for the DNS Security Extensions,” 2005.
http://www.ietf.org/rfc/rfc4035.txt.
[4] S. M. Bellovin, “Using the Domain Name System for System Break-
ins,” in Proceedings 5th USENIX Security Symposium, 1995.
[5] D. Kaminsky, “Black ops 2008: It’s the end of the cache as we know
it,” Black Hat USA, 2008.
[6] D. Dagon, M. Antonakakis, P. Vixie, T. Jinmei, and W. Lee, “Increased
DNS Forgery Resistance Through 0x20-bit Encoding: security via leet
queries,” in Proceedings of the 15th ACM conference on Computer and
communications security, pp. 211–222, ACM, 2008.
[7] N. Weaver, R. Sommer, and V. Paxson, “Detecting Forged TCP Reset
Packets,” in NDSS’09, 2009.
[8] P. Mockapetris, “Domain Names—Implementation and Specification,
RFC 1035.” http://www.ietf.org/rfc/rfc1035.txt.
[9] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose,
“DNS Security Introduction and Requirements,” 2005.
http://www.ietf.org/rfc/rfc4033.txt.
[10] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose,
“Resource Records for the DNS Security Extensions,” 2005.
http://www.ietf.org/rfc/rfc4034.txt.
[11] D. Dagon, N. Provos, C. Lee, and W. Lee, “Corrupted DNS Resolution
Paths: The Rise of a Malicious Resolution Authority,” in Proceedings
of Network and Distributed Security Symposium, 2008.
[12] P. Roberts, “Chinese DNS Tampering A Big Threat To Internet Security,”
2010. https://threatpost.com/en_us/blogs/chinese-dns-tampering-big-threat-internet-security-112410.
[13] J. G. Høy, “Anti DNS Spoofing-Extended Query ID (XQID).”
http://www.jhsoft.com/dns-xqid.htm, 2008.
[14] D. Atkins and R. Austein, “Threat Analysis of the Domain Name System
(DNS),” RFC3833, 2004.
[15] B. Hubert and R. Mook, “Measures for Making DNS
More Resilient against Forged Answers, RFC 5452,” 2009.
http://tools.ietf.org/html/rfc5452.
[16] R. Clayton, S. Murdoch, and R. Watson, “Ignoring the great firewall of
china,” in Privacy Enhancing Technologies, pp. 20–35, Springer, 2006.
[17] J. R. Crandall and E. Barr, “Conceptdoppler: A weather tracker for
internet censorship,” in 14th ACM Conference on Computer and Com-
munications Security, 2007.
[18] “Google Public DNS.” http://code.google.com/speed/public-dns/.
[19] …
Congestion Avoidance and Control∗
Van Jacobson†
Lawrence Berkeley Laboratory
Michael J. Karels‡
University of California at Berkeley
November, 1988
Introduction
Computer networks have experienced an explosive growth
over the past few years and with that growth have come severe
congestion problems. For example, it is now common to see
internet gateways drop 10% of the incoming packets because
of local buffer overflows. Our investigation of some of these
problems has shown that much of the cause lies in transport
protocol implementations (not in the protocols themselves):
The ‘obvious’ ways to implement a window-based transport
protocol can result in exactly the wrong behavior in response
to network congestion. We give examples of ‘wrong’ behav-
ior and describe some simple algorithms that can be used to
make right things happen. The algorithms are rooted in the
idea of achieving network stability by forcing the transport
connection to obey a ‘packet conservation’ principle. We
show how the algorithms derive from this principle and what
effect they have on traffic over congested networks.
In October of ’86, the Internet had the first of what became
a series of ‘congestion collapses’. During this period, the
data throughput from LBL to UC Berkeley (sites separated
by 400 yards and two IMP hops) dropped from 32 Kbps to 40
bps. We were fascinated by this sudden factor-of-thousand
drop in bandwidth and embarked on an investigation of why
things had gotten so bad. In particular, we wondered if the
4.3BSD (Berkeley Unix) TCP was misbehaving or if it could
be tuned to work better under abysmal network conditions.
The answer to both of these questions was “yes”.
∗ Note: This is a very slightly revised version of a paper originally
presented at SIGCOMM ’88 [11]. If you wish to reference this work, please
cite [11].
† This work was supported in part by the U.S. Department of Energy
under Contract Number DE-AC03-76SF00098.
‡ This work was supported by the U.S. Department of Commerce,
National Bureau of Standards, under Grant Number 60NANB8D0830.
Since that time, we have put seven new algorithms into
the 4BSD TCP:
(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn’s clamped retransmit backoff
(vii) fast retransmit
Our measurements and the reports of beta testers suggest that
the final product is fairly good at dealing with congested con-
ditions on the Internet.
This paper is a brief description of (i) – (v) and the ra-
tionale behind them. (vi) is an algorithm recently developed
by Phil Karn of Bell Communications Research, described
in [15]. (vii) is described in a soon-to-be-published RFC
(Arpanet “Request for Comments”).
Algorithms (i) – (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS SPP con-
nection) should obey a ‘conservation of packets’ principle.
And, if this principle were obeyed, congestion collapse would
become the exception rather than the rule. Thus congestion
control involves finding places that violate conservation and
fixing them.
By ‘conservation of packets’ we mean that for a connec-
tion ‘in equilibrium’, i.e., running stably with a full window
of data in transit, the packet flow is what a physicist would
call ‘conservative’: A new packet isn’t put into the network
until an old packet leaves. The physics of flow predicts that
systems with this property should be robust in the face of
congestion.1 Observation of the Internet suggests that it was
not particularly robust. Why the discrepancy?
1 A conservative flow means that for any given time, the integral of the
packet density around the sender–receiver–sender loop is a constant. Since
packets have to ‘diffuse’ around this loop, the integral is sufficiently
continuous to be a Lyapunov function for the system. A constant function trivially
meets the conditions for Lyapunov stability, so the system is stable and any
superposition of such systems is stable. (See [2], chap. 11–12 or [20], chap. 9
for excellent introductions to system stability theory.)
There are only three ways for packet conservation to fail:
1. The connection doesn’t get to equilibrium, or
2. A sender injects a new packet before an old packet has
exited, or
3. The equilibrium can’t be reached because of resource
limits along the path.
In the following sections, we treat each of these in turn.
1 Getting to Equilibrium: Slow-start
Failure (1) has to be from a connection that is either starting
or restarting after a packet loss. Another way to look at the
conservation property is to say that the sender uses acks as
a ‘clock’ to strobe new packets into the network. Since the
receiver can generate acks no faster than data packets can get
through the network, the protocol is ‘self clocking’ (fig. 1).
Self clocking systems automatically adjust to bandwidth and
delay variations and have a wide dynamic range (important
considering that TCP spans a range from 800 Mbps Cray chan-
nels to 1200 bps packet radio links). But the same thing that
makes a self-clocked system stable when it’s running makes
it hard to start — to get data flowing there must be acks to
clock out packets but to get acks there must be data flowing.
To start the ‘clock’, we developed a slow-start algorithm
to gradually increase the amount of data in transit.2 Al-
though we flatter ourselves that the design of this algorithm
is rather subtle, the implementation is trivial — one new state
variable and three lines of code in the sender:
• Add a congestion window, cwnd, to the per-connection
state.
• When starting or restarting after a loss, set cwnd to one
packet.
• On each ack for new data, increase cwnd by one packet.
• When sending, send the minimum of the receiver’s
advertised window and cwnd.
2 Slow-start is quite similar to the cute algorithm described in [13]. We
didn’t know about cute at the time we were developing slow-start, but we
should have—cute preceded our work by several months.
When describing our algorithm at the Feb., 1987, Internet Engineering
Task Force (IETF) meeting, we called it soft-start, a reference to an electronics
engineer’s technique to limit in-rush current. The name slow-start
was coined by John Nagle in a message to the IETF mailing list in March,
’87. This name was clearly superior to ours and we promptly adopted it.
Actually, the slow-start window increase isn’t that slow:
it takes time R log₂ W, where R is the round-trip time and
W is the window size in packets (fig. 2). This means the
window opens quickly enough to have a negligible effect
on performance, even on links with a large bandwidth–delay
product. And the algorithm guarantees that a connection will
source data at a rate at most twice the maximum possible
on the path. Without slow-start, by contrast, when 10 Mbps
Ethernet hosts talk over the 56 Kbps Arpanet via IP gateways,
the first-hop gateway sees a burst of eight packets delivered at
200 times the path bandwidth. This burst of packets often puts
the connection into a persistent failure mode of continuous
retransmissions (figures 3 and 4).
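A sketch of the three slow-start rules in Python, with the window kept in packets for clarity; the class and its names are ours, for illustration only.

class SlowStartSender:
    """Congestion-window bookkeeping for the slow-start rules above."""

    def __init__(self, receiver_window: int):
        self.awnd = receiver_window    # receiver's advertised window (packets)
        self.cwnd = 1                  # start (or restart after a loss) at one packet

    def on_loss(self) -> None:
        self.cwnd = 1                  # restarting: rebuild the ack clock from scratch

    def on_ack_for_new_data(self) -> None:
        self.cwnd += 1                 # each ack opens the window by one packet

    def send_window(self) -> int:
        return min(self.awnd, self.cwnd)

Because each arriving ack both clocks out a replacement packet and opens cwnd by one more, the amount in flight roughly doubles per round trip, which is where the R log₂ W opening time noted above comes from.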
2 Conservation at equilibrium:
round-trip timing
Once data is flowing reliably, problems (2) and (3) should
be addressed. Assuming that the protocol implementation
is correct, (2) must represent a failure of the sender’s retransmit
timer. A good round trip time estimator, the core of the
retransmit timer, is the single most important feature of any
protocol implementation that expects to survive heavy load.
And it is frequently botched ([26] and [12] describe typical
problems).
One mistake is not estimating the variation, σ_R, of the
round trip time, R. From queuing theory we know that R and
the variation in R increase quickly with load. If the load is
ρ (the ratio of average arrival rate to average departure rate),
R and σ_R scale like (1 − ρ)⁻¹. To make this concrete, if the
network is running at 75% of capacity, as the Arpanet was
in last April’s collapse, one should expect round-trip time to
vary by a factor of sixteen (−2σ to +2σ).
The TCP protocol specification [23] suggests estimating
mean round trip time via the low-pass filter
R ← αR + (1 − α)M
where R is the average RTT estimate, M is a round trip time
measurement from the most recently acked data packet, and
α is a filter gain constant with a suggested value of 0.9. Once
the R estimate is updated, the retransmit timeout interval,
rto, for the next packet sent is set to βR.
The parameter β accounts for RTT variation (see [4], sec-
tion 5). The suggested β = 2 can adapt to loads of at most
30%. Above this point, a connection will respond to load
increases by retransmitting packets that have only been de-
layed in transit. This forces the network to do useless work,
Figure 1: Window Flow Control ‘Self-clocking’
This is a schematic representation of a sender and receiver on high bandwidth networks connected
by a slower, long-haul net. The sender is just starting and has shipped a window’s worth of packets,
back-to-back. The ack for the first of those packets is about to arrive back at the sender (the vertical
line at the mouth of the lower left funnel).
The vertical dimension is bandwidth, the horizontal dimension is time. Each of the shaded boxes is
a packet. Bandwidth × Time = Bits, so the area of each box is the packet size. The number of bits
doesn’t change as a packet goes through the network, so a packet squeezed into the smaller long-haul
bandwidth must spread out in time. The time Pb represents the minimum packet spacing on the
slowest link in the path (the bottleneck). As the packets leave the bottleneck for the destination net,
nothing changes the inter-packet interval, so on the receiver’s net packet spacing Pr = Pb. If the
receiver processing time is the same for all packets, the spacing between acks on the receiver’s net
Ar = Pr = Pb. If the time slot Pb was big enough for a packet, it’s big enough for an ack, so the
ack spacing is preserved along the return path. Thus the ack spacing on the sender’s net As = Pb.
So, if packets after the first burst are sent only in response to an ack, the sender’s packet spacing will
exactly match the packet time on the slowest link in the path.
wasting bandwidth on duplicates of packets that will even-
tually be delivered, at a time when it’s known to be having
trouble with useful work. I.e., this is the network equivalent
of pouring gasoline on a fire.
We developed a cheap method for estimating variation
(see appendix A)3 and the resulting retransmit timer essen-
tially eliminates spurious retransmissions. A pleasant side
effect of estimating β rather than using a fixed value is that
low load as well as high load performance improves, partic-
ularly over high delay paths such as satellite links (figures 5
and 6).
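The spec's filter and the variance-aware timer this section argues for, sketched in Python. The gains below (1/8 for the mean, 1/4 for the deviation, rto = R + 4·dev) are the values that became standard in later TCP implementations, not necessarily the exact constants of appendix A, so read this as an illustration.

class RttEstimator:
    """Retransmit-timeout estimation from per-packet RTT measurements M."""

    def __init__(self, alpha: float = 0.9, beta: float = 2.0):
        self.alpha, self.beta = alpha, beta    # the spec's suggested constants
        self.r = None                          # smoothed RTT estimate R
        self.dev = 0.0                         # smoothed mean deviation of R

    def rto_fixed_beta(self, m: float) -> float:
        """R <- alpha*R + (1 - alpha)*M;  rto = beta*R  (the spec's suggestion)."""
        self.r = m if self.r is None else self.alpha * self.r + (1 - self.alpha) * m
        return self.beta * self.r

    def rto_with_deviation(self, m: float, g: float = 0.125, h: float = 0.25) -> float:
        """Track the deviation as well, instead of relying on a fixed beta."""
        if self.r is None:
            self.r, self.dev = m, m / 2
        else:
            err = m - self.r
            self.r += g * err
            self.dev += h * (abs(err) - self.dev)
        return self.r + 4 * self.dev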
Another timer mistake is in the backoff after a retrans-
mit: If a packet has to be retransmitted more than once, how
should the retransmits be spaced? For a transport endpoint
embedded in a network of unknown topology and with an
3 We are far from the first to recognize that transport needs to estimate both
mean and variation. See, for example, [5]. But we do think our estimator is
simpler than most.
unknown, unknowable and constantly changing population
of competing conversations, only one scheme has any hope
of working—exponential backoff—but a proof of this is be-
yond the scope of this paper.4 To finesse a proof, note that
a network is, to a very good approximation, a linear system.
That is, it is composed of elements that behave like linear op-
erators — integrators, delays, gain stages, etc. Linear system
theory says that if a system is stable, the stability is exponen-
tial. This suggests that an unstable system (a network subject
4 See [7]. Several authors have shown that backoffs ‘slower’ than ex-
ponential are stable given finite populations and knowledge of the global
traffic. However, [16] shows that nothing slower than exponential behav-
ior will work in the general case. To feed your intuition, consider that an IP
gateway has essentially the same behavior as the ‘ether’ in an ALOHA net or
Ethernet. Justifying exponential retransmit backoff is the same as showing
that no collision backoff slower than an exponential will guarantee stability
on an Ethernet. Unfortunately, with an infinite user population even ex-
ponential backoff won’t guarantee stability (although it ‘almost’ does—see
[1]). Fortunately, we don’t (yet) have to deal with an infinite user population.
Figure 2: The Chronology of a Slow-start
[Diagram: successive round-trip times 0R–3R stacked vertically; grey boxes are packets 1–15 and white boxes the corresponding acks, with two new packets generated for each arriving ack.]
The horizontal direction is time. The continuous time line has been chopped into one-round-trip-time
pieces stacked vertically with increasing time going down the page. The grey, numbered boxes are
packets. The white numbered boxes are the corresponding acks. As each ack arrives, two packets
are generated: one for the ack (the ack says a packet has left the system so a new packet is added to
take its place) and one because an ack opens the congestion window by one packet. It may be clear
from the figure why an add-one-packet-to-window policy opens the window exponentially in time.
If the local net is much faster than the long haul net, the ack’s two packets arrive at the bottleneck
at essentially the same time. These two packets are shown stacked on top of one another (indicating
that one of them would have to occupy space in the gateway’s outbound queue). Thus the short-term
queue demand on the gateway is increasing exponentially and opening a window of size W packets
will require W/2 packets of buffer capacity at the bottleneck.
to random load shocks and prone to congestive collapse5 )
can be stabilized by adding some exponential damping (ex-
ponential timer backoff) to its primary excitation (senders,
traffic sources).
3 Adapting to the path: congestion
avoidance
If the timers are in good shape, it is possible to state with
some confidence that a timeout indicates a lost packet and not
5 The phrase congestion collapse (describing a positive feedback instability
due to poor retransmit timers) is again the coinage of John Nagle, this
time from [22].
a broken timer. At this point, something can be done about
(3). Packets get lost for two reasons: they are damaged in
transit, or the network is congested and somewhere on the
path there was insufficient buffer capacity. On most network
paths, loss due to damage is rare (≪ 1%), so it is probable that
a packet loss is due to congestion in the network.6
6 Because a packet loss empties the window, the throughput of any window
flow control protocol is quite sensitive to damage loss. For an RFC 793
standard TCP running with window w (where w is at most the bandwidth-
delay product), a loss probability of p degrades throughput by a factor of
(1 + 2pw)⁻¹. E.g., a 1% damage loss rate on an Arpanet path (8 packet
window) degrades TCP throughput by 14%.
The congestion control scheme we propose is insensitive to damage loss
until the loss rate is on the order of the window equilibration length (the
number of packets it takes the window to regain its original size after a loss).
If the pre-loss size is w, equilibration takes roughly w²/3 packets so, for the
Arpanet, the loss sensitivity threshold is about 5%. At this high loss rate, the
empty window effect described above has already degraded throughput by
44% and the additional degradation from the congestion avoidance window
shrinking is the least of one’s problems.
We are concerned that the congestion control noise sensitivity is quadratic
in w, but it will take at least another generation of network evolution to
reach window sizes where this will be significant. If experience shows this
sensitivity to be a liability, a trivial modification to the algorithm makes it
linear in w. An in-progress paper explores this subject in detail.
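A quick arithmetic check of the figures quoted in footnote 6 above (our own calculation, not part of the original):

\[
  w = 8,\ p = 0.01:\quad 1 - (1 + 2pw)^{-1} = 1 - \tfrac{1}{1.16} \approx 14\%,
  \qquad
  p = 0.05:\quad 1 - \tfrac{1}{1.8} \approx 44\%,
\]
\[
  \text{equilibration length} \approx \tfrac{w^2}{3} = \tfrac{64}{3} \approx 21 \text{ packets}
  \;\Longrightarrow\; \text{loss-sensitivity threshold} \approx \tfrac{1}{21} \approx 5\%.
\]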
A ‘congestion avoidance’ strategy, such as the one pro-
posed in [14], will have two components: The network must
be able to signal the transport endpoints that congestion is
occurring (or about to occur). And the endpoints must have a
policy that decreases utilization if this signal is received and
increases utilization if the signal isn’t received.
If packet loss is (almost) always due to congestion and
if a timeout is (almost) always due to a lost packet, we have
a good candidate for the ‘network is congested’ signal. Par-
ticularly since this signal is delivered automatically by all
existing networks, without special modification (as opposed
to [14] which requires a new bit in the packet headers and a
modification to all existing gateways to set this bit).
The other part of a congestion avoidance strategy, the
endnode action, is almost identical in the DEC/ISO scheme and
our TCP7 and follows directly from a first-order time-series
model of the network:8 Say network load is measured by
average queue length over fixed intervals of some appropriate
length (something near the round trip time). If L_i is the load
at interval i, an uncongested network can be modeled by
saying L_i changes slowly compared to the sampling time.
I.e.,
L_i = N
(N constant). If the network is subject to congestion, this
zeroth order model breaks down. The average queue length
becomes the sum of two terms, the N above that accounts
for the average arrival rate of new traffic and intrinsic delay,
and a new term that accounts for the fraction of traffic left
over from the last time interval and the effect of this left-over
traffic (e.g., induced retransmits):
L_i = N + γ L_{i−1}
(These are the first two terms in a Taylor series expansion of
L(t). There is reason to believe one might eventually need a
three-term, second-order model, but not until the Internet has
grown substantially.)
7 This is not an accident: We copied Jain’s scheme after hearing his
presentation at [9] and realizing that the scheme was, in a sense, universal.
8 See any good control theory text for the relationship between a system
model and admissible controls for that system. A nice introduction appears
in [20], chap. 8.
When the network is congested, γ must be large and
the queue lengths will start increasing exponentially.9 The
system will stabilize only if the traffic sources throttle back
at least as quickly as the queues are growing. Since a source
controls load in a window-based protocol by adjusting the
size of the window, W, we end up with the sender policy
On congestion:
W_i = d W_{i−1}    (d < 1)
I.e., a multiplicative decrease of the window size (which be-
comes an exponential decrease over time if the congestion
persists).
If there’s no congestion, γ must be near zero and the
load approximately constant. The network announces, via a
dropped packet, when demand is excessive but says nothing
if a connection is using less than its fair share (since the net-
work is stateless, it cannot know this). Thus a connection has
to increase its bandwidth utilization to find out the current
limit. E.g., you could have been sharing the path with some-
one else and converged to a window that gives you each half
the available bandwidth. If she shuts down, 50% of the band-
width will be wasted unless your window size is increased.
What should the increase policy be?
The first thought is to use a symmetric, multiplicative in-
crease, possibly with a longer time constant, W_i = bW_{i−1},
1 < b ≤ 1/d. This is a mistake. The result will oscillate
wildly and, on the average, deliver poor throughput. The an-
alytic reason for this has to do with the fact that it is easy
to drive the net into saturation but hard for the net to recover
(what [17], chap. 2.1, calls the rush-hour effect).10 Thus
9 I.e., the system behaves like L_i ≈ γ L_{i−1}, a difference equation with
the solution
L_n = γⁿ L_0
which goes exponentially to infinity for any γ > 1.
10In fig. 1, note that the ‘pipesize’ is 16 packets, 8 in each path, but the
sender is using a window of 22 packets. The six excess packets will form
a queue at the entry to the bottleneck and that queue cannot shrink, even
though the sender carefully clocks out packets at the bottleneck link rate.
This stable queue is another, unfortunate, aspect of conservation: The queue
would shrink only if the gateway could move packets into the skinny pipe
faster than the sender dumped packets into the fat pipe. But the system tunes
itself so each time the gateway pulls a packet off the front of its queue, the
sender lays a new packet on the end.
A gateway needs excess output capacity (i.e., ρ < 1) to dissipate a queue
and the clearing time will scale like (1 − ρ)⁻² ([17], chap. 2 is an excellent
discussion of this). Since at equilibrium our transport connection ‘wants’ to
run the bottleneck link at 100% (ρ = 1), we have to be sure that during the
non-equilibrium window adjustment, our control policy allows the gateway
enough free bandwidth to dissipate queues that inevitably form due to path
testing and traffic fluctuations. By an argument similar to the one used to
show exponential timer backoff is necessary, it’s possible to show that an
exponential (multiplicative) window increase policy will be ‘faster’ than the
dissipation time for some traffic mix and, thus, leads to an unbounded growth
of the bottleneck queue.
overestimating the available bandwidth is costly. But an ex-
ponential, almost regardless of its time constant, increases so
quickly that overestimates are inevitable.
Without justification, we’ll state that the best increase pol-
icy is to make small, constant changes to the window size:11
On no congestion:
W_i = W_{i−1} + u    (u ≪ W_max)
where W_max is the pipesize (the delay–bandwidth product of
the path minus protocol overhead — i.e., the largest sensible
window for the unloaded path). This is the additive increase
/ multiplicative decrease policy suggested in [14] and the
policy we’ve implemented in TCP. The only difference be-
tween the two implementations is the choice of constants for
d andu. We used 0.5 and 1 for reasons partially explained
in appendix D. A more complete analysis is in yet another
in-progress paper.
The preceding has probably made the congestion control
algorithm sound hairy but it’s not. Like slow-start, it’s three
lines of code:
• On any timeout, set cwnd to half the current window
size (this is the multiplicative decrease).
• On each ack for new data, increase cwnd by 1/cwnd
(this is the additive increase).12
• When sending, send the minimum of the receiver’s
advertised window and cwnd.
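A sketch of these three rules in Python, with the window kept in packets for clarity (4.3BSD keeps cwnd in bytes and increments by maxseg*maxseg/cwnd, as footnote 12 explains); this illustrates only the stated rules and omits the slow-start restart discussed next.

class CongestionAvoidanceSender:
    """Additive-increase / multiplicative-decrease window adjustment (d = 0.5, u = 1)."""

    def __init__(self, receiver_window: float):
        self.awnd = receiver_window        # receiver's advertised window (packets)
        self.cwnd = 1.0                    # congestion window (packets, fractional)

    def on_timeout(self) -> None:
        self.cwnd = max(1.0, self.cwnd / 2)    # multiplicative decrease

    def on_ack_for_new_data(self) -> None:
        self.cwnd += 1.0 / self.cwnd           # additive increase: about one packet per RTT

    def send_window(self) -> int:
        return int(min(self.awnd, self.cwnd))  # never exceed the advertised window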
Note that this algorithm is only congestion avoidance; it
doesn’t include the previously described slow-start. Since
the packet loss that signals congestion will result in a re-start,
it will almost certainly be necessary to slow-start in addition
to the above. But, because both congestion avoidance and
slow-start are triggered by a timeout and both manipulate
the congestion window, they are frequently confused. They
are actually independent algorithms with completely different
objectives. To emphasize the difference, the two algorithms
11See [3] for a complete analysis of these increase and decrease policies.
Also see [7] and [8] for a control-theoretic analysis of a similar class of
control policies.
12 This increment rule may be less than obvious. We want to increase the
window by at most one packet over a time interval of length R (the round
trip time). To make the algorithm ‘self-clocked’, it’s better to increment
by a small amount on each ack rather than by a large amount at the end of
the interval. (Assuming, of course, the sender has effective silly window
avoidance (see [4], section 3) and doesn’t attempt to send packet fragments
because of the fractionally sized window.) A window of size cwnd packets
will generate at most cwnd acks in one R. Thus an increment of 1/cwnd
per ack will increase the window by at most one packet in one R. In
TCP, windows and packet sizes are in bytes so the increment translates to
maxseg*maxseg/cwnd where maxseg is the maximum segment size and cwnd
is expressed in bytes, not packets.
have been presented separately even though in practise they
should be implemented together. Appendix B describes a
combined slow-start/congestion avoidance algorithm.13
Figures 7 through 12 show the behavior of TCP connec-
tions with and without congestion avoidance. Although the
test conditions (e.g., 16 KB windows) were deliberately cho-
sen to stimulate congestion, the test scenario isn’t far from
common practice: The Arpanet IMP end-to-end protocol al-
lows at most eight packets in transit between any pair of
gateways. The default 4.3BSD window size is eight packets
(4 KB). Thus simultaneous conversations between, say, any
two hosts at Berkeley and any two hosts at MIT would exceed
the network capacity of the UCB–MIT IMP path and would
lead14 to the type of behavior shown.
4 Future work: the gateway side of
congestion control
While algorithms at the transport endpoints can insure the net-
work capacity isn’t exceeded, they cannot insure fair sharing
of that capacity. Only in gateways, at the convergence of
flows, is there enough information to control sharing and fair
allocation. Thus, we view the gateway ‘congestion detection’
algorithm as the next big step.
The goal of this algorithm is to send a signal to the endnodes
as early as possible, but not so early that the gateway becomes
13 We have also developed a rate-based variant of the congestion avoid-
ance algorithm to apply to connectionless traffic (e.g., domain server queries,
RPC requests). Remembering that the goal of the increase and decrease poli-
cies is bandwidth adjustment, and that ‘time’ (the controlled parameter in a
rate-based scheme) appears in the denominator of bandwidth, the algorithm
follows immediately: The multiplicative decrease remains a multiplica-
tive decrease (e.g., double the interval between packets). But subtracting
a constant amount from the interval does not result in an additive increase in
bandwidth. This approach has been tried, e.g., [18] and [24], and appears
to oscillate badly. To see why, note that for an inter-packet interval I and
decrement c, the bandwidth change of a decrease-interval-by-constant pol-
icy is
1/I → 1/(I − c),
a non-linear, and destabilizing, increase.
An update policy that does result in a linear increase of bandwidth over
time is
I_i = α I_{i−1} / (α + I_{i−1})
where I_i is the interval between sends when the ith packet is sent and α is
the desired rate of increase in packets per packet/sec.
We have simulated the above algorithm and it appears to perform well. To
test the predictions of that simulation against reality, we have a cooperative
project with Sun Microsystems to prototype RPC dynamic congestion control
algorithms using NFS as a test-bed (since NFS is known to have congestion
problems yet it would be desirable to have it work over the same range of
networks as TCP).
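A one-line check of the claimed linearity (our derivation, not in the original footnote): taking reciprocals of the update gives

\[
  \frac{1}{I_i} \;=\; \frac{\alpha + I_{i-1}}{\alpha\, I_{i-1}} \;=\; \frac{1}{I_{i-1}} + \frac{1}{\alpha},
\]

so the bandwidth 1/I grows by the constant amount 1/α on each send, i.e., linearly in the number of packets sent.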
14 did lead.
starved for traffic. Since we plan to continue using packet
drops as a congestion signal, gateway ‘self protection’ from
a mis-behaving host should fall-out for free: That host will
simply have most of its packets dropped as the gateway tries
to tell it that it’s using more than its fair share. Thus, like
the endnode algorithm, the gateway algorithm should reduce
congestion even if no endnode is modified to do congestion
avoidance. And nodes that do implement congestion avoid-
ance will get their fair share of bandwidth and a minimum
number of packet drops.
Since congestion grows exponentially, detecting it early
is important. If detected early, small adjustments to the
senders’ windows will cure it. Otherwise massive adjust-
ments are necessary to give the net enough spare capacity
to pump out the backlog. But, given the bursty nature of
traffic, reliable detection is a non-trivial problem. Jain [14]
proposes a scheme based on averaging between queue regen-
eration points. This should yield good burst filtering but we
think it might have convergence problems under high load
or significant second-order dynamics in the traffic.15 We
plan to use some of our earlier work on ARMAX models for round-trip-time/queue-length prediction as the basis of detection. Preliminary results suggest that this approach works
well at high load, is immune to second-order effects in the
traffic and is computationally cheap enough to not slow …
An Empirical Study of Collusion Behavior in the
Maze P2P File-Sharing System
Qiao Lian1, Zheng Zhang1, Mao Yang13, Ben Y. Zhao2, Yafei Dai3, Xiaoming Li3
Microsoft Research Asia, Beijing, China1
U. C. Santa Barbara, Santa Barbara, CA, USA2
Peking University, Beijing, China3
Abstract— Peer-to-peer networks often use incentive policies to
encourage cooperation between nodes. Such systems are generally
susceptible to collusion by groups of users in order to gain unfair
advantages over others. While techniques have been proposed to
combat web spam collusion, there are few measurements of real
collusion in deployed systems. In this paper, we report analysis
and measurement results of user collusion in Maze, a large-scale
peer-to-peer file sharing system with a non-net-zero point-based
incentive policy. We search for colluding behavior by examining
complete user logs, and incrementally refine a set of collusion
detectors to identify common collusion patterns. We find collusion
patterns similar to those found in web spamming. We evaluate
how proposed reputation systems would perform on the Maze
system. Our results can help guide the design of more robust
incentive schemes.
I. INTRODUCTION
File-sharing networks such as Kazaa and Gnutella have
popularized the peer-to-peer (P2P) resource sharing model. In
these networks, selfish users often “free-ride,” or act in their
self interests to exploit the system. Prior research has focused
on the use of incentive systems [11] to encourage sharing
among users. Despite their effectiveness, these incentives are
generally vulnerable to user collusion along with the Sybil
Attack [4], where users create and control multiple online
identities.
Very little is known about user collusion in real systems.
Measuring user collusion is extremely difficult, since it re-
quires a global view of all transactions in a system. To date,
most user studies log traffic at the edge nodes while per-
forming queries or membership operations [13], [12]. While
informative, these results reveal only a small subset of peer
transactions, and do not shed light on active user collusion.
In this paper, we present measurements of user collusion
activity in the Maze peer-to-peer file-sharing network [17].
Maze is a popular Napster-like P2P network designed, im-
plemented and deployed by an academic research team at
Peking University, Beijing China. As a measurement platform,
Maze is unique in two ways. First, our control over the
Maze software allows us to deploy and embed customized
measurement code inside clients. Second, Maze’s centralized
architecture means all control and query traffic is logged
and available to us. Maze uses a simple incentive system
where user points increase with uploads and decrease with
downloads. Our central server audits file transfers and adjusts
user points accordingly.
In our study, we define collusion as collaborative activity
of a subset of users that grants its members benefits they
would not be able to gain as individuals. We note that we
cannot determine a user’s true intent, or definitively whether
multiple online identities belong to the same user. Issues
such as dynamic addressing via DHCP and shared addresses
behind NATs prevent us from reliably detecting the use of
multiple virtual identities. Instead of measuring user intent,
our study focuses purely on observable action patterns that
produce results similar to those produced by colluding users.
Quantifying all forms of collusion is a topic to be addressed
in ongoing work.
To the best of our knowledge, this is the first empirical
study of collusion behavior in an incentive-based P2P system.
Our results validate conclusions of previous work on incentive
mechanisms [3], [8], [7], but also show that certain types of
collusion are difficult to detect and deter. Our work makes
several key contributions. First, we describe techniques to
aggregate logs and construct collusion detectors that pro-
gressively reveal collusion behaviors and their patterns. We
believe that these techniques have general applicability to other
experimental settings when detecting collusions. Second, we
find that while some behaviors are linked to our incentive
system, their patterns are nearly identical to those found in
Web spamming, albeit at smaller scales. Finally, we evaluate
the ability of the EigenTrust reputation system [6], [7] to detect
and mark colluders with lower reputations. Our results show
that under several collusion patterns, EigenTrust produces non-
ideal reputations for both well-behaved and colluding peers.
The rest of the paper is organized as follows. Section 2
gives a brief description of Maze and the measurement data
used for this study. Next, Section 3 describes each of our
collusion detectors in detail, along with results from the Maze
dataset. Then in Section 4, we apply the EigenTrust algorithm
to the Maze dataset and analyze the results. Finally, we discuss
related work in Section 5 and conclude in Section 6.
II. THE MAZE PEER-TO-PEER SYSTEM
We begin by providing necessary background information
on the Maze file-sharing system. Maze was originally deployed
to address issues of data location and load-balancing on FTP
servers as part of the T-Net Web search project [14]. T-Net’s
increasing popularity led to significantly degraded download
performance across its limited number of FTP servers. Maze
Fig. 1. The Maze point system
provided a way to distribute content while minimizing infras-
tructure costs.
At login, each Maze peer1 authenticates to a central server,
and uploads an index of its locally available shared files. The
central server maintains heartbeats with all online peers, and
its central index supports full-text queries across all shared
files in the system. In addition to searching for files via the
central index, users can browse three peer lists for files: a
friend-list, a neighborhood-list, and an altruistic list.
The friend-list is a user-controlled list of friendly peers,
initially bootstrapped as a random set of peers by the central
server. Over time, these friend-lists across users form a con-
tinuously adapting social network. A peer’s neighborhood list
is a set of peers with the same B-class IP address provided by
the server. These are peers likely to share high-bandwidth, low
latency links with the local host. Finally, the altruistic list is
a collection of peers with the highest “points” as determined
by the server. They are hosts who have contributed the most
to the system as determined by the Maze incentive system.
Their status as “celebrities” in the user population provides
additional social incentive for sharing [17].
A peer can browse and download the storage contents of
any host on these lists. As of November 2004, more than two-
thirds of all downloads are initiated through these peer lists.
As will become clear later, these social lists have unexpected
impacts on the design of a good incentive system. After each
transaction, peers involved send a report to the central server,
which adjusts their points accordingly.
A. The Maze incentive system
Maze currently operates using a point system, where peers
consume points by downloading files and earn points by
uploading files. Download requests are queued according to
their points: requestTime − 3 log P, where P is the requester's
point total. Frequent uploads provide peers with higher points
and faster downloads. The Maze user community voted for
an asymmetric point system where uploading receives more
points than downloading the same amount of data. While this
1We use the terms “user” and “peer” interchangeably in this paper. We also
use “clients” of peer X to refer to the peers that download from X.
encourages uploads, it also allows two peers to create a net
gain in points after mutual interaction. To allow new users to
participate, Maze initializes new users with enough points to
download more than 1 GB of data before their downloads are throttled.
The details of the points system are as follows:
1) New users are initialized with 4096 points.
2) Uploads: +1.5 points per MB uploaded.
3) Points used per file downloaded:
• -1.0/MB downloaded within first 100MB.
• -0.7/MB per additional MB between 100MB and
400MB.
• -0.4/MB between 400MB and 800MB.
• -0.1/MB per additional MB over 800MB.
4) Service differentiation:
• Requests are ordered by T = requestTime − 3 log P, where P is the requester's point total.
• Users with P < 512 are limited to 200Kb/s.
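A small sketch of the accounting implied by the rules above (the schedule boundaries and rates are taken from the list; the function names are illustrative, and the base of the logarithm is not specified in the text):

```python
import math

def upload_credit(mb_uploaded):
    """Points earned for an upload: +1.5 per MB (rule 2)."""
    return 1.5 * mb_uploaded

def download_cost(mb_downloaded):
    """Points spent for a download, using the piecewise schedule in rule 3."""
    brackets = [(100, 1.0), (300, 0.7), (400, 0.4), (float("inf"), 0.1)]  # (width in MB, cost/MB)
    cost, remaining = 0.0, mb_downloaded
    for width, rate in brackets:
        step = min(remaining, width)
        cost += step * rate
        remaining -= step
        if remaining <= 0:
            break
    return cost

def queue_priority(request_time, points):
    """Service differentiation (rule 4): requests with smaller T are served first."""
    return request_time - 3 * math.log(points)   # log base unspecified in the paper
```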
B. Data collection
We perform our analysis on a log segment gathered during
the span of a one-month period from 2/19/05 to 3/24/05.
During this period, more than 161,000 users participated in
more than 32 million file transfers totaling more than 437
Terabytes. Data gathered for this study includes user point
values and the detailed traffic log. Each log entry contains
the following: uploading peer-id, downloading peer-id, upload
time, transfer start and end times (source), bytes transferred,
file size, downloader IP, file MD5 hash, and full file path.
We tried to associate online identities with the physical
machine the peer uses to detect when a user was controlling
multiple identities. We first used the hash of the hard drive
serial number, but later discovered that the serial number is
not guaranteed to be unique. Thus to uniquely identify the
machine that a peer uses, we concatenate the peer’s IP address
with the hash of the hard drive serial number. As ongoing
work, we are investigating the use of network MAC addresses
as an alternative identifier.
All logs are anonymized for user privacy. In this paper, we
refer to distinct users using common names (e.g. Alice and
Bob), and random alphabetic letters to represent 8-bit blocks
of an IP address (e.g. C.H.97.140).
III. IDENTIFYING COLLUSION TOPOLOGIES
We now discuss our efforts to detect collusion attempts in
the Maze system. Based on our experiences and analysis of the
traffic logs, we design a number of collusion detectors aimed
at locating different types of collusion patterns. We describe
these in detail in this section.
A. Repetition-based Collusion Detection
Our first attempt starts by looking at how users use uploads
to generate points. Given the point system generates a net gain
from a symmetric operation, colluders can benefit from using
only a small “working set” of files to generate points. We use
this assumption to generate our first collusion detector.
Fig. 2. Duplication degree in uploading from peer A to peer B
Detector 1: (Repetition detector) Colluders generate large
amounts of upload traffic with repeated content.
We examine our transaction log and construct a large graph,
where vertices represent individual users and edges represent
aggregated file transfers between users. This results in a
directed graph with roughly 4.5 million individual edges. Out
of all edges, 221,000 (roughly 4.9%) contain duplicate files in
the transfer traffic. We define duplication degree as the ratio
of total upload traffic (bytes) to amount of non-duplicated data
(bytes). A high duplication degree means a small proportion
of traffic across the link is unique.
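A sketch of Detector 1 under the definition above; the log-record fields mirror those listed in Section II-B, and the threshold values are the ones quoted in the text:

```python
from collections import defaultdict

def duplication_degrees(log_entries):
    """log_entries: iterable of (uploader, downloader, file_md5, bytes_transferred).
    Returns the duplication degree of each directed edge (uploader -> downloader)."""
    total = defaultdict(int)            # total bytes carried by the edge
    unique = defaultdict(dict)          # bytes of each distinct file on the edge
    for up, down, md5, nbytes in log_entries:
        edge = (up, down)
        total[edge] += nbytes
        unique[edge][md5] = nbytes      # each distinct file counted once
    return {e: total[e] / sum(unique[e].values()) for e in total}

def repetition_detector(log_entries, threshold=10):
    """Flag edges whose duplication degree exceeds a threshold (e.g., 5, 10, or 20)."""
    return [e for e, deg in duplication_degrees(log_entries).items() if deg > threshold]
```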
We plot the duplication degree of all edges against their
total upload traffic as a scatter plot in Figure 2. For thresholds
set at duplication degrees of 5, 10 and 20, there are 890, 148,
27 edges with duplication degree greater than each threshold.
Since this data is generated from activities performed over the
span of a month, it is highly likely that many of these peers
are actively colluding. We also note that colluders are likely to
use nearby machines to perform the transfers. Such network
locality maximizes throughput and gain from collusion. We
show this in Figure 2 by classifying edges by the IP affinity
between the two peers. IP affinity confirms that edges with
high amounts of duplicate traffic are likely to be across peers
with similar IP addresses.
To better understand this behavior, we take a closer look
at the temporal distribution of duplicate traffic by individual
users. Table I lists the top-6 edges with the most duplicate
traffic. The table shows each user’s total uploads, and uploads
on the edge with the most repeat traffic (max edge). Each
table entry also includes a temporal locality graph. Each bar
stands for one day, and the height of the bar is proportional
to that day’s upload traffic. These results show that there is
strong temporal locality present. If the same file is uploaded
multiple times close in time, then it is more likely to be used
as a colluding tool than legitimate sharing.
TABLE I
TOP 6 EDGES WITH THE MOST REDUNDANT TRAFFIC
Fig. 3. Collusion link topology of 100 links with the highest ratio of duplicate
transfers
The temporal locality provides strong evidence that all
5 of these peers are colluding aggressively. The maximum
duplication degree is close to 43 by Cindy. David colludes with
two different peers, with non-overlapping temporal behavior.
Our data shows that the data transferred during these colluding
sessions are generally large files or directories. For example,
Alice uploaded a 3GB DVD image 29 times.
To better understand collusion topologies, we built a vi-
sualization tool that draws edges with the highest ratios of
duplicate traffic. Figure 3 gives a snapshot of the top-100
duplicate traffic links. This figure shows collusion patterns
graphically. In addition to pair-wise collusions that exploit
the net-gain in points from transfers, we also observe more
complex 3-party and star-shaped topologies. We examine these
complex collusion behaviors in detail in the next two sections.
B. Group-based Collusion Detection
We now turn our attention to mutually colluding peers.
In Maze, group collusion occurs when peers exchange large
amounts of data among themselves to earn points. This is a
consequence of the asymmetric point assignment in Maze. If
Fig. 4. Pair-wise collusion detector by the ratio of mutual upload traffic over
total traffic
two peers upload 10GB data to each other, each of them will
acquire at least 5000 points. The asymmetric point system was
a result of extensive discussions and voting on the Maze user
forum, where users wanted to encourage uploading.
Our data shows that most group collusion occurs in pairs, and groups of three or more are rare. Intuitively, the traffic pattern for a pair-wise colluding group is one where the traffic between two peers outweighs their traffic with outside peers. Note that
the total traffic for a peer measured through our log segment is
a rough granularity approximation of a peer’s download/upload
rate. We define the metric pair-wise degree as the ratio of total
traffic between two peers to the sum of all traffic uploaded by
both peers. We use this as our first group collusion detector:
Detector 2: (Pair-wise detector) high rate of mutual upload
traffic compared to total uploads.
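A sketch of Detector 2 using the pair-wise degree defined above (mutual traffic over the sum of all uploads by both peers); the 0.5 cut-off is the line plotted in Figure 4:

```python
from collections import defaultdict

def pairwise_degrees(log_entries):
    """log_entries: iterable of (uploader, downloader, bytes_transferred)."""
    uploads = defaultdict(int)          # total bytes uploaded by each peer
    mutual = defaultdict(int)           # bytes exchanged within each unordered pair
    for up, down, nbytes in log_entries:
        uploads[up] += nbytes
        mutual[frozenset((up, down))] += nbytes
    return {pair: traffic / sum(uploads[p] for p in pair)
            for pair, traffic in mutual.items() if len(pair) == 2}

def pairwise_detector(log_entries, cutoff=0.5):
    """Flag pairs whose mutual traffic exceeds their combined uploads to everyone else."""
    return [pair for pair, deg in pairwise_degrees(log_entries).items() if deg > cutoff]
```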
Figure 4 shows the results of applying this detector to
our dataset. There are 28,000 pairs of peers with mutual
uploads, each plotted as a single point with total uploads on
the x-axis and pair-wise degree on the y-axis. A horizontal
line is plotted for pair-wise degree equal to 0.5. Above that
line are 73 pairs of peers whose pair-wise upload exceeds
uploads to external peers. While it is possible for two friends
to share large amounts of mutually interesting content, the
highly concentrated nature of these uploads appears to indicate
collusion. Regardless of whether the users intend to collude,
this type of behavior results in artificially inflated point values
for peers who are not contributing to the community at large.
One impediment to effective collusion of any kind is con-
nectivity. It is tedious to transfer large amounts of data through
a narrow pipe just for collusion. While pair-wise collusion
across the wide-area is possible, signing on as a new user
(whitewashing [18]) is an easier alternative to replenish a
peer’s points. Thus, we assert that pair-wise collusion requires
good connectivity between peers. To verify this, we show the
IP affinity of peer pairs in Figure 4 with different symbols.
We see that most colluding peers have similar IP addresses,
TABLE II
TOP-3 MUTUAL UPLOAD PAIRS. PEER 1/EXTERNAL SHOWS ALL TRAFFIC FROM PEER 1 NOT GOING TO 2.
Peer 1 external | Peer 1 | Mutual upload   | Peer 2 | Peer 2 external
1.7GB           | Fred   | 24GB ←→ 23GB   | Gary   | 5GB
23GB            | Cindy  | 81GB ←→ 27GB   | Harry  | 0GB
52GB            | David  | 62GB ←→ 126GB  | Alice  | 32GB
implying that the peers are likely physically close in the
network and likely to be connected using a high bandwidth
connection.
We take a closer look at specific examples of possible
colluding peer-pairs. Table II lists the top 3 pairs ranked by
pair-wise degree. Given the asymmetric point system, even the
most unbalanced peer (Harry) will end up with a net point gain
(after uploading 27GB and then downloading 81GB).
C. Spam Account Collusion
In the repetition-based topology of Figure 3, we observed
an unexpected colluding topology: the star-shaped colluding
an unexpected colluding topology: the star-shaped colluding
group. The center of the star gains points by uploading to
many other peers without downloading anything in return. This
seems contrary to behavior observed in our previous colluding
patterns. The willingness of these “leaf peers” to download
duplicate content from the central peer without personal gain
is puzzling.
In reality, all peers in this topology are controlled by the
same user, and all peers collaborate to increase the point total
of the center peer. We call these leaf-peers spam accounts,
since they are created and then discarded when they expend
their initial point allocations2. This strategy is similar to the
link spam [16] problem in search engines using page rank
to sort results. One might ask why a user would prefer
this strategy to performing whitewashing (restarting with new
identities). One explanation is that users need to maintain
one persistent primary account either for social status from
higher point values or to maintain a persistent friends-list. This
strategy is also efficient because it earns points much faster
than pair-wise collusion for the same amount of traffic. For this
to work, however, the colluder must use multiple machines.
There are 4 star-shaped topologies caught by the repetition
detector in Figure 3. Ted has fan-out of 8, Mary and Sam
have fan-out of 4, and Ingrid has fan-out of 3. We take a
closer look at them in Table III. Except for Ted’s group, there
is generally strong IP address proximity between the center
peer and its leaf-peers. All of this indicates a high likelihood
of collusion. Peer Ted, however, turns out to be the Maze user
with the highest uploads of the month (3.8TB). Since its 8 edges carrying duplicate traffic show very little IP address similarity, Ted is likely not a colluder.
While zero-cost identities are easy to generate, physically
separate machines are expensive to obtain. This means spam
2Note that spam accounts can be produced by performing the Sybil [4]
attack.
TABLE III
PEERS SUSPECTED OF SPAM ACCOUNT COLLUDING, AS FOUND BY THE REPETITION DETECTOR
accounts can be many in number, but are likely to reside
on few machines. We define the PM ratio as the ratio of
number of peers to number of machines to describe how
densely a peer’s clients are distributed across different physical
machines.
Detector 3: (Spam account detector) high Peer to Machine
ratio can indicate spam account collusion.
We use the method described in Section II-B to associate a
peer with its machine. One problem with the PM value is the
signal to noise ratio. A single upload to a random peer counts
as an additional peer-machine pair, and significantly reduces
the PM value. We remove these noisy values by discarding the smallest uploads that, in aggregate, hold less than 20% of all upload traffic.
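A sketch of Detector 3; the machine identifier below stands in for the IP-plus-disk-serial hash described in Section II-B, and the 20% noise filter follows the paragraph above (the helper names and thresholds other than those quoted in the text are illustrative):

```python
def pm_ratio(client_records, noise_fraction=0.2):
    """client_records: list of (client_peer_id, client_machine_id, bytes_downloaded)
    for one uploader. Returns the peer-to-machine ratio after dropping the smallest
    clients that together carry less than noise_fraction of the traffic."""
    records = sorted(client_records, key=lambda r: r[2])          # smallest uploads first
    cutoff = noise_fraction * sum(r[2] for r in records)
    dropped, kept = 0, []
    for peer, machine, nbytes in records:
        if dropped + nbytes < cutoff:
            dropped += nbytes                                     # discard noisy small uploads
        else:
            kept.append((peer, machine))
    peers = {p for p, _ in kept}
    machines = {m for _, m in kept}
    return len(peers) / max(1, len(machines))

def spam_account_detector(uploads_by_peer, ratio_threshold=3, min_upload_gb=10):
    """uploads_by_peer: {uploader: (total_upload_gb, client_records)}.
    Flags uploaders with large uploads and a high PM ratio (cf. Table IV)."""
    return [peer for peer, (total_gb, recs) in uploads_by_peer.items()
            if total_gb > min_upload_gb and pm_ratio(recs) > ratio_threshold]
```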
Figure 5 plots all peers as points with the total uploads
made by the peer on the X-axis, and the peer’s PM ratio
on the Y-axis. Most peers have PM ratio slightly above 1
and below 2. This is statistically normal because many users
perform whitewashing. However, a number of peers have
exceptionally high PM ratios (up to 7). These peers are likely
machines generating new user accounts to help a peer collude.
Table IV lists the peers whose upload > 10GB and have
PM ratio > 3. The temporal column shows when each client
generated its peak loads of Maze traffic. Consistent temporal
collisions between virtual nodes on the same machine may
signal collusion.
We check for three additional properties of likely colluders.
First, we use IP address proximity to infer whether these
accounts have good connectivity to the center peer as expected.
Fig. 5. Spam account detection by PM ratio
Second, we check if these accounts only download data
from the center peer. Finally, we examine whether the spam
accounts perform a large amount of downloads in a relatively
short life-span. All of these heuristics confirm the likelihood
that these peers are colluding. Most spam accounts live on the
same machine; they generally download exclusively from the
center peer; and they are only active for short life spans (1-2
days).
D. Upload Traffic Concentration
Pair-wise colluding and spam account colluding share one
common trait: a high volume of uploads to a small number
of machines. We now shift our focus from the flow among
peers to among physical machines. We define the traffic
concentration degree (TC degree), as the ratio of a peer’s
highest upload traffic to a single machine to his total upload
traffic. For instance, if X uploads to 10 clients for a total of
100GB, and with 90GB going to one machine, then the TC
degree of X is 0.9. The higher the TC degree, the more likely
that the peer is performing either pair-wise or spam-account
colluding.
Detector 4: (Traffic concentration detector) peers with
exceptionally high TC degree.
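A sketch of Detector 4 using the TC-degree definition above; the 50GB and 0.6 thresholds are the ones used later in the text:

```python
from collections import defaultdict

def tc_degree(uploads):
    """uploads: list of (destination_machine_id, bytes) for one uploader.
    TC degree = largest per-machine share of the peer's total upload traffic."""
    per_machine = defaultdict(int)
    for machine, nbytes in uploads:
        per_machine[machine] += nbytes
    total = sum(per_machine.values())
    return max(per_machine.values()) / total if total else 0.0

def tc_detector(uploads_by_peer, min_upload_bytes=50 * 2**30, degree_threshold=0.6):
    """Flag peers that upload a lot yet concentrate most of it on one machine."""
    return [peer for peer, uploads in uploads_by_peer.items()
            if sum(b for _, b in uploads) > min_upload_bytes
            and tc_degree(uploads) > degree_threshold]
```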
We plot the results from this detector in Figure 6 where we
plot the total upload of each peer on the X-axis and its TC
degree on the Y-axis. For non-colluding users, we expect an
increase in upload traffic to correlate with more destinations
and thus a lower TC degree. Figure 6 confirms this in our data
set. Most peers who upload around 10GB have TC degrees
around 10%. For heavy uploaders who upload around 1TB,
the TC degree drops to about 1%.
We identify a number of potential colluders who have TC
degrees significantly higher than their peers. For example,
seven peers have uploaded more than 50GB and have TC
degrees higher than 0.6. This means that more than 60\% of
the 50GB uploaded is going to a single machine. We list these
TABLE IV
SOME TOP SPAM ACCOUNT COLLUDERS
TABLE V
TOP 7 COLLUDERS DETECTED BY TC DETECTOR
peers and their traffic in Table V. Six of these have been
detected by previous detectors. The pair-wise detector missed
the new peer Nancy because it has no pair-wise traffic with any
other peer. The spam account detector missed Nancy because
it mainly uploads to only one peer and its PM ratio is close
to one. It turns out that it ranks #7 by the repetition detector
(we listed only the top 6 in Section 3.1).
E. Detectors Compared
After presenting four different collusion detectors, we sum-
marize the top colluders discussed in earlier sections in Fig-
ure 7, and graphically show how they were detected by each
Fig. 6. Upload traffic concentration detector result
TABLE VI
SEVEN TOP COLLUDERS AND HOW OUR DETECTORS HAVE FOUND AND MISSED THEM
of our four detectors. Table VI lists the top seven colluders
according to their total upload traffic. The detectors responsible for detecting each colluder are shown in shaded cells. Our
first observation is that spam-account and pair-wise colluders
do not overlap. This is logical, because the two detectors are
designed specifically with these two patterns in mind. While
colluders can engage in both activities simultaneously to evade
potential detectors, the fact that spam-account colluding is
more “cost effective” and that these detectors were applied
to logs after the fact, leads us to believe that this does not
happen in practice.
As discussed earlier, the traffic concentration detector relies
on the observation that colluders generally control relatively
few machines. Thus it examines how a peer’s upload traffic
is spread across its partners. Figure 7 shows that this detector
detects the majority of both spam-account and pair-wise col-
luders. We now examine the colluders that TC degree failed
to detect. With the exception of “Bob,” who was missed due
to a low TC degree (0.39), the other colluders it missed all
have high TC degrees (>0.8), and avoided detection only
because their total upload traffic was low. Lowering our traffic
TABLE VII
SUMMARY OF STRENGTHS AND WEAKNESSES OF COLLUSION DETECTORS.
Detector              | Heuristic                                   | Strength | Weakness
Repetition            | Colluders use a small colluding working set | General  | Cannot detect randomized colluding group
Pair-wise             | Uploads: pair-wise > external               | Accurate | Limited to pair-wise topologies
Spam                  | One peer uploads to many                    | Accurate | Limited to spam collusion topologies
Traffic concentration | Colluder controls few hosts                 | General  | Difficult to optimize parameters
Fig. 7. Venn diagram of collusion detectors
threshold from the current value of 50GB would allow us
to detect all of our colluders, showing that the TC detector
is in fact quite effective. However, because we still cannot
guarantee perfect results when mapping peers to specific
machines (in the presence of DHCP and NATs), we cannot
yet deploy the TC detector as an online collusion detector.
Finally, we are still investigating how to choose reasonable
traffic thresholds to avoid false positives.
Table VI shows that the repetition detector misses six colluders. All six peers have too little redundant traffic on any single link to be noticed. For example, peers Larry and Jane have a lot of collusion traffic, but their traffic is scattered across multiple upload links. The repetition can only be found if we aggregate multiple links together (Figure 3). The pair-wise detector also missed four colluders. Two of the four have only a small amount of mutual upload traffic with other peers; the other two have no mutual uploads with any peer at all. The spam account detector missed five colluders, all of whom evaded it because they have very low PM ratios.
Figure 7 shows that the repetition detector also works quite
well. However, the reason that it works at all is because the
current version of Maze has no explicit defense mechanism
against collusion. This detector can easily be circumvented
by a colluder if it simply modifies the content slightly, even
by just flipping one single bit. Also, differentiating legitimate
repeated downloads from colluders will be a challenging task.
For example, peers could lose their local cache and be required
Fig. 8. The basic EigenTrust algorithm for computing trust.
to repeat previous downloads. We have used this detector in the study to lead the way to other, more robust detectors, taking advantage of the fact that colluders today do not bother to cover their tracks by randomizing their colluding working sets. We summarize the four detectors in Table VII.
IV. EIGENTRUST AND COLLUSION
Our access to a complete view of all transactions in Maze
gives us a unique opportunity to evaluate how proposed
reputation systems from research would perform on real world
systems. In this section, we will evaluate EigenTrust [6], a
well-known reputation system that generates a global ranking.
The EigenTrust ranking can be used for both reputation man-
agement [6] (clients choose trustworthy download sources),
and free-rider detection [7] (uploaders choose trustworthy
clients). Ideally, the ranking algorithm would assign low scores
to malicious colluders.
A. An Overview of EigenTrust
We first give a high level description of the EigenTrust
system [6]. EigenTrust calculates global trust values for all peers in a peer-to-peer file-sharing system using power iteration. The algorithm is similar to the PageRank algorithm.
First, peer i assigns peer j trust values Cij based on its
downloading experience from j. Trust values for all j are
normalized locally by each peer i. We obtain from this a
matrix C containing a measure of trust across all peer pairs in
the network. The trust vector t is defined as the left principal
eigenvector of C. The component ti is called the EigenRank
of peer i, and represents the peer’s global reputation.
The basic algorithm can be further improved to enhance its
robustness against malicious users. This is done by incorpo-
rating the notion of pre-trusted peers in the set P . So, for
peer i, we define pi = 1/|P | if i ∈ P , and pi = 0 otherwise.
The algorithm is summarized in Figure 8. Parameter α is a
constant less than 1 that is used to control the level of trust
Fig. 9. The EigenRank of peers
each peer places on the pre-trusted peers. A higher value of
α implies more confidence on the pre-trusted peers.
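A sketch of one common formulation of the iteration summarized in Figure 8 (the figure itself is not reproduced in this excerpt, so the stopping rule and the default value of α below are assumptions; numpy is used for the linear algebra):

```python
import numpy as np

def eigentrust(C, pretrusted, alpha=0.15, tol=1e-10, max_iter=1000):
    """C[i, j]: peer i's normalized local trust in peer j (rows sum to 1).
    pretrusted: indices of pre-trusted peers. Returns the global trust vector t."""
    n = C.shape[0]
    p = np.zeros(n)
    p[list(pretrusted)] = 1.0 / len(pretrusted)     # p_i = 1/|P| for pre-trusted peers
    t = p.copy()                                    # start from the pre-trust distribution
    for _ in range(max_iter):
        t_next = (1 - alpha) * C.T @ t + alpha * p  # higher alpha => more weight on pre-trust
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t
```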
B. Applying EigenTrust to Maze
We map the EigenTrust algorithm to the Maze system as follows. First, we define the trust value c_ij to be proportional to the total traffic peer i downloads from peer j during the log period. We then normalize the local trust values so that Σ_{j=1}^{N} c_ij = 1 for each peer i. Next, we select 10 pre-trusted peers from
the Maze population by choosing ten well-known users who
we have …
Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

Dah-Ming CHIU and Raj JAIN
Digital Equipment Corporation, 550 King Street (LKG1-2/A19), Littleton, MA 01460-1289, U.S.A.
Abstract. Congestion avoidance mechanisms allow a network to operate in the optimal region of low delay and high throughput, thereby preventing the network from becoming congested. This is different from the traditional congestion control mechanisms that allow the network to recover from the congested state of high delay and low throughput. Both congestion avoidance and congestion control mechanisms are basically resource management problems. They can be formulated as system control problems in which the system senses its state and feeds this back to its users, who adjust their controls.

The key component of any congestion avoidance scheme is the algorithm (or control function) used by the users to increase or decrease their load (window or rate). We abstractly characterize a wide class of such increase/decrease algorithms and compare them using several different performance metrics. The key metrics are efficiency, fairness, convergence time, and size of oscillations.

It is shown that a simple additive increase and multiplicative decrease algorithm satisfies the sufficient conditions for convergence to an efficient and fair state regardless of the starting state of the network. This is the algorithm finally chosen for implementation in the congestion avoidance scheme recommended for Digital Networking Architecture and OSI Transport Class 4 Networks.

Keywords. Computer Network, Network Performance, Resource Management, Congestion Control, Congestion Avoidance, Flow Control, Fairness.
North-Holland
Computer Networks and ISDN Systems 17 (1989) 1–14
1. Introduction

1.1. Background

Congestion in computer networks is becoming an important issue due to the increasing mismatch in link speeds caused by intermixing of old and new technology. Recent technological advances
Dah-Ming Chiu received the B.Sc. degree with first class honours from Imperial College of Science and Technology, London University, in 1975, and the M.S. and Ph.D. degrees from Harvard University, Cambridge, MA, in 1976 and 1980 respectively. From 1979 to 1980, he was with AT&T Bell Laboratories, where he worked on applying queuing theory to network modeling. Since 1980, he has been with Digital Equipment Corporation, where he has worked on performance modeling and analysis of computer systems and networks. Currently, he is a member of the Distributed Systems Architecture group, where he is working on the name service architecture and analyzing various distributed algorithms. His research interests include operating systems, distributed systems, computer networks, performance analysis, and optimization theories. Dr. Chiu is a member of the ACM and the IEEE.

Raj Jain received the B.E. degree from A.P.S. University, Rewa, India, the M.E. degree from Indian Institute of Science, Bangalore, India, and the Ph.D. degree from Harvard University, Cambridge, MA, in 1972, 1974, and 1978, respectively. His Ph.D. dissertation, entitled Control Theoretic Formulation of Operating Systems Resource Management Policies, was published by Garland Publishing, Inc. of New York in their Outstanding Dissertations in the Computer Sciences series. Since 1978, he has been with Digital Equipment Corporation, where he has been involved in performance modeling and analysis of a number of computer systems and networks including VAX Clusters, DECnet, and Ethernet. Currently, he is a Consulting Engineer in the Distributed Systems Architecture and Performance Group. He spent the 1983–1984 academic year on a sabbatical at the Massachusetts Institute of Technology doing research on the performance of networks and local area systems. For three years he also taught a graduate course on computer systems performance techniques at MIT and is writing a textbook on this subject, to be published by Wiley-Interscience. Dr. Jain is a member of the Association for Computing Machinery, and a senior member of the IEEE.
Fig. 1. Network performance as a function of the load (throughput and response time plotted against load, with the knee and cliff marked). Broken curves indicate performance with deterministic service and interarrival times.
such as local area networks (LANs) and fiber
optic LANs have resulted in a significant increase
in the bandwidths of computer network links.
However, these new technologies must coexist with
the old low bandwidth media such as the twisted
pair. This heterogeneity has resulted in a mis-
match of arrival and service rates in the inter-
mediate nodes in the network causing increased
queuing and congestion.
Traditional congestion control schemes help
improve performance after congestion has oc-
curred. Figure 1 shows general patterns of re-
sponse time and throughput of a network as the
network load increases. If the load is small,
throughput generally keeps up with the load. As
the load increases, throughput increases. After the
load reaches the network capacity, throughput
stops increasing. If the load is increased any fur-
ther, the queues start building, potentially result-
ing in packets being dropped. Throughput may
suddenly drop when the load increases beyond
this point and the network is said to be congested.
The response-time curve follows a similar pattern.
At first the response time increases little with
load. When the queues start building up, the re-
sponse time increases linearly until finally, as the
queues start overflowing, the response time in-
creases drastically.
The point at which the packets start getting lost
is called a cliff due to the fact that the throughput
falls off rapidly after this point. We use the term
knee to describe the point after which the increase
in the throughput is small, but when a significant
increase in the response time results.
A scheme that allows the network to operate at
the knee is called a congestion avoidance scheme,
as distinguished from a congestion control scheme
that tries to keep the network operating in the
zone to the left of the cliff. A properly designed
congestion avoidance scheme will ensure that the
users are encouraged to increase their traffic load
as long as this does not significantly affect the
response time, and are required to decrease it
if that happens. Thus, the network load oscillates
around the knee.
Both congestion avoidance and congestion con-
trol mechanisms are dynamic resource manage-
ment problems that can be formulated as system
control problems in which the system senses its
state and feeds this back to its users who adjust
their controls. For the congestion problem, the
state consists of the load on the network and the
control is the number of packets put into the
network by the users. Often a window mechanism
is used in the transport layer protocols to limit the
number of packets put into the network. An alter-
native mechanism consists of setting a limit on the
rate (packets per second or bits per second) that
can be sent by a user. In either case, the control
(window or rate) can be dynamically adjusted as
the total load on the system changes. This control,
which we call the increase/decrease algorithm, is
at the heart of all congestion avoidance mecha-
nisms.
We have investigated a number of congestion
avoidance mechanisms, reported in a series of
papers, and this paper is a part of that series
[7,8,10,11]. The concept of congestion avoidance
and several alternatives are described in [7]. We
chose a particular alternative called the binary
feedback scheme which is described in detail in
[11]. This scheme is later extended in [10] to
include a selective feedback mechanism in which
the routers monitor different users and permit
some users to increase load while requesting others
to decrease load. All of our work on congestion
avoidance is summarized in [8].
This paper concentrates on a detailed analysis of the increase/decrease algorithms. This analysis resulted in the selection of the increase/decrease algorithms used in the binary feedback scheme proposed in [11] and [10]. However, the analysis presented here is general and applies to many other applications besides congestion avoidance.

Briefly, the binary feedback scheme for congestion avoidance operates as follows. The resources in the network monitor their usage and determine if they are loaded below or above an optimal load level. Depending upon the load level, the resource sends a binary feedback (1 = overloaded, 0 = underloaded) to the users, who then adjust their load using an increase/decrease algorithm. This binary feedback is sent by setting a bit in the packet header. The use of a bit in the packet header as a feedback mechanism has been incorporated into the OSI connectionless networking protocol standards [4]. The bit is called a congestion experienced bit and is a part of a field called quality of service in the network layer header.

The abstract model assumes that all the users sharing the same bottleneck will receive the same feedback. Based on this feedback, the users try to adjust their load so that the bottleneck is efficiently used as well as equally shared by all users. In this abstracted context, we assume that the feedback and control loop for all users is synchronous, that is, all users receive the same feedback and react to it; the next feedback is then generated after all users have reacted to the feedback, and so on. Also, we concentrate on one bottleneck resource and the users that share it. Because of these abstractions, we are able to demonstrate some of the subtle behavior of this type of algorithm. The results presented here were verified by detailed simulations of real networks [7,10,11].
1.2. Past Work

The algorithms studied here belong to a class of distributed algorithms for managing distributed resources. A spectrum of such distributed algorithms have been studied in the literature. At one end of the spectrum, we have centralized decision-making. In this paradigm, information (about user demands) flows to the resource managers, and the decision of how to allocate the resource is made at the resource. The analysis by Sanders [12] is a good example of this approach.

At the other end of the spectrum, we have decentralized decision-making. In this case the decisions are made by the users while the resources feed information regarding current resource usage. Algorithms studied by Jaffe [5] and later extensions by Gafni [2] and Mosely [9] are all good examples of this approach.

In this paper we analyze a class of decentralized decision-making algorithms that are based on a special form of feedback, namely, the feedback from the resource is a binary signal. This binary signal indicates whether the resource is currently overloaded or underutilized. A very good reason for considering a binary form of feedback is the motivation of making the controller/manager of the resource as simple and efficient as possible. The requirement of a binary feedback often minimizes the work at the resource in generating the feedback.
1.3. Notations and Definitions

Figure 2 shows the assumed model of the network with n users sharing it.

Fig. 2. A control system model of n users sharing a network. (Each user i feeds its load x_i into the network; the network compares the total load Σ x_i with the goal X_goal and returns a binary feedback y to all users.)

The congestion state of the system is determined by the number of packets in the system. We assume a discrete time operation with time divided into small slots. These slots basically represent intervals at the beginning of which the users set their load level based on the network feedback received during the previous interval. If during time slot t the i-th user's load is x_i(t), then the total load at the bottleneck resource is Σ_i x_i(t), and the state of the system is denoted by the n-dimensional vector x(t) = {x_1(t), x_2(t), ..., x_n(t)}. Since we are operating at or near the knee, all resources demanded by the users are granted (this is not true at the cliff). Thus, x_i(t) denotes the i-th user's
D. -M. Chit~ R. Jain / Congestion Avoidance in Computer Networks
demand as well as the allocation of the system's resources. During the interval, the system determines its load level and sends a binary feedback y(t), which is interpreted by the users as follows:

y(t) = 0  ⇒  increase load,
y(t) = 1  ⇒  decrease load.
The users cooperate with the system and change (increase or decrease) their demands by an amount u_i(t). Thus,

x_i(t + 1) = x_i(t) + u_i(t).    (1)

The change u_i(t) represents the i-th user's control. It is a function of the user's previous demand and the system feedback:

u_i(t) = f(x_i(t), y(t)).    (2)

In other words, x_i(t + 1) = x_i(t) + f(x_i(t), y(t)). Notice that the users are not aware of other users' individual demands and, thus, cannot make u_i(t) a function of x_j(t), j ≠ i. In general, the control function f() can be any linear or nonlinear function. However, we will focus first on linear controls. The state equations (1) then reduce to

x_i(t + 1) = a_I + b_I x_i(t)   if y(t) = 0 ⇒ increase,
x_i(t + 1) = a_D + b_D x_i(t)   if y(t) = 1 ⇒ decrease.
Here, a_I, b_I, a_D, b_D are constants. The following are some examples of the control functions:

(1) Multiplicative Increase/Multiplicative Decrease:
x_i(t + 1) = b_I x_i(t)   if y(t) = 0 ⇒ increase,
x_i(t + 1) = b_D x_i(t)   if y(t) = 1 ⇒ decrease.
Here, b_I > 1 and 0 < b_D < 1. All users increase their demands by multiplying their previous demands by a constant factor. The decrease is also multiplicative.

(2) Additive Increase/Additive Decrease:
x_i(t + 1) = a_I + x_i(t)   if y(t) = 0 ⇒ increase,
x_i(t + 1) = a_D + x_i(t)   if y(t) = 1 ⇒ decrease.
Here, a_I > 0 and a_D < 0. All users increase their demands by adding a constant amount to their previous demands. The decrease is also additive.1

(3) Additive Increase/Multiplicative Decrease:
x_i(t + 1) = a_I + x_i(t)   if y(t) = 0 ⇒ increase,
x_i(t + 1) = b_D x_i(t)   if y(t) = 1 ⇒ decrease.
The increase is by a constant amount but the decrease is by a constant factor.

(4) Multiplicative Increase/Additive Decrease:
x_i(t + 1) = b_I x_i(t)   if y(t) = 0 ⇒ increase,
x_i(t + 1) = a_D + x_i(t)   if y(t) = 1 ⇒ decrease.
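A compact sketch of the four linear controls listed above (the parameter values chosen here are illustrative only; the paper does not prescribe them):

```python
def linear_control(x, y, a_i=1.0, b_i=1.0, a_d=0.0, b_d=1.0):
    """One step of the general linear control: x(t+1) = a + b*x(t), with
    (a_I, b_I) used on increase feedback (y = 0) and (a_D, b_D) on decrease (y = 1)."""
    a, b = (a_i, b_i) if y == 0 else (a_d, b_d)
    return max(0.0, a + b * x)          # truncate at zero, per footnote 1

# The four example policies as (a_I, b_I, a_D, b_D); the numbers are illustrative.
POLICIES = {
    "MIMD": dict(a_i=0.0, b_i=1.1, a_d=0.0, b_d=0.5),   # multiplicative incr./decr.
    "AIAD": dict(a_i=1.0, b_i=1.0, a_d=-1.0, b_d=1.0),  # additive incr./decr.
    "AIMD": dict(a_i=1.0, b_i=1.0, a_d=0.0, b_d=0.5),   # additive incr., mult. decr.
    "MIAD": dict(a_i=0.0, b_i=1.1, a_d=-1.0, b_d=1.0),  # mult. incr., additive decr.
}
```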
To evaluate the effectiveness of these controls, we define a set of criteria in the next section.
1.4. Criteria for Selecting Controls
The key criteria are: efficiency, fairness, distrib-
utedness, and convergence. We define them for-
mally as follows:
(1) Efficiency: The efficiency of a resource usage is defined by the closeness of the total load on the resource to its knee. If X_goal denotes the desired load level at the knee, then the resource is operating efficiently as long as the total allocation X(t) = Σ_i x_i(t) is close to X_goal. Overload (X(t) > X_goal) or underload (X(t) < X_goal) are both undesirable and are considered inefficient; we consider both as equally undesirable.
Notice that efficiency relates only to the total allocation, and thus two different allocations can both be efficient as long as the total allocation is close to the goal. The distribution of the total allocation among individual users is measured by the fairness criterion.
(2) Fairness: The fairness criterion has been widely studied in the literature. When multiple users share multiple resources, the maxmin fairness criterion has been widely adopted [2,3,5,10]. Essentially, the set of users is partitioned into equivalent classes according to which resource is their primary bottleneck. The maxmin criterion then asserts that the users in the same equivalent
1 It is assumed that truncation is applied when a_D + x_i(t) is less than zero, so that x_i(t) will never become negative.
class ought to have an equal share of the bottleneck. Thus, a system in which x_i(t) = x_j(t) for all i, j sharing the same bottleneck is operating fairly. If all users do not get exactly equal allocations, the system is less fair and we need an index or a function that quantifies the fairness. One such index is [6]:

F(x) = (Σ_i x_i)² / (n Σ_i x_i²).
This index has the following properties:
(a) The fairness is bounded between 0 and 1 (or 0% and 100%). A totally fair allocation (with all x_i's equal) has a fairness of 1, and a totally unfair allocation (with all resources given to only one user) has a fairness of 1/n, which goes to 0 in the limit as n tends to ∞.
(b) The fairness is independent of scale, i.e., the unit of measurement does not matter.
(c) The fairness is a continuous function. Any slight change in allocation shows up in the fairness.
(d) If only k of n users share the resource equally, with the remaining n − k users not receiving any resource, then the fairness is k/n.
For other properties of this fairness function, see [6].
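A direct sketch of the fairness index F(x) defined above (pure Python; the example allocations are illustrative and chosen to match properties (a) and (d)):

```python
def fairness_index(allocations):
    """Fairness index from [6]: F(x) = (sum x_i)^2 / (n * sum x_i^2)."""
    n = len(allocations)
    total = sum(allocations)
    sum_sq = sum(x * x for x in allocations)
    return (total * total) / (n * sum_sq) if sum_sq else 0.0

print(fairness_index([5, 5, 5, 5]))   # 1.0: totally fair
print(fairness_index([20, 0, 0, 0]))  # 0.25 = 1/n: totally unfair
print(fairness_index([5, 5, 0, 0]))   # 0.5 = k/n with k = 2 of n = 4 sharing equally
```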
(3) Distributedness: The next requirement that we put on the control scheme is that it be distributed. A centralized scheme requires complete knowledge of the state of the system. For example, we may want to know each individual user's demand or their sum. This information may be available at the resource. However, conveying this information to each and every user causes considerable overhead, especially since a user may be using several resources at the same time. We are thus primarily interested in control schemes that can be implemented in real networks and, therefore, we assume that the system provides the minimum amount of feedback. It only tells whether it is underloaded or overloaded via the binary feedback bits. Other information such as X_goal and the number of users sharing the resource is assumed to be unknown by the users. This restricts the set of feasible schemes. We, therefore, describe the set of feasible schemes with and without this restriction.
Fig. 3. Responsiveness and smoothness. (Total load on the network plotted against time, oscillating about the goal; the time to reach the goal measures responsiveness, and the size of the oscillations measures smoothness.)
(4) Convergence: Finally, we require the control scheme to converge. Convergence is generally measured by the speed with which (or time taken till) the system approaches the goal state from any starting state. However, due to the binary nature of the feedback, the system does not generally converge to a single steady state. Rather, the system reaches an equilibrium in which it oscillates around the optimal state. The time taken to reach this equilibrium and the size of the oscillations jointly determine the convergence. The time determines the responsiveness, and the size of the oscillations determines the smoothness of the control. Ideally, we would like the time as well as the oscillations to be small. Thus, controls with smaller time and smaller amplitude of oscillations are called more responsive and more smooth, respectively, as shown in Fig. 3.
1.5. Outline of this Paper

In this paper, we develop a simple and intuitive methodology to explain when and why a control converges. We address the following questions: What are all the possible solutions that converge to efficient and fair states? How do we compare those controls that converge?
The paper is organized as follows. In Section 2 we characterize the set of all linear controls that converge and, thus, identify the set of feasible controls. Then we narrow down the feasible set to a subset that satisfies our distributedness criterion. This subset still includes controls that have unacceptable magnitudes of oscillation or those that converge too slowly. Then in Section 3, we discuss
how to find the subset of feasible distributed controls that represent the optimal trade-off of responsiveness and smoothness, as we defined in convergence. In Section 4, we discuss how the results extend to nonlinear controls. And in the last section we summarize the results and discuss some of the practical considerations (such as simplicity, robustness, and scalability).
2. Feasible Linear Controls
2.1. Vector Representation o f the Dynamics
In determining the set of feasible controls, it is helpful to view the system state transitions as a trajectory through an n-dimensional vector space. We describe this method using a 2-user case, which can be viewed in a 2-dimensional space.
As shown in Fig. 4, any 2-user resource allocation {x_1(t), x_2(t)} can be represented as a point (x_1, x_2) in a 2-dimensional space. In this figure, the horizontal axis represents allocations to user 1, and the vertical axis represents allocations to user 2. All allocations for which x_1 + x_2 = X_goal are efficient allocations. This corresponds to the straight line marked "efficiency line". All allocations for which x_1 = x_2 are fair allocations. This corresponds to the straight line marked "fairness line". The two lines intersect at the point (X_goal/2, X_goal/2), which is the optimal point. The goal of control schemes should be to bring the
Fig. 4. Vector representation of a two-user case. (User 1's allocation x_1 on the horizontal axis and user 2's allocation x_2 on the vertical axis, showing the efficiency line, the fairness line, an equi-fairness line, and the overload region above the efficiency line.)
system to this point regardless of the starting position.
All points below the efficiency line represent an underloaded system, and ideally the system would ask users to increase their load. Consider, for example, the point x_0 = (x_10, x_20). The additive increase policy of increasing both users' allocations by a_I corresponds to moving along a 45° line. The multiplicative increase policy of increasing both users' allocations by a factor b_I corresponds to moving along the line that connects the origin to the point. Similarly, all points above the efficiency line represent an overloaded system; additive decrease is represented by a 45° line, while multiplicative decrease is represented by the line joining the point to the origin.
The fairness at any point (x1, x2) is given by

    Fairness = (x1 + x2)^2 / (2 (x1^2 + x2^2)).
Notice that multiplying both allocations by a factor b does not change the fairness. That is, (bx1, bx2) has the same fairness as (x1, x2) for all values of b. Thus, all points on the line joining a point to the origin have the same fairness. We, therefore, call a line passing through the origin an equi-fairness line. The fairness decreases as the slope of the line either increases above or decreases below the fairness line.
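As a quick check of this invariance (an illustrative snippet of ours, using the two-user formula above):

    def fairness(x1, x2):
        # Two-user fairness index from the formula above.
        return (x1 + x2) ** 2 / (2 * (x1 ** 2 + x2 ** 2))

    print(fairness(0.2, 0.6))          # ~0.8
    print(fairness(3 * 0.2, 3 * 0.6))  # ~0.8: scaling both allocations by b = 3 leaves fairness unchanged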
Figure 5 shows a complete trajectory of the two-user system starting from point x0 using an additive increase/multiplicative decrease control policy. The point x0 is below the efficiency line and so both users are asked to increase. They do so additively by moving at an angle of 45°. This brings them to x1, which happens to be above the efficiency line. The users are asked to decrease and they do so multiplicatively. This corresponds to moving towards the origin on the line joining x1 and the origin. This brings them to point x2, which happens to be below the efficiency line, and the cycle repeats. Notice that x2 has higher fairness than x0. Thus, with every cycle, the fairness increases slightly, and …
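The behavior described above can be reproduced with a short simulation (a sketch under assumed parameter values; Xgoal, the increase step, and the decrease factor are illustrative choices, not values from the paper):

    # Additive increase / multiplicative decrease for two users (illustrative values).
    X_GOAL = 1.0    # efficiency line: x1 + x2 = Xgoal
    A_I = 0.05      # additive increase step
    B_D = 0.7       # multiplicative decrease factor

    def fairness(x1, x2):
        return (x1 + x2) ** 2 / (2 * (x1 ** 2 + x2 ** 2))

    x1, x2 = 0.1, 0.5   # starting point x0, below the efficiency line
    for step in range(20):
        if x1 + x2 < X_GOAL:          # underloaded: increase additively (45-degree move)
            x1, x2 = x1 + A_I, x2 + A_I
        else:                         # overloaded: decrease multiplicatively (toward the origin)
            x1, x2 = x1 * B_D, x2 * B_D
        print(f"step {step:2d}  x = ({x1:.3f}, {x2:.3f})  fairness = {fairness(x1, x2):.3f}")

    # The fairness index never decreases across cycles and approaches 1, while the
    # point oscillates around the optimal point (Xgoal/2, Xgoal/2).

Each additive step raises the fairness slightly and each multiplicative step preserves it, which is why x2 ends up fairer than x0 in the trajectory described above.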