“MEASUREMENTS-IN-THE-MIDDLE”: INFERRING END-END PATH PROPERTIES AND CHARACTERISTICS OF TCP CONNECTIONS THROUGH PASSIVE MEASUREMENTS
A Dissertation Presented by SHARAD JAISWAL
Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY September 2005 Computer Science
© Copyright by Sharad Jaiswal 2005

All Rights Reserved
“MEASUREMENTS-IN-THE-MIDDLE”: INFERRING END-END PATH PROPERTIES AND CHARACTERISTICS OF TCP CONNECTIONS THROUGH PASSIVE MEASUREMENTS
A Dissertation Presented by SHARAD JAISWAL
Approved as to style and content by:
Jim Kurose, Chair
Don Towsley, Member
Mark Corner, Member
Lixin Gao, Member
Gianluca Iannaccone, Member
W. Bruce Croft, Department Chair Computer Science
to daadi, and my parents
ACKNOWLEDGEMENTS
My path to a PhD has been long, hard and enriching, and I have several people to thank. First and foremost are my advisors, Jim Kurose and Don Towsley. They are exemplary advisors, whose constant support, expert guidance and attention to all aspects of a student’s development in graduate school have contributed immensely to my growth, both professionally and personally, over these years. Through their scientific rigor and approach to research, they have set standards that I will strive to meet throughout my career. Gianluca Iannaccone has been a collaborator for the past four years, since I started on my thesis problem as an intern at Sprint Labs. I have learnt a great deal from him about how to do research and how to think about research problems. My heartfelt thanks to him, as a mentor and as a friend. I am thankful to Mark Corner and Lixin Gao for serving on my committee, and for several interesting and stimulating discussions about my work. I have had the good fortune to work with a highly collegial, motivated and smart group of fellow graduate students. I have gained immensely through my collaborations, in particular with Zihui Ge, Daniel Figueiredo and Wei Wei. I hope to continue interacting with my peers in the Networks group in the future. Laurie Connors, Betty Hardy, Pauline Hollister and Sharon Mallory have been an embodiment of the professionalism and care that characterize our department. Tyler Trafford has been an invaluable resource, helping to set up several experiments and troubleshoot technical problems. My friends in Amherst have been a wonderful source of support for the past several years, and I have shared many wonderful moments with them. In particular, I
wish to thank Hema Raghavan, Kausalya Murthy and Nasreen Abduljaleel, whose home has been my second home, for their affection and companionship, and for being by my side through many ups and downs; Abhishek Chandra, my old house-mate for three years; Yogita Mehra, for her unstinting support and faith in me through my PhD; Puru Kulkarni, Pritesh Sharma and Sudarshan Vasudevan, who have been great friends, partners in grad school and in many hikes through the years; Sonal Chitnis and Preyasee Kamath for many fun and unforgettable moments together and to Swati Birla, Ravisubhash Tangirala, Bhuvan Urgaonkar and Ramesh Nallapati, who have made life in Amherst and my graduate school experience so much more enjoyable and fun. My sister, Sanyukta, is an affectionate and patient friend, and an always dependable source of advice and support. And finally, my parents, Uma and Vijai Shanker Jaiswal. They built my foundation and showed me the way, and have since always stood on the side, ready with encouragement, support and advice. With my respects, gratitude and immense affection, I dedicate this thesis to them.
ABSTRACT
“MEASUREMENTS-IN-THE-MIDDLE”: INFERRING END-END PATH PROPERTIES AND CHARACTERISTICS OF TCP CONNECTIONS THROUGH PASSIVE MEASUREMENTS

SEPTEMBER 2005

SHARAD JAISWAL
B.E., THE REGIONAL ENGINEERING COLLEGE, TIRUCHIRAPALLI, INDIA
M.S., BOSTON UNIVERSITY
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Jim Kurose
We propose a methodology to infer end-end path properties of TCP connections (such as packet loss, reordering and delay) and other factors that affect TCP sender behavior, using passive measurements collected at a single point of observation along the end-end path. With this passive approach, an observation point in a Tier-1 backbone network can observe and analyze millions of TCP connections, originating and terminating at a highly diverse cross-section of end points in today’s Internet, a capability that is unmatched by current active measurement techniques. We apply our passive measurement inference techniques to traces collected in a large Tier-1 backbone network. We analyze the causes of out-of-sequence packets (due to loss, reordering and in-network duplication), the distribution and variation of round-trip delay, the congestion control flavors of TCP, and the extent to which application-level considerations limit TCP throughput. We validate the accuracy of our measurement-based inference techniques by comparing inferred behavior with active measurements made on monitored flows. But empirical validation alone is insufficient, since monitored paths can exhibit a wide range of network properties, many of which may occur only rarely. Thus, we use a combination of model-checking and formal reasoning to identify all possible events in the network for which the inference rules can produce incorrect results.
TABLE OF CONTENTS

                                                                     Page

ACKNOWLEDGEMENTS ...................................................... v
ABSTRACT ............................................................ vii
LIST OF TABLES ...................................................... xii
LIST OF FIGURES .................................................... xiii

CHAPTER

1. INTRODUCTION ....................................................... 1
   1.1 Thesis contributions ........................................... 3
   1.2 Thesis Outline ................................................. 6
2. CLASSIFYING OUT-OF-SEQUENCE PACKETS ................................ 8
   2.1 Related work ................................................... 9
   2.2 Methodology ................................................... 10
       2.2.1 Retransmissions ......................................... 14
       2.2.2 Reordering .............................................. 17
       2.2.3 Duplicates .............................................. 17
       2.2.4 Unneeded retransmissions ................................ 18
   2.3 Sources of estimation inaccuracy .............................. 19
   2.4 Conclusions ................................................... 20
3. INFERRING TCP CONNECTION CHARACTERISTICS .......................... 22
   3.1 Related work .................................................. 23
   3.2 Keeping track of the congestion window ........................ 25
       3.2.1 TCP flavor identification ............................... 28
             3.2.1.1 Use of SACK and ECN ............................. 28
   3.3 Round-trip time estimation .................................... 29
   3.4 Sources of estimation uncertainty ............................. 32
       3.4.1 Under-estimation of cwnd ................................ 32
       3.4.2 Over-estimation of cwnd ................................. 33
       3.4.3 Window scaling .......................................... 34
       3.4.4 Incorrect implementations ............................... 35
   3.5 Conclusions ................................................... 35
4. VALIDATION ........................................................ 37
   4.1 Simulations ................................................... 37
   4.2 Experiments over the network .................................. 39
   4.3 Conclusions ................................................... 45
5. APPLICATION OF INFERENCE TECHNIQUES TO TRACES FROM A TIER-1
   BACKBONE ......................................................... 46
   5.1 Measurement Infrastructure .................................... 46
       5.1.1 Measurement data ........................................ 46
       5.1.2 Out of sequence packets ................................. 49
       5.1.3 Network anomalies ....................................... 50
   5.2 Discussion of results ......................................... 52
       5.2.1 Representativeness of sampled flows ..................... 52
       5.2.2 Measuring end-end reordering/duplication frequency ...... 57
   5.3 Characteristics of TCP Connections ............................ 59
       5.3.1 Congestion window ....................................... 59
       5.3.2 TCP flavors ............................................. 62
       5.3.3 Greedy senders .......................................... 64
       5.3.4 Round trip times ........................................ 66
   5.4 Conclusions ................................................... 68
6. VERIFICATION ...................................................... 69
   6.1 Related work .................................................. 71
   6.2 Problem definition and approach ............................... 71
   6.3 Specification and Verification in SPIN ........................ 73
       6.3.1 Discussion of model assumptions ......................... 76
   6.4 Verifying inference heuristics for GB(N) ...................... 77
       6.4.1 Correctness of GB(N) in the presence of reordering ...... 77
       6.4.2 Detecting retransmitted and reordered packets ........... 80
       6.4.3 ACK losses .............................................. 82
       6.4.4 Unneeded Retransmissions ................................ 84
       6.4.5 Classifying out-of-sequence packets ..................... 84
   6.5 Generalization to any N ....................................... 86
   6.6 Extensions to GB(N) ........................................... 88
   6.7 Implications for TCP .......................................... 94
   6.8 Inferring all retransmission events ........................... 96
   6.9 Conclusions ................................................... 98
7. CONCLUSIONS AND FUTURE WORK ....................................... 99
   7.1 Future directions ............................................ 100

APPENDICES

A. PSEUDO-CODE FOR INFERRING TCP VARIABLES AND CONGESTION CONTROL
   FLAVOR .......................................................... 103
B. PROMELA CODE FOR GB(N) PROTOCOL .................................. 109
C. PROMELA CODE FOR FASTRETX(N) PROTOCOL ............................ 113

BIBLIOGRAPHY ........................................................ 117
LIST OF TABLES

Table                                                                Page

4.1  Out-of-sequence packets inference and classification with
     dummynet induced losses ....................................... 43
4.2  Out-of-sequence packet inference and classification with losses
     at wireless hop ............................................... 44
5.1  Summary of the traces used for OoS classification .............. 47
5.2  Average distances from our measurement point ................... 49
5.3  Out-of-sequence packet classification .......................... 50
5.4  Number of originating ASes, for studied TCP connections and all
     TCP connections in the traces ................................. 53
5.5  Percentage of packets reordered or duplicated after the
     measurement point ............................................. 59
5.6  Summary of the traces for TCP analysis ......................... 59
5.7  TCP flavors .................................................... 63
A.1  Pseudo-code describing the RTT estimation scheme .............. 108
LIST OF FIGURES

Figure                                                               Page

2.1  Retransmission due to a loss after the measurement point ....... 13
2.2  Decision process for the classification of out-of-sequence
     packets ....................................................... 13
2.3  Retransmission due to a loss before the measurement point ...... 15
2.4  An unneeded retransmission due to a lost ACK ................... 19
3.1  TCP running sample based RTT estimation ........................ 30
3.2  Differences in RTT estimation .................................. 31
4.1  Mean relative error of cwnd and RTT estimates in the
     simulations ................................................... 38
4.2  Validation experiments setup ................................... 39
4.3  Cumulative error of RTT inference scheme ....................... 40
4.4  Cumulative error in cwnd estimation ............................ 40
4.5  Cumulative error of RTT inference scheme - different metric .... 42
5.1  Distribution of time-to-live values ............................ 48
5.2  Distribution of the path length (in router-level hops) from the
     end-hosts to our measurement point ............................ 54
5.3  Scatter plot of the average number of hops from measurement
     point, and the % of OoS packets, in our traces ................ 54
5.4  Percentage of Out-of-Sequence Packets in all TCP flows, and
     symmetric TCP flows ........................................... 55
5.5  Packet reordering between the measurement point and the
     receiver ...................................................... 58
5.6  Cumulative fraction of senders as a function of the maximum
     window ........................................................ 61
5.7  Cumulative fraction of senders as a function of the maximum
     window ........................................................ 61
5.8  Cumulative fraction of packets as a function of the sender’s
     maximum window ................................................ 61
5.9  Percentage of Reno/NewReno senders (above) and packets (below)
     as a function of the data packets to transmit ................. 63
5.10 Fraction of greedy senders based on the distance between
     measurement point and receiver ................................ 65
5.11 qq-plot of flow size between flows with large ACK times, and
     all flows ..................................................... 66
5.12 Top: CDF of minimum RTT; Bottom: CDF of median RTT ............. 66
5.13 Variability of RTT. Top: ratio 95th/5th percentile; Bottom:
     difference between 95th and 5th percentiles ................... 67
6.1  The network scenario ........................................... 72
6.2  The network model .............................................. 73
6.3  Timelag of a reordering event: packet with sequence number x is
     reordered with subsequently sent sequence number x + 1 ........ 78
6.4  Cases in which retransmitted and reordered packets are
     undetected at measurement point, for GB(3), m = 7 ............. 80
6.5  Case E4(x, t''), GB(3) ......................................... 83
6.6  Case E5(x, t''), GB(3): Packet with sequence number x = 2 is
     lost before the measurement point. Packet with sequence number
     y = 4 is the first packet with sequence number > 2 sent by the
     sender that reaches the measurement point. The retransmission
     of 2 is reordered with sequence number 4 and is not detected as
     OoS at the measurement point .................................. 85
6.7  Recovery through Fast Retransmit in FastRetx(4), dupacks = 2 ... 89
6.8  Event E7(x, t''). Packet with sequence number 5 is lost before
     measurement point and, due to a successive fast retransmit
     event, is retransmitted before the sender could send any
     packets with sequence number > 5. The retransmission of 5 is
     not detected to be OoS ........................................ 90
6.9  Undetected retransmissions in TCP Reno ......................... 95
A.1  Variables required to keep track of the cwnd of the three TCP
     flavors ...................................................... 103
A.2  Function used to update cwnd during slow-start and congestion
     avoidance .................................................... 104
A.3  Pseudocode to track congestion window for Tahoe ............... 104
A.4  Pseudocode to track congestion window for Reno ................ 105
A.5  Pseudocode to track congestion window of NewReno .............. 106
A.6  Function used to reset ocwnd .................................. 106
A.7  Testing TCP flavors ........................................... 107
CHAPTER 1 INTRODUCTION
Internet measurement has been an important and active area of research since the early days of the Internet [30]. Measurements provide insight into the nature and characteristics of network traffic and network performance, and thus have several important applications. They provide quantitative information about performance problems and the occurrence of undesirable events. Inferences about network conditions based on measurements can help locate faults, hot-spots and bottleneck regions. Finally, measurements are used to provision links and routers, to design new architectures and protocols, and as inputs to mathematical models of various network components.

Research in network measurement has progressed in two general directions. One thrust has been to develop techniques and tools that actively generate traffic between two end-hosts, collect measurements at the end points, and use these measurements to make inferences about the end-end path between the hosts. A key advantage of this approach is that one has complete control over the end points and the traffic generated, making it easier to measure or infer characteristics of the end-end path. Techniques using this general approach have been used to characterize end-end loss [43, 51, 12, 13], delay [43, 12], path pathologies such as reordering and replication [43, 9], and available bandwidth [4, 46]. More sophisticated approaches have exploited the correlation between end-end measurements taken from different end points to infer the loss rate and delay of links within the network that are not directly observable [34, 15, 17, 18].

While active measurements can be used to characterize end-end performance, their implementation requires the co-ordination, instrumentation and deployment of end-host
measurement points. Since it can be hard to gain control over multiple end points, this becomes an extremely complex endeavor as the number of end points increases. Thus, it is difficult and impractical to use active approaches for large-scale wide-area measurements.

The second direction of research has been to passively observe traffic at a single point within the network. Measurements are taken by a monitor, which passively observes packets passing through a link. Importantly, no new traffic is added to the network. The primary advantage of this approach is its scalability. From a single measurement point, a traffic monitor can observe a very large and diverse set of end-end paths, connections and end-hosts. Setting up these measurement points also requires access to fewer points in the network, and is thus logistically less complex. Also, since these measurements involve only existing traffic, the process is non-intrusive and does not itself affect network traffic. This also eliminates the concerns about the representativeness, with respect to “real” traffic, of inferences drawn from synthetic traffic.

Passive measurements have been used in previous studies to characterize packet inter-arrival times, packet sizes, and the protocol and application mix of the traffic observed at the measurement point. Similarly, one can use SNMP information collected at a router to determine the number of packets forwarded, the number of dropped packets, and the state of the links connected to the router. What is common to these studies is that they capture only localized performance measures and characteristics of the traffic traversing the observation point. An important application of passive measurements is to infer the end-end performance of the paths taken by the packets passing through the measurement point.
A passive monitor established in an ISP’s network and making end-end inferences can have several practical applications. By monitoring the traffic of its customers, an ISP can track performance metrics (such as loss rates, reordering rates and end-end delay) of the end-end paths traversed by these connections. The ISP can quickly detect and infer the causes of poor performance,
such as extensive reordering on an access link, or packet drops due to congestion caused by a sudden traffic spike on some links. The ISP can also determine the location of the problem: whether it occurred on the access link to the ISP’s network, within its network, or elsewhere. It can combine information about the performance of multiple end-end paths to identify (and remediate) the particular internal network elements causing the problem. The ISP can also correlate the impact of events in the network (such as link outages and route changes) and sudden changes in traffic load (due to flash crowds, Internet worms, etc.) on the end-end performance of its customers.

In addition to assisting an ISP’s operations, passive measurements can also provide unique insights into the characteristics of Internet traffic. Because a passive measurement point can observe a large amount of traffic originating from a very diverse set of end-end paths, a larger and therefore more representative sample of traffic can be collected than with active measurement techniques.

The aim of this thesis is to develop techniques that infer end-end performance (such as packet loss and delay) and end-host properties (implementations, sender behavior) from passive measurements taken at a single point within the network. In this work, we focus on connections using TCP. TCP (Transmission Control Protocol) is the dominant end-to-end transport protocol currently deployed in the Internet. A wide range of applications, including Web traffic, grid applications, and newly emerging peer-to-peer applications, rely on TCP’s transport services. Given this reliance on TCP, there is currently great interest in understanding TCP’s performance and characterizing the factors (such as network congestion, sender/receiver buffer limits, and sender data-starvation) that can limit its behavior in practice.
1.1 Thesis contributions

We now outline our main contributions, which are:
• A passive measurement based methodology: We present inference techniques based on passive measurements to:

  – Detect out-of-sequence packets at the measurement point, and classify them by cause as:

    ∗ retransmissions: sent in response to an earlier packet loss, and hence an indication of congestion or poor channel conditions in the path.
    ∗ reorderings: caused by load balancing or route changes; reordering can adversely affect the sender’s rate by triggering retransmissions.
    ∗ replications: caused by faulty equipment or routing loops in the path.

    Quantifying and classifying the occurrence of such phenomena is important, since they are a measure of the “health” of the end-end path.

  – Infer and keep track of a TCP connection’s Round Trip Time (RTT) estimate through the connection’s lifetime. End-end delay and variation in RTT are indications of congestion along the end-end path. Knowledge of a connection’s RTT is also of interest for computing the amount of buffering required on a link, or for configuring Active Queue Management.

  – Infer and keep track of the TCP sender’s state (whether the sender is in slow-start, congestion avoidance, or recovering from a timeout or TCP’s fast retransmit mechanism), and variables such as the sender’s congestion window. Tracking these variables allows us to:

    ∗ infer whether sender throughput is limited by the network, by application behavior, or by end-host buffers.
    ∗ infer the flavor of the TCP congestion control algorithm (e.g., whether the TCP sender is Tahoe, Reno or NewReno). The congestion control mechanism determines the sender’s response (and hence its sending rate) in the event of a loss.

  Our approach is validated through a mixture of simulations, real-world experimentation and formal analysis.

• Passive measurements in a Tier-1 backbone: We apply our techniques to traces collected within the Sprint backbone network, a large Tier-1 provider. Our study is unique in that it examines a remarkably large and diverse set of TCP connections. Given our passive methodology and measurement points within a Tier-1 network provider, we are able to analyze more than 10 million connections, with senders located in more than 45% of the autonomous systems in today’s Internet. Our results quantify the extent to which out-of-sequence packets are observed, and the events that cause them, such as TCP packet retransmissions, reorderings and replications. We also study the impact that events such as reorderings and retransmissions (the relative frequency of TCP timeouts vs. fast retransmissions) have on TCP connections. We study the variation of the round-trip delay within TCP connections. We also examine factors that affect the throughput of a TCP connection, e.g., whether throughput is limited by a lack of data to send or by network congestion, and study the impact of TCP congestion control flavors on TCP connections.

• Formal analysis of inference techniques: The nature of passive measurements also presents a number of challenges. Since the monitor can be located anywhere on the end-end path, it has limited knowledge of the events that occur along the path. Also, since monitored traffic may arise from and traverse a very heterogeneous set of sources and end-end paths, inference techniques must deal with a wide range of path properties (end-end delay, packet loss, reordering rates, etc.) and network conditions along the path (link failures, route changes, etc.). The limited observability of network events and their heterogeneity underscore one of the main challenges in performing passive measurements: “How does one determine whether the inferences made by the passive monitor are correct and complete (i.e., all events are detected)?” Simulations and real-world experiments allow one to validate inference techniques against the traffic generated or observed, but they fall short of exercising all possible network scenarios under which the techniques may fail. We apply a formal verification based approach to the problem of inferring packet retransmissions and reorderings from passively observed packets at a single measurement point. We define classification rules for this inference problem and, through a combination of model-checking and formal reasoning, uncover all possible events in the network for which the rules produce incorrect inferences. Our work is novel in its use of formal verification tools for evaluating inference techniques in network measurements.

The techniques developed in this dissertation can serve as a valuable diagnostic tool for ISPs, to infer the performance of the end-end paths of the connections that traverse them, and to understand the factors that limit the performance of end hosts. Moreover, given the diversity of the paths examined and the large number of connections monitored, our techniques enable a much richer, wider and more representative set of measurements to be undertaken in the Internet.
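To make the classification task above concrete, the following sketch shows how a passive monitor might distinguish the three causes for a packet it has already seen, using the packet's IP ID field and the elapsed time since the first sighting as hints. This is an illustrative simplification, not the dissertation's actual rules: the fixed RTT threshold, the per-flow state layout, and the treatment of sequence numbers as simple packet indices are all assumptions of the example.

```python
from dataclasses import dataclass, field

@dataclass
class FlowState:
    """Per-connection state kept at the monitor (illustrative only)."""
    next_seq: int = 0          # smallest sequence number not yet seen in order
    seen: dict = field(default_factory=dict)   # seq -> (first_time, first_ip_id)
    rtt: float = 0.2           # assumed running RTT estimate, in seconds

def classify(flow: FlowState, seq: int, ip_id: int, ts: float) -> str:
    """Classify one observed data packet.

    Simplified heuristics: an in-network duplicate carries the same IP ID
    as the original; a reordered packet reappears within less than an RTT;
    anything else is treated as a sender retransmission.
    """
    if seq >= flow.next_seq:                    # sequence advances: in order
        flow.next_seq = seq + 1
        flow.seen[seq] = (ts, ip_id)
        return "in-sequence"
    first = flow.seen.get(seq)
    if first is not None:
        first_ts, first_id = first
        if ip_id == first_id:
            return "replication"                # same datagram seen twice
        if ts - first_ts < flow.rtt:
            return "reordering"                 # came back too soon for a resend
    return "retransmission"                     # sender reacted to a (likely) loss

flow = FlowState()
print(classify(flow, 0, ip_id=100, ts=0.00))    # in-sequence
print(classify(flow, 1, ip_id=101, ts=0.01))    # in-sequence
print(classify(flow, 0, ip_id=100, ts=0.02))    # replication
print(classify(flow, 1, ip_id=150, ts=0.50))    # retransmission
```

The intuition behind the sketch is that an in-network duplicate carries the same IP header as the original, a reordered packet reappears within less than a round-trip time, and a retransmission arrives only after the sender has had time to detect the loss.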
1.2 Thesis Outline
This thesis is organized as follows. We describe our methodology for classifying outof-sequence packets in Chapter 2. Several of our classification heuristics depend on an accurate estimate of the sender’s Round Trip Time estimate, and in Chapter 3 we present our approach to keep track of a TCP sender’s state variables, the flavor of the congestion control algorithm implemented at the sender, and how to use the estimate of the sender’s window to compute the RTT of a TCP connection. We validate our techniques through simulations and real-world measurements in Chapter 4. In Chapter 5 we apply our techniques to traces collected in the Sprint IP backbone, and present measurement results about the occurrence and causes behind out-of-sequence phenomenon, and factors which affect TCP sender behavior, for TCP connections going through the Sprint backbone network. 6
In Chapter 6 we formally verify our inference techniques through a combination of model checking and formal reasoning about network and protocol behavior. Finally, in Chapter 7, we present a summary and conclude this dissertation.
CHAPTER 2 CLASSIFYING OUT-OF-SEQUENCE PACKETS
An important characteristic of any TCP connection is the sequencing of packets within that connection. Generally, if sequence numbers are monotonically increasing, then all is well: data flows through that connection without loss, and the network does not introduce pathological problems such as in-network duplication and reordering. Conversely, out-of-sequence packets indicate that the connection suffers from loss, duplication or reordering. It is thus of interest to study the occurrence of out-of-sequence packets within an Internet TCP connection, and to identify their causes, as the magnitude of out-of-sequence packets is a good indicator of the “health” of that TCP connection and of the path that it traverses. We now present techniques to classify out-of-sequence packets in TCP connections observed at a single measurement point in the end-end path. Informally, we say that a packet is out-of-sequence if it has a sequence number that is smaller than or equal to that of a previously observed packet in that connection.1 Our first contribution is methodological: we passively measure out-of-sequence packets at a single point in the end-end path (rather than by actively sending and measuring end-to-end probe traffic at the sender or receiver [9, 43]). Therefore, a new methodology is required to infer the causes of a connection’s out-of-sequence packets based only on observations from the “middle” of its end-end path. An advantage of this approach is that the measurement point can be located within a large network, e.g. a backbone network, and thus be able to characterize the behavior of flows
1 In this work, the terms “out-of-sequence” and “reordered” do not have the same meaning. As will be discussed shortly, out-of-sequence packets can be caused by sender retransmissions, in-network duplication, and reordering of packets within the network on their end-to-end path. Previous studies have often used the terms “out-of-sequence” and “reordered” interchangeably.
between a very large number of source-destination pairs, without having to instrument the individual senders and receivers. However, there are also several methodological challenges associated with having a single measurement point in the “middle” of a connection’s end-end path. The measurement point can only observe a limited set of events along the end-end path, and does not know the actions taken at the sender or receiver; it can only infer these from previously- or subsequently-observed packets of that connection and from knowledge of TCP behavior. We describe techniques and rules to infer and classify the causes of observed out-of-sequence behavior, and validate these techniques empirically in a wide-area test bed. Our second contribution is the characterization of out-of-sequence behavior in the Internet. We infer the out-of-sequence behavior of 29 million TCP connections, generated in more than 7,600 unique Autonomous Systems, through traces collected in a large Tier-1 backbone network. We shall describe the results from this study in detail in Chapter 5. The remainder of this chapter is structured as follows. We begin in Section 2.1 by discussing related work. We describe the rules and rationale that we use to infer and classify the causes of observed out-of-sequence behavior in Section 2.2. In Section 2.3 we identify sources of uncertainty and possible errors in our inference, and we conclude in Section 2.4 by summarizing our contributions and outlining the empirical and formal validation of our techniques that we shall describe in later chapters.
2.1 Related work A number of previous efforts have examined packet reordering. Bennett et al. [9] sent bursts of ICMP probe packets through a network public exchange point and reported on the magnitude of packet reordering induced in these probes by a specific switch in the exchange point. Paxson [43] also examined the extent and effects of packet reordering and replication in TCP bulk data transfers between pairs of measurement sites. Recently, Bellardo et al. [7] described active measurement techniques that can estimate packet reordering in TCP data
transfers from a remote host, although they did not carry out extensive measurement studies using their techniques. Our work differs from these earlier works in both the scope of the measurements (millions of connections from thousands of different networks) and our methodology (we passively measure traffic at a single point within the network, rather than actively sending probes and taking measurements at both source and destination). We will discuss our results in relation to these studies in more detail in Chapter 5. There has also been recent work on inferring loss properties of an end-end path based on passive observations of TCP connections. In [8], the authors consider measurements taken at a core or an ingress router, and compute the ratio of packets lost before the measurement point to those lost after it. However, the focus of this work is on the estimation of the packet loss ratio rather than on the characterization of the out-of-sequence phenomenon, and it achieves its objective by discarding all packets that cannot be straightforwardly classified as retransmissions. Moreover, their approach assumes that OS stacks generate unique and monotonically increasing IPID values, a property that is not universal and is prone to change for various security reasons. Our work instead develops a more general approach, based on assumptions about network behavior, knowledge of the TCP protocol, and information in the packet headers (such as the IPID field), to classify all packets that traverse the measurement point.
2.2 Methodology In this section, we describe the methodology and rules used to classify the out-ofsequence packets observed at the measurement point. We capture and record the first 44 bytes of IP and TCP packet headers of all packets passing across a link. We collect two traces, for each monitored link, corresponding to data flowing in either direction. In Chapter 5 we will describe the specific links within the Sprint backbone at which the measurements were taken. For our purposes here, we need only consider these measurements as a recorded sequence of packet headers.
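As a concrete illustration, a 44-byte header record of the kind described above (a 20-byte IP header, a 20-byte TCP header without options, and the first 4 bytes of payload) can be decoded with the standard library alone. This is our own sketch, not the authors' tooling; field layout follows RFCs 791 and 793.

```python
import struct

def parse_record(record: bytes) -> dict:
    """Decode one 44-byte trace record: 20-byte IP header, 20-byte TCP
    header (no options), plus the first 4 bytes of payload (ignored here).
    """
    assert len(record) == 44
    ip = struct.unpack("!BBHHHBBH4s4s", record[:20])   # RFC 791 layout
    tcp = struct.unpack("!HHIIBBHHH", record[20:40])   # RFC 793 layout
    return {
        "src": ".".join(str(b) for b in ip[8]),
        "dst": ".".join(str(b) for b in ip[9]),
        "ipid": ip[3],        # IP identification field (IPID)
        "sport": tcp[0],
        "dport": tcp[1],
        "seq": tcp[2],        # TCP sequence number
        "ack": tcp[3],        # cumulative acknowledgment number
    }
```

In practice the classifier only needs the connection 4-tuple, the IPID, and the sequence/acknowledgment numbers, which is why 44 bytes per packet suffice.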
Given traces of observed sender-to-receiver data packet headers and receiver-to-sender acknowledgment headers on a particular link, we process the traces to classify the causes of out-of-sequence packets. The first step in processing the traces is to filter them, in order to consider only those connections for which both the sender-to-receiver data packets and the receiver-to-sender acknowledgments pass through the link at which the measurements are taken. We report on the fraction of connections for which this requirement holds, and on the possible sampling bias that the removal of the remaining TCP flows introduces, in Chapter 5. Once the trace is filtered, we can identify and classify the out-of-sequence packets. A packet is out-of-sequence if its sequence number is less than or equal to a previously observed sequence number in that connection. An out-of-sequence packet can be caused by three different events:
• Retransmission. In this case, a sender infers that a packet has been lost and retransmits the packet. The sequence number of the retransmitted packet is smaller than those of previously observed packets of the same TCP connection at the measurement point, and hence the packet will be deemed “out-of-sequence.”
• Network duplication. In this case, a non-sender-retransmitted copy of a packet is observed. This can occur when the measurement point is within a routing loop (and hence the same packet is observed more than once), or if the network itself creates a duplicate copy of the packet.
• In-network reordering. The network can invert the order of two packets in a connection (because of parallelism within a router [9] or a route change).
As noted earlier, previous studies have used the terms “reordering,” “out-of-order” and “out-of-sequence” interchangeably. We emphasize that a reordering event in our classification is just one of a possible set of events that result in an out-of-sequence packet. Our classification method will use the knowledge of observed sequence and acknowledgment
numbers in the TCP header, the identification field in the IP datagram (IPID), and the times at which observations are made, to infer the cause of an observed out-of-sequence packet. Every TCP packet going into the network has a sequence number field identifying the beginning of the payload carried by the packet, and an acknowledgment number field that cumulatively acknowledges the data observed from the opposite direction. Any packet going through the IP stack in the sender’s kernel is assigned a value in the IPID field of the packet’s IP header. In many current TCP/IP implementations this is a unique and monotonically increasing value assigned to every outgoing IP packet. Hence, if a packet is retransmitted by TCP it may have the same sequence number, but a different IPID value. When the measurement point observes a sender-to-receiver data packet with an IPID value W and a TCP sequence number x at time t, it is referred to using the 3-tuple notation (W, x, t). Figure 2.1 illustrates this notation. In this example, the sender sends packets x − 1 through x + 3. A packet with sequence number x and IPID W is observed at the measurement point at time t (and denoted by (W, x, t)). Because of the subsequent loss of x between the measurement point and the receiver, the sender will eventually retransmit x. The retransmitted copy observed by the measurement point at time t′ is then denoted by (W′, x, t′). We reiterate, to ensure our notation is clear, that when we use the 3-tuple notation (W, x, t) to refer to or identify a TCP packet, we are referring to the observation of a packet with IPID W and sequence number x at time t at the measurement point. Figure 2.2 summarizes the decision process for classifying out-of-sequence packets.
The edges leading to leaf nodes (the classifications) in the decision tree in Figure 2.2 are annotated with the decision rules (R1 through R7, described below) corresponding to those classifications. Let us now consider these classification rules.
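Before doing so, the (W, x, t) observations can be captured as a small record type. This is our own sketch for exposition; the field names are ours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obs:
    """One observation at the measurement point: the (W, x, t) 3-tuple."""
    ipid: int   # W: IP identification field
    seq: int    # x: TCP sequence number
    t: float    # time of observation at the measurement point

# The scenario of Figure 2.1: packet x observed at time t, then its
# retransmission (new IPID W', same sequence number x) at time t'.
first = Obs(ipid=100, seq=5000, t=1.00)
retx = Obs(ipid=157, seq=5000, t=2.10)
assert retx.seq == first.seq and retx.ipid != first.ipid
```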
Figure 2.1. Retransmission due to a loss after the measurement point
Figure 2.2. Decision process for the classification of out-of-sequence packets
2.2.1 Retransmissions
Figure 2.1 shows the case in which a packet (W, x, t) is observed at the measurement point, and no acknowledgment covering x (i.e., an acknowledgment for a packet with a sequence number greater than or equal to x) is observed before another packet (W′, x, t′) with the same sequence number is observed. Rule R1 below specifies the cases in which this second packet is classified as a retransmission.
Rule R1
Assume (W, x, t) is observed at the measurement point and no acknowledgment covering x is observed before another packet (W′, x, t′) with the same sequence number, x. The interval (t′ − t) between the arrivals of these two packets at the measurement point is referred to as the timelag. In this case, either (W, x, t) is lost between the measurement point and the receiver, or its ACK is lost between the receiver and the measurement point. Then (W′, x, t′) is classified as a retransmission if any of the following conditions is true:
• R1.a: W′ ≠ W
• R1.b: the timelag (t′ − t) > RTO, i.e., the time between the observations of (W, x, t) and (W′, x, t′) is greater than RTO, the estimated sender timeout interval. Note that since RTO is a function of the RTT, we will need to estimate RTT. We will need RTT for other purposes as well; we address the problem of estimating the sender’s RTT in Chapter 3.
• R1.c: the number of duplicate ACKs observed in the period (t, t′) exceeds the sender’s duplicate ACK threshold (which we take in this work to be the typical value of 3).
A closely related scenario to those covered by R1 is the case in which an out-of-sequence packet (W′, x, t′) is observed, but as shown in Figure 2.3, the measurement point has not previously observed a packet with sequence number x. In this case, the original packet x was sent by the sender but lost before the measurement point.
Figure 2.3. Retransmission due to a loss before the measurement point
Rule R2
Assume (W′, x, t′) is observed, but no packet with sequence number x has been observed before at the measurement point. Also assume that no acknowledgment covering x is observed before (W′, x, t′). In this case, (W′, x, t′) is classified as a retransmission if any of the following conditions is true:
• R2.a: the timelag (t′ − t″) > RTO, where t″ is the earliest time at which a packet with a sequence number greater than x is observed. We use t″ here since, as noted above, the measurement point has not previously observed a packet with sequence number x.
• R2.b: the number of duplicate ACKs observed after t″ but before t′ exceeds the sender’s duplicate ACK threshold (see rule R1.c). In addition, we require that t′ − t″ > RTT, since a sender would not react to any event in the network in less than one RTT.
Rules R1 and R2 rely on the notion that packet retransmissions either have distinct IPIDs, have a timelag greater than RTO, or have elicited at least 3 duplicate ACKs from the receiver. However, not all retransmissions from the sender meet these criteria. After a TCP sender receives 3 duplicate ACKs, it retransmits the packet for which the duplicate ACKs were intended, and enters the fast recovery phase. At this point, the flavor of the sender’s congestion control algorithm comes into play. If more than one packet was lost in the window
before the sender entered the recovery phase, the sender need not observe three duplicate ACKs for each of the lost packets in order to retransmit them. For example, in TCP NewReno, after entering the fast recovery phase, the sender stores the sequence number of the most recently transmitted data packet in the variable sndHigh. Thereafter, if the TCP sender receives acknowledgments for packets with a sequence number less than sndHigh,2 the sender promptly retransmits the packet. The TCP sender eventually exits fast recovery when it receives an acknowledgment that covers the sequence number stored in sndHigh. In order to account for fast recovery retransmissions that are not triggered by 3 duplicate ACKs, we introduce two variables into our classification scheme. The variable inFastRecovery is a flag associated with a monitored TCP connection, which is set when the TCP sender enters the fast recovery phase; the variable sndHigh stores the value of the last sequence number observed before the TCP sender entered fast recovery. Whenever an out-of-sequence packet is classified as a retransmission by either rule R1.c or R2.b (the presence of 3 duplicate ACKs), and the flag inFastRecovery is not currently set for this connection, then inFastRecovery is set to 1, and the value of the latest observed sequence number is stored in sndHigh. The above considerations are embedded in Rule R3, which allows us to identify TCP retransmissions during the fast recovery phase:
• R3: If rules R1 and R2 do not hold, but the flag inFastRecovery is set and the sequence number x < sndHigh, then this out-of-sequence packet is a retransmission, a result of TCP fast recovery.
2 These ACKs are usually called partial ACKs.
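The fast-recovery bookkeeping described above can be sketched as a small state holder (our own sketch; the attribute names follow the text's inFastRecovery and sndHigh):

```python
class FastRecoveryTracker:
    """Per-connection state used by Rule R3 (a sketch of the scheme in
    the text, not the authors' implementation)."""

    def __init__(self):
        self.in_fast_recovery = False
        self.snd_high = 0

    def on_dupack_retransmission(self, highest_seq_seen):
        # Rule R1.c or R2.b fired: if not already in fast recovery, enter
        # it and remember the highest sequence number observed so far.
        if not self.in_fast_recovery:
            self.in_fast_recovery = True
            self.snd_high = highest_seq_seen

    def classify_r3(self, seq):
        # Rule R3: during fast recovery, a retransmission of a packet
        # below sndHigh needs no further duplicate ACKs.
        return self.in_fast_recovery and seq < self.snd_high

    def on_ack(self, ack):
        # An ACK covering sndHigh ends fast recovery.
        if self.in_fast_recovery and ack >= self.snd_high:
            self.in_fast_recovery = False
```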
2.2.2 Reordering
Rule R2.a requires that t′ − t″ > RTO, indicating that sufficient time has passed for the out-of-sequence packet to possibly be a retransmission of an earlier packet. But what if this time lag is too small for a retransmission to have taken place, i.e., we observe an out-of-sequence packet, but the interval between when it would have been observed (had it been received in order) and when it is actually observed (out of order) is too short for the sender to have retransmitted the packet? In this case, the packet cannot be a retransmission, and must instead have been mis-ordered within the network between the source and the measurement point. But how short a time is “too short a time?” Rather than require that the interval be less than RTO (which, as discussed in Section 3.4, is subject to a degree of uncertainty), we take a more conservative approach and require that this interval be less than RTT. Thus we have:
Rule R4
Assume an out-of-sequence packet (W′, x, t′) is observed, and again, the measurement point has not previously observed a packet with sequence number x. Furthermore, assume that none of the conditions R2.a, R2.b and R3 holds. In this case, we know (W′, x, t′) is not a retransmission. Again, let t″ be the earliest time at which a packet with a sequence number greater than x is observed. If t′ − t″ < RTT, then the observed packet (W′, x, t′) is classified as a reordered packet: a packet that was transmitted in order by the sender but reordered within the network before it reached the measurement point.
2.2.3 Duplicates
Suppose now that we observe a packet (W, x, t′) that has the same IPID and sequence number as an earlier-observed packet (W, x, t). Assume further that condition R1.c does not hold, i.e., that we have not seen enough duplicate ACKs to trigger a retransmission, and that the time interval t′ − t is smaller than the RTT. Given these conditions, the observed packet (W, x, t′) cannot be a sender retransmission, and yet is identical (in IPID and
sequence number) to a recently observed packet. We classify such a packet as a network-generated duplicate.
Rule R5
Assume (W, x, t) and (W, x, t′), t < t′, are observed at the measurement point. Assume also that the number of duplicate ACKs observed after t but before t′ does not exceed the sender’s duplicate ACK threshold. Finally, assume that t′ − t < RTT. In this case, we classify the packet as a network-generated duplicate.
2.2.4 Unneeded retransmissions
Figure 2.4 illustrates the case in which a packet with sequence number x and its acknowledgment (or the acknowledgment of a packet with a sequence number higher than x, i.e., an acknowledgment “covering” this packet) are observed at the measurement point. Although the receiver has clearly received packet x, the sender may still retransmit the packet if either the ACK is lost between the measurement point and the sender, or the sender times out prematurely. In either case, when a second packet is observed with sequence number x, it is a retransmission, but one that is not needed by the receiver. We thus have:
Rule R6
If a packet (W, x, t) and an acknowledgment covering this packet have been observed, and a second packet (W′, x, t′), with an IPID different from that of any other packet of this connection, is observed, then the second packet is classified as an unneeded retransmission.
In all other cases, not satisfying rules R1 through R6, we classify the cause of the out-of-sequence packet as unknown.
Rule R7
A packet not classified under rules R1 through R6 is classified as “unknown.” We will shortly show that only a very small fraction of the observed out-of-sequence packets are not classifiable under rules R1 through R6.
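The decision process of rules R1 through R7 can be rendered as a single sketch function. This is our own simplified rendering, not the authors' implementation: RTT and RTO estimates are taken as given, the Rule R3 fast-recovery state is passed in as flags, and Rule R6's "IPID different from any other" check is approximated by comparing against the earlier copy.

```python
DUP_ACK_THRESHOLD = 3  # the typical value assumed in the text

def classify(ipid, seq, t, prev, acked, dupacks, rtt, rto,
             in_fast_recovery=False, snd_high=0, t_higher=None):
    """Classify an out-of-sequence packet (ipid, seq, t).

    prev     -- (ipid, t) of an earlier packet with the same sequence
                number, or None if no such packet was observed
    acked    -- True if an ACK covering seq was seen before time t
    dupacks  -- duplicate ACKs observed in the relevant interval
    t_higher -- earliest time a higher sequence number was observed
    """
    if prev is not None:
        prev_ipid, prev_t = prev
        if acked and ipid != prev_ipid:
            return "R6: unneeded retransmission"
        if ipid != prev_ipid:
            return "R1.a: retransmission"
        if t - prev_t > rto:
            return "R1.b: retransmission"
        if dupacks >= DUP_ACK_THRESHOLD:
            return "R1.c: retransmission"
        if t - prev_t < rtt:
            return "R5: network duplicate"
    else:
        if t - t_higher > rto:
            return "R2.a: retransmission"
        if dupacks >= DUP_ACK_THRESHOLD and t - t_higher > rtt:
            return "R2.b: retransmission"
        if in_fast_recovery and seq < snd_high:
            return "R3: fast-recovery retransmission"
        if t - t_higher < rtt:
            return "R4: reordering"
    return "R7: unknown"
```

For instance, a packet with the same sequence number but a new IPID falls under R1.a, while a same-IPID copy arriving within one RTT and without duplicate ACKs falls under R5.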
Figure 2.4. An unneeded retransmission due to a lost ACK
2.3 Sources of estimation inaccuracy
Given the location of the measurement point in the middle of a connection’s end-end path, a set of potential errors is introduced as a result of the difference between the information the measurement point observes and what is actually observed by the sender and the receiver. Moreover, inaccuracies in our estimation techniques are also introduced by end-host effects that are difficult to account for while making observations only in the middle of the end-end path. Some of these cases are:
• Errors in assumptions: Our assumption is that reordering events happen at timescales smaller than a connection’s RTT. When this assumption is violated, a reordered packet may be classified as a retransmission.
• Errors in RTT estimates: As stated in our discussion in Section 2.2, rules R1, R2, and R4 either directly or indirectly make use of the current value of the sender’s RTT. Hence, erroneous RTT estimates can affect the accuracy of our classification.
• Incompleteness of observations: From a single observation point along the end-end path it is not possible to observe all the events that a TCP sender or receiver would. As we described before, using additional information about the behavior of the TCP protocol, one can infer some of those events. However, there is still a set of out-of-sequence events that can go completely undetected, given that they are not manifested at the measurement point. For example, if the first packet of the connection (the SYN packet) is lost before our measurement point, this event will go undetected. Likewise, the measurement point cannot detect the loss of an entire window of packets that occurs between the sender and the measurement point. In this case, since the measurement point does not observe any packets between the lost packets and their subsequent retransmissions, it is not possible to identify the retransmissions as out-of-sequence packets. According to the rules described before, if a packet is reordered between the measurement point and the receiver, it goes undetected. However, in Chapter 5 we attempt to also quantitatively account for reorderings downstream of the measurement point.
2.4 Conclusions
We have proposed a methodology to classify out-of-sequence packets from passive measurements taken at a single point in the end-end path. We have also discussed some fundamental challenges concerning the completeness and accuracy of such inferences, given the location of the measurement point. We have empirically validated our heuristics, under a wide range of conditions, in a wide-area test bed. We shall describe our experiments in more detail in Chapter 4. In brief, we found that our inference techniques were very accurate, with only about 1% of all out-of-sequence packets misclassified across all loss rates. The measurement point was also able to capture all out-of-sequence phenomena as long as the drop rate was less than 3%. However, at higher loss rates (around 5-6%), about 5% of all out-of-sequence packets were not observed at the measurement point. Through a manual examination of the traces we confirmed that any misclassified or missing out-of-sequence packets were due solely to one of the situations enumerated above. Though the validation outlined above lends confidence in the accuracy of our classification techniques, it still invites the question of whether the scenarios generated in the
limited confines of the experimental test bed were sufficiently exhaustive to test the techniques. Moreover, this concern takes on additional importance given the environment in which these techniques are to be deployed: passive measurement points that aggregate connections from a highly diverse set of source and destination points. The diversity and scale of the connections studied imply that our techniques should be comprehensively tested under many different scenarios. We address these concerns in Chapter 6, where, through a combination of model-checking and formal reasoning about protocol behavior, we uncover all possible cases in which retransmitted and reordered packets are not detected at the measurement point.
CHAPTER 3 INFERRING TCP CONNECTION CHARACTERISTICS
Given that TCP (Transmission Control Protocol) is the dominant end-to-end transport protocol deployed in the Internet, there is currently great interest in understanding TCP’s performance and in characterizing the factors (such as network congestion, delay, the sender’s window, sender/receiver buffer limits, and sender data-starvation) that can limit its behavior in practice. In the previous chapter, we presented a methodology to infer measures of the end-end path that affect a TCP connection’s performance. In this chapter we approach this issue from a different perspective: how to infer characteristics of a TCP sender that influence its performance. We present a passive measurement methodology that observes the sender-to-receiver and receiver-to-sender segments in a TCP connection, and infers and tracks the time evolution of two critical sender variables: the sender’s congestion window (cwnd) and the connection round-trip time (RTT). With knowledge of these two values, many important characteristics of the sender, receiver, and the network path that connects them can be determined. For example, by comparing cwnd with the amount of data actually sent, one can determine when a TCP connection is starved for application-level data (i.e., that the connection could support a higher transfer rate if more data were available); by carefully observing the manner in which cwnd changes in response to loss, one can identify non-conformant TCP senders, or the particular conformant flavor of TCP (e.g., Tahoe, Reno, NewReno); by monitoring RTTs, one can characterize RTT variability within and among flows, and determine the extent to which application-level adaptivity is needed to cope with variable
network delays. Also, as we observed in the previous chapter, knowledge of a connection’s RTT is needed to distinguish between the various causes of out-of-sequence packets. The remainder of this chapter is organized as follows. Section 3.1 discusses related work. In Sections 3.2 and 3.3 we present our methodology to keep track of the sender’s congestion window, to infer the TCP sender’s flavor, and to compute the RTT. In Section 3.4 we identify events that can introduce uncertainty into our estimates. Finally, we conclude in Section 3.5 by outlining the empirical validation of our techniques, which will be described in a subsequent chapter, and their application to TCP traces collected in a backbone network.
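One of the uses noted above, detecting an application-limited sender by comparing cwnd with the data actually in flight, amounts to a simple check. This is our own illustrative sketch; the one-MSS slack threshold is our assumption, not a value from the text.

```python
def application_limited(cwnd_bytes, bytes_in_flight, mss=1460):
    """A connection is plausibly starved for application-level data when
    the sender leaves a substantial part of its congestion window unused.
    The one-MSS slack is an illustrative threshold, not from the text."""
    return bytes_in_flight + mss <= cwnd_bytes

# A sender using 5 segments of a 20-segment window is application-limited;
# one filling its window is network- (or receiver-) limited.
assert application_limited(cwnd_bytes=20 * 1460, bytes_in_flight=5 * 1460)
assert not application_limited(cwnd_bytes=10 * 1460, bytes_in_flight=10 * 1460)
```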
3.1 Related work
Numerous measurement studies have investigated the characteristics of TCP connections in the Internet. Many of the early studies [43] either actively measured end-to-end properties (e.g., loss, delay, throughput) of TCP connections, or passively characterized a connection’s aggregate properties (e.g., size, duration, throughput) [50]. More recently, researchers have focused their attention on specific details of TCP implementations. [40] develops a tool to characterize the TCP behavior of remote web servers. Their approach involves actively sending requests to web servers and dropping strategically chosen response packets in order to observe the server’s response to loss. They observe the prevalence of various TCP implementations and the presence of TCP options such as SACK and ECN, and identify conformant/non-conformant congestion control behaviors. Our approach, by contrast, is passive and is thus able to easily characterize large numbers of TCP connections. In [42], the author had the ability to run tcpdump over a set of end hosts. The paper describes a tool, tcpanaly, to analyze these traces, and reports on the differences in behavior of 8 different TCP implementations. Methodologically, our work is similar, in the sense that both involve passive observation of the behavior of TCP connections and the use of heuristics to decide which flavor or implementation of TCP best matches the connection being observed. However, the scope of [42] is focused on highlighting the differences between various TCP implementation stacks. Since our measurement point is located in the middle of the end-end path, it is not possible for us to distinguish whether a particular sender behavior is due to events in the network or to end-system TCP implementation issues. Moreover, since we track several million highly diverse TCP connections, we do not concern ourselves with implementation-level details of the senders. Our main goal is to track the sender’s congestion window. We only seek to detect the cases in which our estimate of the congestion window may differ from that of the sender; we do not perform a detailed case-by-case analysis of the reasons for this difference. Also, in [42], the analyzed traces involve bulk file transfers; hence the author did not have to account for the effects of sender and application behavior. This aspect is discussed in some detail in our work. The location of the observation point also introduces methodological challenges in estimating a connection’s RTT (in contrast to measurements taken at the end hosts), and we propose a technique to address this issue in this work. Finally, our study is much larger in scale and more diverse. Another work of interest is [25], which presents a methodology to classify out-of-sequence packets, with the same measurement environment as in our current work. As discussed earlier, we use the methodology in [25] to identify packet retransmissions in our traces. [52] is the work that is perhaps most closely related to this present work. In [52], the authors passively monitor TCP connections and develop heuristics that are used to classify connections according to the factor(s) that limit their throughput.
A technique is also proposed to estimate RTT by selecting a value (from among a set of predetermined values) that most closely matches the observed packet flight dynamics. Our work differs from [52] in several important respects. Most importantly, our goal is not to study the rate-limiting factors of TCP, but more fundamentally to develop a methodology for estimating cwnd and RTT, arguably the two most important pieces of TCP sender state.
As we will see, knowledge of these values allows us to study many characteristics of TCP connections: the factors that limit a TCP connection’s throughput, the presence of non-conforming TCP senders, the extent to which various “flavors” of TCP are used in practice, and how often a newer version of TCP is able to exercise its enhanced capabilities (e.g., how often NewReno’s fast recovery mechanism is actually used in practice). In cases where our work overlaps with [52] (e.g., in determining the factors that limit a TCP connection’s throughput), a direct comparison is not currently possible, as the tools in [52] have not yet been released. We conjecture, however, that since the techniques in [52] and in this present work are quite complementary, their combined use will allow for even better classification of TCP behaviors than either tool alone. Several recent efforts have considered the problem of estimating the RTT of a connection using passive measurements [29, 35]. These works compute one RTT sample per TCP connection, either during the three-way handshake or during the slow-start phase. Our work extends these efforts in that it computes RTT estimates throughout a connection’s lifetime.
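The one-sample-per-connection handshake approach mentioned above can be illustrated at a mid-path measurement point as follows (a sketch under the assumption that the SYN, SYN/ACK and first ACK are all observed at the measurement point; function name is ours):

```python
def handshake_rtt(t_syn, t_synack, t_ack):
    """One RTT sample from the three-way handshake, seen mid-path.
    t_synack - t_syn covers the measurement-point-to-receiver half of
    the path; t_ack - t_synack covers the measurement-point-to-sender
    half. Their sum, t_ack - t_syn, is one full round-trip sample."""
    assert t_syn <= t_synack <= t_ack
    return t_ack - t_syn

# Example: SYN at 0.00 s, SYN/ACK at 0.03 s, ACK at 0.08 s -> 80 ms.
sample = handshake_rtt(0.00, 0.03, 0.08)
```

Computing samples throughout a connection's lifetime, as this chapter sets out to do, requires instead matching data segments to the ACKs they elicit, which is where the window-tracking machinery below comes in.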
3.2 Keeping track of the congestion window The basic idea behind tracking the sender’s variables is to construct a “replica” of the TCP sender’s state for each TCP connection observed at the measurement point. The replica takes the form of a finite state machine (FSM). The replica FSM updates its current estimate of the sender’s cwnd based on observed sender-to-receiver data segments and receiver-to-sender ACKs, which (if received at the sender) would cause the sender to change state. Transitions are also caused by detecting loss events at the sender (either through a timeout, or through TCP’s fast retransmit mechanism). These loss events will be manifested in the form of sender-to-receiver retransmissions, which are detected using the classification heuristics we outlined in Chapter 2. Estimating the state of a distant sender poses many challenges:
• In order to process large amounts of data (i.e., hundreds of GBytes), a replica can only perform limited processing and maintain minimal state. Our replica thus works in a “streaming” fashion; it can neither backtrack nor reverse previous state transitions.

• Given its position in the “middle” of an end-to-end path, a replica may not observe the same sequence of packets as the sender. ACKs observed at the measurement point may not reach the sender. Additionally, packets sent from the sender may be lost, and these loss events may not be detected at the measurement point.

• The manner in which cwnd is modified after packet loss is dictated by the flavor of the sender’s congestion control algorithm. We consider the three major flavors of TCP congestion control - Tahoe, Reno and NewReno¹ - and instantiate three different FSMs, one for each flavor. Based on the observed sender behavior in response to a loss, we decide which of these FSMs best represents the sender.

• Implementation details of the TCP sender, as well as the use of TCP options, are invisible to the replica².

All of the considerations above may introduce uncertainties into cwnd estimation, which we discuss in more detail in Section 3.4. Several variables must be initialized in the replica FSM. The sender’s initial congestion window size, icwnd, is the maximum number of bytes that a sender can transmit after completing the triple-handshake and before any ACKs arrive. The typical value of icwnd can be up to twice the maximum segment size [6].³ We estimate icwnd by keeping a count of the number of TCP segments observed before seeing a needed ACK. We also initialize

¹ There exist other implementations, such as TCP Vegas [14] and TCP Westwood [16], but we are not aware of any widely used OS stacks that implement these algorithms, hence we drop them from our study.
² Specifically, in our case the traces collected in Sprint’s backbone [22] that we use in our study contain only the first 44 bytes of each packet. Thus, for TCP packets we have access to only the first 4 bytes of the payload.
³ An experimental TCP specification allows icwnd to be as high as twice again this value [5].
the slow-start threshold (ssthresh) to an extremely large value, as is commonly done in TCP stacks [6].⁴ During normal operation, a TCP sender can be in slow-start or congestion avoidance; the arrival of a new ACK increases cwnd by 1 or by 1/cwnd, respectively. If the sender detects a loss via a timeout, it sets cwnd to 1 and ssthresh to max(min(awnd, cwnd)/2, 2), where awnd is the receiver’s advertised window. The more interesting case is when packet loss is detected via the receipt of three duplicate ACKs. This event brings out the differences between the three flavors under consideration:

Tahoe. A Tahoe sender reacts to the receipt of three duplicate ACKs with a so-called fast retransmit, behaving exactly as if the retransmission timeout had expired.

Reno. Reno TCP adds fast recovery to Tahoe’s fast retransmit algorithm [6]. Fast recovery works on the assumption that each duplicate ACK is an indication that another packet has successfully reached the receiver. The sender adjusts its cwnd to account for this fact: ssthresh is set to max(min(awnd, cwnd)/2, 2)⁵ and cwnd is set to ssthresh + 3. Thereafter, the sender increments cwnd by 1 for every additional duplicate ACK received. Once the sender receives a new ACK, it resets cwnd to ssthresh and exits fast recovery, returning to congestion avoidance.

NewReno. NewReno introduces a simple change to Reno’s fast recovery mechanism [21] by removing the need to detect a loss through a timeout when multiple losses occur within a single congestion window. The change occurs when the sender receives a new ACK while in the recovery phase. In NewReno the sender checks whether this is a partial new ACK, i.e., whether the ACK acknowledges only some of the packets sent before fast retransmit. If the ACK is partial, the sender immediately retransmits the packet requested by the receiver

⁴ In some TCP implementations the sender initializes the value of ssthresh from its route cache, an issue we discuss in Section 3.4.
⁵ RFC 2581 instructs that, after a loss, ssthresh should be set to max(flightsize/2, 2), where flightsize is the number of packets currently unacknowledged. We choose min(awnd, cwnd) instead of flightsize to follow the current implementation of TCP in Linux and FreeBSD.
and remains in fast recovery. This behavior ensures that the sender is able to retransmit a lost packet every RTT without invoking the timeout mechanism. The NewReno sender remains in this phase until it receives an ACK that acknowledges all packets outstanding before the recovery phase began, at which point it returns to congestion avoidance.
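The window-update rules above can be condensed into a small sketch (Python; the `state` dictionary and function names are our own illustrative shorthand rather than the actual replica FSMs of [26], and some fast-recovery details, such as inflating cwnd on each further duplicate ACK, are omitted):

```python
def on_new_ack(state):
    # Slow-start: cwnd grows by one segment per new ACK;
    # congestion avoidance: by 1/cwnd per new ACK.
    if state['cwnd'] < state['ssthresh']:
        state['cwnd'] += 1.0
    else:
        state['cwnd'] += 1.0 / state['cwnd']

def on_timeout(state, awnd):
    # Timeout: ssthresh = max(min(awnd, cwnd)/2, 2), cwnd back to 1.
    state['ssthresh'] = max(min(awnd, state['cwnd']) / 2.0, 2.0)
    state['cwnd'] = 1.0

def on_triple_dupack(state, awnd, flavor):
    # Fast retransmit; Reno and NewReno additionally enter fast recovery.
    state['ssthresh'] = max(min(awnd, state['cwnd']) / 2.0, 2.0)
    if flavor == 'tahoe':
        state['cwnd'] = 1.0        # behaves exactly as after a timeout
    else:                          # 'reno' or 'newreno'
        state['cwnd'] = state['ssthresh'] + 3
        state['in_recovery'] = True
```

For example, a Reno sender with cwnd = 8 (and a large awnd) that sees three duplicate ACKs ends up with ssthresh = 4 and cwnd = 7.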
3.2.1 TCP flavor identification

As noted above, the three flavors of TCP can respond differently to loss events. In order to determine the flavor of TCP implemented at a sender, we exploit the fact that a TCP sender can never have more outstanding unacknowledged (“in flight”) packets than its usable window size, which is the smaller of cwnd and the window advertised by the receiver. This forms the basis of our test to identify the sender’s flavor. For every data packet sent by the sender, we check whether this packet is allowed by the FSM’s current estimate of cwnd for each particular flavor. If the observed data packet is not allowed under a given flavor, it represents a “violation” for that flavor. We maintain a count of the number of such violations incurred by each candidate flavor. The sender’s flavor is inferred to be the flavor with the minimum number of violations, and, at any time during the life of a connection, the sender’s congestion window is the value of cwnd as estimated for this flavor. If no violations are observed for any TCP flavor, we say the flavors are indistinguishable. In Chapter 5 we quantify the extent to which the various flavors of TCP are observed in our traces.
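The violation count itself is simple bookkeeping; a sketch (Python), assuming each observed data packet has been pre-annotated with the number of packets in flight and with each candidate FSM’s usable-window estimate at that instant (a hypothetical representation — in the actual methodology these windows come from the per-flavor FSMs):

```python
def infer_flavor(events):
    """`events` is a list of (in_flight, windows) pairs, where `windows`
    maps each candidate flavor to that flavor FSM's current estimate of
    the sender's usable window (min of cwnd and awnd) when the data
    packet was observed."""
    flavors = events[0][1].keys()
    violations = {f: 0 for f in flavors}
    for in_flight, windows in events:
        for f in flavors:
            if in_flight > windows[f]:   # packet not allowed by this FSM
                violations[f] += 1
    if all(v == 0 for v in violations.values()):
        return 'indistinguishable', violations
    # Inferred flavor: the one with the fewest violations.
    return min(violations, key=violations.get), violations
```

For instance, a sender whose in-flight packet counts fit NewReno’s window estimates but exceed Tahoe’s and Reno’s would be classified as NewReno.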
3.2.1.1 Use of SACK and ECN
TCP sender behavior also depends on two options whose deployment is reported to be growing fast [40]: Selective Acknowledgments (SACK) [36] and Explicit Congestion Notification (ECN) [44]. The TCP SACK option allows recovery from multiple losses in a single congestion window without a timeout. SACK, by itself, does not change the congestion window dynamics of the sender, i.e., it only helps in deciding what to send, not when to send it. Our
measurement points do not have access to SACK blocks. In some cases it is possible to detect the presence of SACK blocks and/or infer the use of SACK information during fast recovery; detecting the use of SACK information is part of our ongoing work. An ECN-capable sender explicitly notifies the receiver of every reduction of the congestion window, for any reason (fast retransmit, retransmission timeout, ECN-echo flagged acknowledgment). The measurement point could thus estimate the congestion window of the sender just by looking at the ECN bits in the TCP header. Unfortunately, ECN is still not widely deployed by end-hosts: in the packet traces we studied, only 0.14% of the connections were ECN-aware. We have presented an overview of our approach to keeping track of the sender’s congestion window; a description of the finite state machines for each TCP flavor tracked by the measurement point, and details of the heuristics used to infer a TCP sender’s congestion control flavor, are presented in [26].
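As an aside on the ECN observation above: the relevant bits live in the main TCP header, well within the captured 44 bytes. A minimal sketch (Python, assuming a raw TCP header as bytes):

```python
def ecn_flags(tcp_header: bytes):
    """Return the ECN-related TCP flags. Byte 13 of the TCP header holds
    CWR (0x80) and ECE (0x40) alongside the classic flags, so even a
    44-byte packet snapshot is enough to read them."""
    flags = tcp_header[13]
    return {'cwr': bool(flags & 0x80), 'ece': bool(flags & 0x40)}
```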
3.3 Round-trip time estimation

In this section we describe a technique to compute RTT samples throughout the lifetime of a TCP connection, leveraging the cwnd estimation techniques from the previous section. The implementation of this method is simple and in some cases results in as many RTT samples as would be computed by the actual TCP sender. We only provide a brief overview of our technique here; once again, the reader is referred to [26] for a more detailed description. The basic idea behind our RTT estimation technique is illustrated in Figure 3.1. Since we are not able to directly measure the sender’s RTT sample shown in the left of Figure 3.1, we instead measure (i) the round trip delay from the measurement point to the receiver and then back to the measurement point (labeled d1 in the figure), and (ii) the round trip delay from the measurement point to the sender and then back to the measurement point
Figure 3.1. TCP running sample based RTT estimation
(labeled d2 in the figure). The sum of these two delays, d1 + d2, as shown in Figure 3.1, is our estimate of the RTT. As illustrated in the figure, d1 is simply the difference between the times when a data packet and, subsequently, the ACK covering this data packet are observed at the measurement point. Computing d2, however, requires the ability to determine which data packet transmissions are triggered by the arrival of a particular ACK at the sender. This requires an accurate estimate of the sender’s cwnd, and it is here that our techniques from the previous section come into play. We refer to our method as a running RTT estimation technique, since it continuously makes RTT estimates, based on the measured values of d1 and d2, throughout the TCP connection’s lifetime. An important aspect of our technique is its ability to stop (and restart) the RTT estimation as a sender recovers from a loss, in order to closely emulate the behavior of the actual TCP sender (which does not compute the RTT during loss recovery). In order to do this, our technique relies on knowledge of the state of the TCP connection - whether the sender is recovering from a loss or not. We refer the reader to Appendix A for further details and pseudo-code of the RTT estimation technique. Let us now consider some of the factors which affect the accuracy of these RTT estimates.

• Variable delays along the end-end path: Figure 3.2 illustrates that the RTT observed at the measurement point and the RTT observed at the sender can differ.

Figure 3.2. Differences in RTT estimation

The sender transmits a first packet at time s1, and the packet is observed at the measurement point at time s1 + d1. Now, suppose the sender sends a second packet (as the result of having received an acknowledgment for the first packet) at time s2. This second packet arrives at our measurement point at time s2 + d2. In this case, the measured RTT at the sender is s2 − s1, while at our measurement point we compute an RTT of (s2 − s1) + (d2 − d1). Depending on whether d2 is greater or smaller than d1, the observed RTT will either overestimate or underestimate the sender-observed RTT.

• End-host delays: A related problem is that we cannot estimate the delays within the end hosts themselves, between the receipt of an acknowledgment and the transmission of the data packet triggered as a result of the ACK receipt. However, delays in the receiver’s operating system are an inherent component of the TCP sender-measured RTT as well.

• Errors in cwnd: RTT estimation is directly affected by estimation inaccuracies in cwnd. Our methodology needs to know cwnd in order to identify the data packet that is triggered by the receipt of the ACK used in the estimation. Therefore, an over- (under-) estimation of cwnd will result in an over- (under-) estimation of the RTT. In the subsequent section we discuss the events which result in an error in the estimation of the sender’s state and congestion window value.
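The d1 + d2 pairing can be sketched as follows (Python; the event tuples are a hypothetical pre-resolved representation — in particular, deciding which data packet was triggered by which ACK is exactly where the cwnd tracking of Section 3.2 comes in):

```python
def running_rtt_samples(events):
    """`events` is a time-ordered list of (t, kind, ref) tuples seen at
    the measurement point: a 'data' event carries a segment id, an 'ack'
    event the id of the data segment it covers, and a 'trig' event the id
    of the data segment whose ACK triggered this new transmission."""
    sent, acked = {}, {}
    samples = []
    for t, kind, ref in events:
        if kind == 'data':
            sent[ref] = t                      # data segment passes the monitor
        elif kind == 'ack' and ref in sent:
            acked[ref] = t                     # d1 = ack time - data time
        elif kind == 'trig' and ref in acked:
            d1 = acked[ref] - sent[ref]
            d2 = t - acked[ref]                # d2 = triggered-data time - ack time
            samples.append(d1 + d2)            # inferred RTT sample
    return samples
```

So a data segment seen at t = 0, its covering ACK at t = 3, and the data packet that ACK triggers at t = 5 yield an inferred RTT of d1 + d2 = 3 + 2 = 5.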
3.4 Sources of estimation uncertainty
The idea behind replicating sender state and tracking the various TCP flavors is not complicated. However, the measurement point has only partial information about the TCP connections it observes. Given its location in the middle of the path, it may not observe the same events the sender does, and vice versa. Moreover, it assumes complete knowledge of TCP stack implementations, which may instead exhibit subtle (or malicious) differences [40, 42]. These two characteristics of our measurement methodology can introduce uncertainties into our estimate of the sender’s state. A significant aspect of our work is to understand how these issues impact our approach (or any passive measurement approach in the middle of the end-end path). In order to address these issues, we start by identifying events in the network that can result in an ambiguous or erroneous state at the measurement point. Such events will lead to an error in the measurement point’s estimate of the sender variables, and can potentially impact the accuracy of the RTT estimation mechanism. We discuss how our methodology can detect such events, and we also provide an empirical quantitative upper bound on their frequency of occurrence, as observed in the traces collected in the Sprint backbone, to which we shall apply these techniques.
3.4.1 Under-estimation of cwnd
The measurement point may under-estimate cwnd if it observes three duplicate ACKs that never reach the sender. According to our methodology, the measurement point would infer that the sender will undergo a fast retransmit and reduce cwnd and ssthresh accordingly. The sender will eventually timeout and then retransmit the packet. At this point the measurement point will detect the timeout and reduce the ssthresh twice (once for the fast retransmit and then as part of the timeout) while the sender would do so only once. Note that even if the measurement point detects the timeout, it cannot later reverse its decision. Indeed, the measurement point would observe the same sequence of packets if
the third duplicate ACK is lost (ssthresh is modified only once) or if the packet sent due to fast retransmit is lost (ssthresh is modified twice). On the other hand, detection of these cases is relatively simple. A greedy sender will transmit more packets than the estimated cwnd would permit, and would then trigger violations in all the TCP flavor FSMs. In this way, the measurement point can identify the connections for which the congestion window is uncertain. In order to quantify the frequency of these events, we counted the number of senders that incur violations in all flavors in the packet traces we study. Only 0.01% of all senders show this behavior. If we focus on senders with more than 13 packets to send,⁶ this percentage goes up to 5%. This is expected, given that the more packets a sender transmits, the higher the likelihood that it will incur the loss scenario described above.
3.4.2 Over-estimation of cwnd

Two events will lead to an over-estimate of the congestion window size.

Acknowledgments lost after the measurement point. Every ACK for new data observed at the measurement point causes an increment of cwnd by as much as one (depending on whether the sender is in slow-start or congestion avoidance). If the ACK is lost before reaching the sender, there will not be a corresponding increment at the sender, resulting in an over-estimation of the sender’s congestion window. Moreover, during fast recovery, any extra duplicate ACKs observed at the measurement point cause cwnd to be increased by one. Again, if such ACKs are lost after they are observed at the measurement point but before they reach the sender, the measurement point will over-estimate the sender’s cwnd.

Window of data packets lost before the measurement point. An entire window of packets transmitted by the sender and dropped before the measurement point will remain
⁶ Senders with at least 13 packets may experience a fast retransmit: a sender with an initial window of 2 packets, in the presence of delayed ACKs, would send 13 packets in 4 flights of 2, 3, 3, and 5 packets, respectively.
undetected. This is because, in order to detect a loss, the measurement point needs to observe either packets from the receiver related to the packet drops (in the form of duplicate ACKs) or earlier transmissions of the data packets. In this case the sender will timeout and update the values of ssthresh and cwnd, while the measurement point will maintain the old values. The loss of the initial SYN packet is a special case of this event that will make the measurement point over-estimate ssthresh (set by the sender to 2), and thus incorrectly consider the sender to be in slow-start for longer than it actually is. The detection of over-estimation events is particularly difficult. In fact, a sender for which the measurement point has a larger estimate of cwnd would simply appear as a “not-greedy” sender, i.e., one with a rate limited by the application or by the kernel sending buffer size. In Chapter 5 we consider how to detect such senders; our traces show that around 30% of the senders are not-greedy at some point in their lifetime.
3.4.3 Window scaling

The sender window is the minimum of cwnd and the receiver-advertised window, awnd. Unfortunately, we collect only the first 44 bytes of each packet and thus cannot track the advertised window for all connections. Indeed, for those connections that use the window scale option [24], capturing only the first 44 bytes hides the scale factor and makes it impossible to know the exact value of the receiver window. In order to estimate how often this problem occurs, we first count the number of connections that could be using the window scale option; then, we count the connections for which cwnd could exceed awnd. The window scaling standard [24] requires the end-hosts to negotiate the use of this option during the initial connection setup. Therefore, the sizes of the SYN and SYN+ACK packets can be used to infer whether those packets might include the window scale option (which consumes 3 bytes).
For these connections, given that the window scale option can only “scale up” the window size, we have at least a lower bound on the size of the receiver window. Therefore, we can identify all the connections for which we are uncertain of the sender window as those in which cwnd reaches the lower bound of awnd. In the packet traces under study, our measurement point is uncertain of the sender window, at some point in the connection’s lifetime, for approximately 2% of the connections. In general, even for these connections, our methodology allows us to correctly track cwnd as long as it stays below the lower bound on awnd. When cwnd exceeds this lower bound we may underestimate cwnd, as discussed previously.
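The size test described above might look as follows (a sketch; `syn_header_len` and `synack_header_len` are TCP header lengths taken from the data-offset fields of the captured handshake packets — names of our own choosing):

```python
def may_use_window_scale(syn_header_len, synack_header_len):
    """The window scale option must appear in both the SYN and the SYN+ACK
    to take effect [24]. With only 44 bytes captured per packet we cannot
    read the option itself, but the TCP data offset reveals how many
    option bytes each handshake packet carries: if either side has fewer
    than the 3 bytes the option consumes, window scaling cannot be in
    use. (A size test like this is necessary but not sufficient,
    hence "may".)"""
    return (syn_header_len - 20 >= 3) and (synack_header_len - 20 >= 3)
```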
3.4.4 Incorrect implementations

Several previous works [40, 42] have uncovered bugs in the TCP implementations of various OS stacks. For example, using the TBIT tool, the authors of [40] found that as many as 8% of the web servers probed used a version of Reno that behaves more aggressively than Reno during the recovery phase, and about 3% of web servers did not cut down their window after a loss. These implementation issues affect our ability to track the congestion window and RTT of a TCP sender. It is impractical to detect and track all possible TCP implementations. On the other hand, we can use the same techniques described above for identifying inaccuracies in the cwnd estimate to detect non-compliant TCP implementations and interrupt our RTT and cwnd estimation. For example, a sender that does not react to the receipt of three duplicate ACKs will later violate all TCP flavors, and this non-conformant behavior will be detected by our measurement point.
3.5 Conclusions

We have presented a passive measurement methodology that observes the segments in a TCP connection and infers/tracks the time evolution of two critical sender variables: the
sender’s congestion window (cwnd) and the connection’s round-trip time (RTT). If the connection suffers a loss, our methodology also detects the flavor of the congestion control algorithm implemented by the sender. We have also identified the difficulties involved in tracking the state of a distant sender, and described the network events that, given the location of the measurement point, may introduce uncertainty into our estimation. In order to cope with this uncertainty, we have outlined ways for the measurement point to detect whether its estimates of the sender’s variables are in error, thus preventing the error from impacting other inferences. Our techniques have been validated using simulations and experiments over the Internet, which are described in detail in Chapter 4. In summary, in both the simulations and the real-world experiments, the error in the cwnd estimate is less than 5-15% for almost 95% of the senders (depending on the loss rate). In absolute terms, for more than 90% of the senders, the error is less than 0.5 packets. The RTT errors are also small: 95% of the senders have an error of less than 15%, with the error never exceeding 25%. Since our inference techniques assume the measurement point can be anywhere between the sender and the receiver, we can apply our methodology to packet traces collected in the middle of large networks. In Chapter 5, we apply our methodology to several million TCP connections gathered within the Sprint IP backbone. We present results on the distributions of congestion window sizes and RTTs in the observed TCP connections. We also examine the factors that limit the throughput of TCP senders, and the impact of TCP congestion control flavor on the effectiveness of a TCP sender’s recovery from loss.
CHAPTER 4

VALIDATION
In the previous two chapters we described a passive-measurement-based methodology to compute performance measures of a TCP connection’s end-end path and characteristics of the TCP sender itself. In this chapter we describe our efforts to validate these techniques. We begin by describing results from ns-based simulations, and then describe an empirical validation study over a wide-area testbed.
4.1 Simulations
In our simulations we used the ns simulator with a typical “dumbbell” topology containing a single bottleneck link. The duration of the simulation was 300 seconds. We generated long-lived flows (i.e., their lifetime was the entire duration of the simulation) for analysis, and cross traffic consisting of 5,700 short-lived flows (40 packets) with arrival times uniformly distributed throughout the simulation. We ran sets of experiments with the bottleneck link located between the sender and the measurement node, and with the bottleneck located after the measurement point. We also ran simulations with different parameters for the bottleneck link, varying the bandwidth, buffer size and propagation delay. The average loss rate in the various scenarios varied from 2% to 4%. For each sender we compare the RTT estimates obtained with our methodology to the values computed by the ns sender. Estimated cwnd values are compared to those maintained at the sender every time the sender’s window changes. Given a series of RTT estimates and observations, we compute the average of the relative errors for each sender. Figure 4.1 plots the cumulative distribution of the mean relative errors for RTT and cwnd.
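The per-sender error metric used for these comparisons (and again in Section 4.2) is the mean of the relative errors between the ground-truth values logged at the sender and the measurement point’s estimates; a minimal sketch, with hypothetical sample lists:

```python
def mean_relative_error(sender_values, estimated_values):
    """Mean relative error between per-event sender values (ground truth,
    e.g. cwnd or RTT logged by the ns sender) and the measurement
    point's estimates: (1/n) * sum_i |s_i - m_i| / s_i."""
    assert len(sender_values) == len(estimated_values)
    n = len(sender_values)
    return sum(abs(s - m) / s
               for s, m in zip(sender_values, estimated_values)) / n
```

For example, sender RTTs of 100 ms and 200 ms estimated as 110 ms and 190 ms give relative errors of 0.10 and 0.05, for a mean relative error of 7.5%.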
Figure 4.1. Mean relative error of cwnd and RTT estimates in the simulations
The estimated error for cwnd is shown to be less than 5% for more than 95% of the senders. Only one sender out of the 280 under study incurred an error larger than 20%. In absolute terms, 265 senders have an average cwnd error of less than 0.5 packets, while only one had an error of more than one packet. The RTT estimation error is less than 10% for 90% of the senders, and never exceeds 25% of the actual value. Simulations also provide insight into how accurately our methodology identifies the TCP flavor. Out of the 280 senders, the TCP flavor of 271 senders was identified correctly. Of the remaining senders, four either had zero violations for all flavors (i.e., they did not suffer a specific loss scenario that allows us to distinguish among the flavors) or had an equal number of violations in more than one flavor (including the correct one). We also misclassified 5 connections. This occurs when the FSM corresponding to the TCP sender’s actual flavor underestimates the sender’s congestion window, for example as in the scenario described in Chapter 3. This FSM may then experience more violations in its estimate of the sender’s cwnd, leading to an incorrect identification of the sender’s flavor. We also observed that, by increasing the duration of the simulations, we were able to increase the number of connections for which we correctly identified the flavor. This is expected, since the more packets a sender transmits, the higher the probability that the sender will exhibit the behavior peculiar to its flavor.
Figure 4.2. Validation experiments setup
4.2 Experiments over the network
We now validate our techniques through measurements done in a controlled testbed. Our setup (illustrated in Figure 4.2) consists of PCs running the FreeBSD 4.3 and 4.7 operating systems, with a modified kernel that exports the connection variables to user space using the sysctl facility. The PCs were located at the University of Massachusetts in Amherst, MA and at Sprint ATL in Burlingame, CA, labeled “Host A” and “Host B” in Fig. 4.2. Traffic between these two sites passed through an OC-3 access link in Stockton, CA, which was also passively monitored by an IPMON [22] system. The IPMON monitor was our passive measurement site between the sender and the receiver. Our experiments consisted of setting up TCP connections (divided between Reno and NewReno flavors) carrying out bulk data transfers. In order to study the accuracy of our techniques under varying packet loss and reordering rates, we used the dummynet emulator [45] at different points in the end-end path to induce loss and reordering events. To generate more packet drops we configured a dummynet “pipe” as a congested bottleneck link. In order to reorder packets we configured two dummynet pipes with variable delays; packets from a flow were probabilistically assigned to traverse either of these two pipes. This emulates multi-path routing, which results in packets getting reordered. We varied the dummynet configuration to create 3 sets of experiments: one with no packet drops, one with approximately a 3% drop rate, and a third with a 5% drop rate. All of these experiments had between 1-1.5% of packets sent through the alternate dummynet pipe to induce reordering. We also placed the emulated
Figure 4.3. Cumulative error of RTT inference scheme

Figure 4.4. Cumulative error in cwnd estimation
bottleneck link either between the sender and the measurement point, or after the measurement point and before the receiver. We also carried out experiments with actual packet drops along the end-end path between Amherst and Burlingame. We ran a set of experiments at different times of the day, over several days, with bulk transfers lasting between 10 seconds and 5 minutes. However, we observed very low loss rates (rarely exceeding 0.1%) for these connections. We therefore ran another set of experiments with an end-host behind a wireless hop, which resulted in packet drop rates of around 4%; discussions of these experiments are included below. Figure 4.4 plots the mean relative error for cwnd. The error in the cwnd estimate is less than 20% for 90-100% of the senders. In absolute terms, again, between
90-100% of the senders exhibit an error of less than 0.5 packets. This level of error is likely due to rounding differences between the measurement point and the TCP stacks in the kernels under study. We manually checked the causes whenever there was a discrepancy between the measurement point’s and the sender’s value of the congestion window, and verified that the discrepancy was always due to one of the events discussed above, resulting in an over- or under-estimate of the sender’s window. As noted above, it is possible for the measurement point to detect this error. If we observe that the sender sends more data packets than predicted by the congestion window estimate, it is an indication that the measurement point has under-estimated the congestion window size. Similarly, as we will see in more detail in Chapter 5, by comparing the number of packets sent by a sender in a particular round to what is allowed by the current window value, we can also identify non-greedy senders, or instances of an over-estimation of the window. In either case, since the estimate of the connection’s RTT depends on an accurate estimate of cwnd, we stop taking RTT estimates when we detect that the sender’s behavior is not in accordance with our estimate of cwnd. Let us now look at the performance of our RTT estimation technique. For each sender we compare the RTT samples estimated using our technique at the IPMON measurement point with the values computed by the sender. Let p_i be the latest packet observed at the measurement point, and let s_{p_i} and m_{p_i} be the most recent RTT sample (before the arrival of p_i) taken at the TCP sender and computed using the running estimation technique at the measurement point, respectively. Given this series of estimates and sender values, we compute the average of the relative errors for each sender as (1/n) Σ_{i=1}^{n} (|s_{p_i} − m_{p_i}| / s_{p_i}), where n is
the total number of packets observed at the measurement point. Figure 4.3 plots the cumulative distribution of the mean relative errors for RTT. The relative error in RTT, for drop rates less than 3%, is less than 15% for 90% of the senders, and for a loss rate of around 5%, the error is less than 20% for 90% of the senders. The magnitude of this error is sufficiently small for us to accurately distinguish between retransmissions and reorderings (based on our assumptions about network behavior), since the value of a TCP retransmission timer is usually several times larger than the connection’s RTT, and reordering is assumed to occur at timescales smaller than the RTT of the connection. Finally, we observe that our estimates of the RTTs in the presence of packet drops on the wireless hop are also very accurate, with errors never exceeding 20%. Even though the packet drop rate in this set of experiments is fairly high (around 4%), there was very little variability in the RTTs over the lifetime of the connections (since there was little congestion along the end-end path), which contributes to the accuracy of our estimates.

We have also computed the mean RTT error using an alternate metric: we now compare RTT samples computed by the measurement point only with the corresponding samples computed by the sender. We match sender and measurement point samples using the sequence numbers of the packets that constitute a particular sample. Let {s_1, s_2, ..., s_n} and {m_1, m_2, ..., m_n} be the corresponding RTT samples taken by a TCP sender and those computed using the running estimation technique at the measurement point. Given this series of estimated and sender values, we compute the average of the relative errors for each sender as (1/n) Σ_{i=1}^{n} (|s_i − m_i| / s_i), where n is the number of matched samples.

Figure 4.5. Cumulative error of RTT inference scheme - different metric

Figure 4.5 plots the cumulative distribution of the mean relative errors for RTT with this metric. Even for traces with a drop rate of around 5%, the relative error in RTT is less than 10-15% for nearly 95% of the senders. And once again, we observe that
Loss rate   Data packets   Classification   Out-of-sequence   Retransmissions   Reorderings    Unknowns
~0%         15,143         Inferred         204 (1.35%)       93 (45.59%)       111 (54.41%)   0 (0.00%)
                           Actual           204 (1.35%)       91 (44.61%)       113 (55.39%)   0 (0.00%)
2-3%        17,411         Inferred         618 (3.55%)       487 (78.80%)      131 (21.20%)   0 (0.00%)
                           Actual           628 (3.61%)       494 (78.66%)      134 (21.34%)   0 (0.00%)
~5%         17,080         Inferred         1,111 (6.50%)     1,013 (91.18%)    97 (8.73%)     0 (0.00%)
                           Actual           1,169 (6.84%)     1,061 (90.76%)    108 (9.24%)    0 (0.00%)

Table 4.1. Out-of-sequence packets inference and classification with dummynet induced losses
our estimates of the RTT in the presence of packet drops in the wireless hop are also very accurate, with the error never exceeding 15%. We examined the flows with inaccurate RTT estimates and found that, in some cases, the error in the RTT estimates was a consequence of errors in the cwnd estimates: the errors in the cwnd estimates were detected (by comparing with sender behavior) only after a delay of more than one RTT, allowing an inaccurate RTT estimate to be computed. The other source of error stems from our assumption that a data packet is transmitted immediately upon the arrival of the ACK at the end host; in practice there can be delays within the end host between the receipt of an acknowledgment and the transmission of the data packet it triggers.
Given that we have a fairly accurate estimate of the TCP connection’s RTT, we now turn our attention to the accuracy of the out-of-sequence classification methodology. In order to carry out the validation, we collect packet traces using tcpdump at the end host where the TCP sender resides, and process the sender-end tcpdump traces using tcptrace, a tool which can recreate TCP flow-level information from tcpdump traces collected at end hosts. We start with the classification results from the validation experiments using only dummynet to induce loss and reordering, as presented in Table 4.1¹. We observe that we are able to identify nearly all out-of-sequence packets when the packet drop rate is less than 1%. However, at a higher drop rate of around 3%, the measurement point does not observe
¹ Even in the case when there were no packet drops, there exist packet retransmissions. This is due to the reordering induced on the end-end path: if the extent of the reordering is such that it triggers more than 3 duplicate ACKs from the receiver, it invokes TCP’s fast retransmit mechanism in the sender, causing a retransmission.
Loss rate   Data packets   Classification   Out-of-sequence   Retransmissions
~4%         11,205         Inferred         147 (1.31%)       147 (100%)
                           Actual           155 (1.38%)       155 (100%)

(a) With no reordering

Loss rate   Data packets   Classification   Out-of-sequence   Retransmissions   Reorderings    Unknowns
4-5%        17,454         Inferred         1,119 (6.41%)     835 (74.62%)      284 (25.38%)   0 (0.00%)
                           Actual           1,135 (6.50%)     847 (74.63%)      288 (25.37%)   0 (0.00%)

(b) With dummynet induced reordering

Table 4.2. Out-of-sequence packet inference and classification with losses at wireless hop
10 out of 628 out-of-sequence packets (1.6% of all out-of-sequence packets), and at a drop rate of 5% it does not observe 58 out of the 1,169 packets that were missequenced along the end-end path (5% of all out-of-sequence packets). We have examined the traces and confirmed, as discussed in Section 3.4, that this happens when an entire window of packets is lost before the measurement point, or when a packet that has been reordered is lost before it is observed at the measurement point. We now look at the frequency with which we misclassify out-of-sequence packets. We observe that two out of 204, three out of 628 and five out of 1,111 out-of-sequence packets are misclassified, across loss rates of 0%, 3% and 5% respectively. This occurs when the time lag of a reordered packet is on the order of the connection RTT and the packet crosses the measurement point after three duplicate ACKs (for the reordered packet) were observed at the measurement point. In this case, our methodology considers the packet to be a retransmission. We now examine the accuracy of the classification scheme in the presence of packet loss along the end-end path, with no dummynet between the end hosts. Almost all packet drops in this experiment were at the wireless last hop. In our first set of experiments, we observed that out of 155 out-of-sequence packets, 147 were detected at the measurement point, and all were correctly classified as retransmissions. Once again, we confirmed that the missing out-of-sequence packets were due to the loss of an entire window of packets before the measurement point. Since we observed no reordering along the end-end path in this experiment, we conducted additional experiments in which we induced reordering in
about 1% of all packets using dummynet. Combining the results from these experiments, we observe that we are also able to classify the out-of-sequence packets in this experiment with a high degree of accuracy: only 3 out of 1,119 observed out-of-sequence packets were misclassified, and we confirmed this was again because the time scale of the reordering event was longer than an RTT, and triggered more than 3 duplicate ACKs from the receiver, as discussed earlier in the section.
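The decision rules exercised in these experiments can be summarized in a simplified sketch: retransmissions are separated from reorderings by comparing the observed time lag against the RTT estimate and by counting duplicate ACKs. This is a stripped-down illustration of the idea only; the actual methodology tracks considerably more sender state:

```python
def classify_out_of_sequence(seen_before, time_lag, rtt, dup_acks_seen):
    """Sketch of the rules separating retransmissions from reorderings.

    seen_before:   a packet with this sequence number already crossed
                   the measurement point (a candidate retransmission).
    time_lag:      time since the sequence-number hole appeared.
    rtt:           running RTT estimate for the connection.
    dup_acks_seen: duplicate ACKs observed for this sequence number.
    """
    if seen_before:
        # Retransmissions fire on a timeout (several RTTs) or on fast
        # retransmit, after three duplicate ACKs reach the sender.
        if time_lag > rtt or dup_acks_seen >= 3:
            return "retransmission"
        return "unknown"
    # Reordering is assumed to occur at timescales below the RTT.
    if time_lag < rtt:
        return "reordering"
    return "unknown"
```

The misclassification discussed above corresponds to the first branch: a reordered packet whose time lag reaches the RTT, arriving after three duplicate ACKs, is indistinguishable from a fast retransmission under these rules.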
4.3 Conclusions
In summary, our simulations and live Internet experiments have demonstrated the accuracy of our methodology for tracking cwnd and RTT. The relative errors were found to be small, often on the order of (and possibly due to) rounding errors or clock tick precision. Also, our classification scheme classifies out-of-sequence packets accurately, with only 1% of all observed out-of-sequence packets misclassified, across different loss rates. In the next chapter, we apply our techniques to analyze TCP sender and end-end performance through packet traces collected in the Sprint Tier-1 backbone network.
CHAPTER 5
APPLICATION OF INFERENCE TECHNIQUES TO TRACES FROM A TIER-1 BACKBONE
We now apply our techniques to packet-level traces collected as part of the IPMON project within the Sprint IP backbone. First, we quantify the occurrence of out-of-sequence phenomena, and their causes, in TCP connections going over the Sprint backbone. Then we examine several questions about the characteristics of TCP senders in these traces. We present results on the distributions of congestion window sizes and RTTs in the observed TCP connections. We infer how often different flavors of TCP are observed in practice, and also construct heuristics to ascertain whether a TCP sender is greedy or limited by the application, and quantify the number of non-greedy senders in our traces.
5.1 Measurement Infrastructure

Our measurements were obtained using infrastructure developed as part of the Sprint IP Monitoring (IPMON) project [22]. The IPMON measurement system provides packet-level traces from OC-3, OC-12 and OC-48 links in several Points-of-Presence (POPs) in the Sprint backbone. The measurement systems themselves are connected via an optical splitter to the links under study, so that all packets traversing a link are passed on to the monitoring equipment. A packet capture card copies all the packet headers to disk, along with a timestamp generated by an accurate GPS-synchronized clock.
5.1.1 Measurement data
We present the results of the classification analysis over a set of four packet traces collected on November 21st, 2002. We have observed similar trends to those reported below in many other traces collected on different dates and over different links (see [25] for results on a different set of packet traces).

                              CDN                Tier-1 ISP         Intra-POP          East Coast
Link speed                    OC-12 (622 Mbps)   OC-12 (622 Mbps)   OC-48 (2.5 Gbps)   OC-48 (2.5 Gbps)
Duration (hours)              6                  6                  1                  2
Unique source ASes            4,752              520                4,837              1,561
TCP connections               18M                4M                 6.3M               1.7M
Percentage of all TCP connx   99.67%             12.20%             25.53%             9.20%
TCP data packets              335M               53M                74M                40M

Table 5.1. Summary of the traces used for OoS classification

We identify the traces under study as “CDN”, “Tier-1 ISP”, “East Coast” and “Intra-POP”, depending on the nature of the customer that contributes the most data traffic to that particular trace, or on the location in the backbone network where the trace was collected. “CDN” refers to a content distribution network hosting web servers that (given the design of the CDN) should receive requests primarily from clients that are relatively “close” to the servers¹. “Tier-1 ISP” refers to a link connecting to another service provider that carries traffic coming from (and destined to) a very large and diverse set of end hosts. The last two traces refer to OC-48 (2.5 Gbps) links on the East Coast of the U.S. (named East Coast) and inside the New York PoP (Intra-POP). Table 5.1 summarizes the characteristics of the four traces. We believe these traces provide a highly representative sample of Internet traffic, with the connections originating in a significant percentage of the total number of ASes in the Internet. We retrieved the BGP table from Sprint backbone routers during the trace collection, and used the AS path information to derive the source and destination ASes of the TCP connections. Overall, the traces contain TCP connections originating in 7,664 unique Autonomous Systems². This represents about 60% of the total number of allocated ASes at the time of the trace collection (we counted 12,822 unique Internet AS numbers as of November 21st, 2002).
¹ The “vicinity” of a client may depend on the number of router hops, on the round-trip time, or on the length of the AS path from the source to the destination.
² The sum of the unique AS numbers from Table 5.1 is higher (11,670) because the same AS may be present in more than one trace.
[Figure: per-trace histograms of observed TTL values for the CDN, Tier-1, Intra-POP and East Coast traces; x-axis: time-to-live (32 to 256), y-axis: frequency (%)]
Figure 5.1. Distribution of time-to-live values
Another interesting characteristic of our traces is the relative position of the collection point in the end-end path between the source and the destination. In order to explore this further, we used the time-to-live (TTL) values present in the IP packets to determine the distance (in terms of the number of router hops) from the two end hosts to the measurement point. Figure 5.1 shows the observed distribution of TTL values over all connections in the four traces³. We note that TTL values tend to aggregate around a few well-known values that are commonly used in end-host systems: 32, 64, 128 and 256. Using this fact, we can easily derive the initial TTL value set by any given source, and then infer the distance between that host and our measurement point. Table 5.2 shows the average distance from the two hosts to the measurement point (da and db) in terms of the number of router-level hops, as well as the average number of AS-level hops to the end hosts. A distance of zero AS hops means that the traffic is sunk into one of Sprint’s customer networks that does not have a separate AS number. We also computed the average value of the ratio R = min(da, db)/(da + db). A value close to 0.5 implies that our monitoring point is, on average, close to the midpoint of the path.
³ For the purpose of our analysis, we consider the TTL of a connection to be the TTL of its first data packet.
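The TTL-based distance inference described above can be sketched as follows. This is a simplified illustration, not the actual analysis code; we use 255 rather than 256 as the highest initial value, since the IPv4 TTL field cannot exceed 255:

```python
# Common initial TTL values set by end-host operating systems.
INITIAL_TTLS = (32, 64, 128, 255)

def hops_to_measurement_point(observed_ttl):
    """Infer router-hop distance from the observed TTL of a data packet.

    The sender's initial TTL is taken to be the smallest well-known
    value >= the observed TTL; the distance is the difference.
    """
    for initial in INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial - observed_ttl
    raise ValueError("IPv4 TTL cannot exceed 255")

def midpoint_ratio(da, db):
    """R = min(da, db) / (da + db); a value near 0.5 means the monitor
    sits close to the middle of the path."""
    return min(da, db) / (da + db)
```

For example, an observed TTL of 118 implies an initial TTL of 128 and a distance of 10 router hops from the sender to the monitor.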
                                              CDN            Tier-1 ISP     Intra-POP      East Coast
Avg. router hops from end-hosts (da ; db)     11.57 ; 2.00   7.14 ; 8.36    4.82 ; 10.04   9.22 ; 6.49
Average min(da, db)/(da + db)                 0.08           0.41           0.31           0.37
Avg. AS hops to the end-hosts                 1.72 ; 0.93    0.88 ; 1.44    0.86 ; 1.56    1.70 ; 0.98

Table 5.2. Average distances from our measurement point
As expected, the CDN’s servers are located very close to the backbone: they are only two hops away on average! On the other hand, the Tier-1 and the OC-48 links carry traffic destined to a very diverse set of end hosts; our monitoring point is, on average, close to the midpoint of the paths of those TCP connections. The average distance between client and server is around 15 router hops in these traces. For connections destined to the CDN’s servers, the average distance reduces to 13 hops⁴.
5.1.2 Out-of-sequence packets
Table 5.3 shows the classification results for the observed out-of-sequence packets in the four traces. The first two rows of Table 5.3 indicate the total number of data packets observed, and the absolute number and percentage of these packets that are out of sequence. Generally, the number of out-of-sequence packets is limited to about 4% of the total data packets exchanged by the TCP connections. Overall, we observed that only 8.8% of all the studied TCP connections experienced any out-of-sequence packets (the percentage varies between 6% and 12%, depending on the trace). These connections contribute, however, a significant fraction of all the data packets (48%). This is not surprising since, intuitively, longer connections are more likely to experience an out-of-sequence event. The last five rows of the table break down the absolute number of out-of-sequence packets according to their cause. We first note that our rules identify the cause of the vast
⁴ Note that even if the TCP connection is symmetric at the measurement point, we cannot assume that it is symmetric along the entire path. For this reason, we can only compute the distance in router hops from the two end hosts, and in AS hops to the two end hosts. The total length of the path, however, may differ from the sum of the two components.
                           CDN                  Tier-1 ISP           Intra-POP            East Coast
Data packets               335,511,737          53,652,349           74,881,505           40,662,075
Out-of-sequence            8,564,463 (2.55%)    2,635,620 (3.73%)    2,704,839 (3.61%)    1,627,040 (4.00%)
Retransmissions            6,802,611 (79.43%)   1,542,495 (58.52%)   1,709,690 (65.27%)   1,041,307 (64.00%)
Unneeded Retransmissions   1,146,633 (13.39%)   307,468 (11.67%)     395,515 (15.10%)     254,397 (15.64%)
Network Duplicates         6,108 (0.07%)        1,772 (0.07%)        2,851 (0.09%)        2,122 (0.13%)
Reorderings                604,311 (7.04%)      682,557 (25.89%)     108,505 (16.06%)     269,677 (16.57%)
Unknown                    7,902 (0.09%)        101,238 (3.84%)      92,856 (3.54%)       60,442 (3.71%)

Table 5.3. Out-of-sequence packet classification
majority of the data packets, with only between 1% and 4% of the out-of-sequence packets falling into the unknown category. Table 5.3 indicates that the bulk of out-of-sequence packets are due to the retransmission of lost packets. Note that there isn’t a one-to-one relationship between a retransmission and an earlier packet loss since a source may retransmit more than one packet to repair a loss. However, the number of retransmitted packets does provide an approximation of the total number of packet drops experienced by a connection since every lost packet will eventually result in a retransmission. We also observe that unneeded retransmissions make up a significant percentage of all out-of-sequence packets. We should clarify that, from the sender’s perspective, these retransmissions are not really “unneeded”. If an ACK is lost or delayed, the sender has no choice but to retransmit.
5.1.3 Network anomalies
The four traces do not show a significant number of network-replicated packets, an event that appears to be rare in all the examined traces. The amount of packet reordering, as a fraction of all data packets, is also small. We observe that the magnitude of reordering we report is smaller than the results presented in a previous work [9]. Our measurements indicate that reordering affects from 0.17% to 0.96% of all the data packets, and that between 0.6% and 5.1% of the connections (with the average being 1.58%) experience reordering. Both these figures are substantially less than those in [9]. We would like to point out, however, that it may not be possible to
directly compare the two results due to important methodological differences. The study in [9] relied on active measurements based on ICMP probes. Such probes may sample the network differently than packets in a TCP connection (on which our results are based). For example, a TCP source reduces its sending rate upon detecting congestion, and hence would not sample the network under such conditions. Also, [9] uses probes which are sent back-to-back in time. Back-to-back packets may experience a higher probability of reordering when traversing a router or switch characterized by a highly parallel architecture. Given these methodological differences, we have little reason to expect results similar to [9]. However, we note that the authors of [9] performed their experiments in 1998 by sending probes through the MAE-East Internet exchange point. They isolated the cause of reordering as parallelism within the main network switching device in the exchange point, i.e., the DEC/Gigaswitch. They further conjectured that reordering would in general be a significant factor in the future Internet, as a result of increased parallelism in network devices. Our results, four years later, instead suggest the contrary: network reordering affects a smaller percentage of all data packets. The study in [43] uses long-lived TCP connections and reports that between 0.3% and 2% of all data packets are reordered in the various data sets. In those data sets, between 12% and 36% of all TCP connections experience at least one reordering event. Although we report a similar percentage of reordered packets, a smaller fraction of flows in our traces are affected by a reordering event. This difference could be due to the use of bulk data transfers in [43], compared to our results, which are computed over connections of various sizes. Although the amount of reordering is limited, it is worthwhile to study the impact of reordering events on a particular TCP connection.
For this purpose we define two additional metrics for reordered packets: “packet lag” and “time lag”. Packet lag refers to the number of packets, with sequence numbers greater than that of the reordered packet, that are seen before the reordered packet itself. Time lag, instead, is defined as the difference between the time a hole in the sequence number space is discovered and the time it is filled by the reordered packet. Packet lag is a useful metric to evaluate the impact of reordering on TCP performance: a lag of three or more packets triggers the fast retransmit algorithm and forces the TCP sender to halve its congestion window. In the four traces, about 93% of the reordered packets have a packet lag of less than three. Thus, in most cases, reordering has only a minimal impact on the throughput of a connection. Time lag, in addition to packet lag, permits one to evaluate more precisely the impact of a reordered packet on the end hosts. For example, if the time lag is very short (i.e., less than the delayed acknowledgment timeout, usually in the 50-100 ms range) and the packet lag is only one, there will be no impact on the throughput of a TCP connection. In our traces, 92% of the reordered packets show time lags of less than 50 ms, and 88% of all the reordered packets have time lags of less than 50 ms and a packet lag of 1.
In summary, our results indicate that the amount of data traffic affected by network anomalies such as packet replication or reordering is small, and that the impact on the performance of the connections, as perceived by the end users, is almost negligible.
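The two lag metrics defined above can be computed directly from the arrival order of data packets at the monitor. The sketch below is a simplified illustration (sequence numbers are counted in packets, timestamps are arbitrary integer units, and duplicates are ignored), not the trace-processing code itself:

```python
def reordering_lags(events):
    """Packet lag and time lag for reordered packets.

    events: (timestamp, seq) pairs in arrival order. A 'hole' opens
    when a packet skips ahead of the next expected sequence number;
    the late packet that fills it is the reordered one.
    Returns {seq: (packet_lag, time_lag)} for reordered packets.
    """
    lags = {}
    seen = []           # sequence numbers in arrival order
    hole_time = {}      # missing seq -> time the hole was discovered
    next_expected = 0
    for ts, seq in events:
        if seq >= next_expected:
            # in-order (or skipping) arrival: open holes for any gap
            for missing in range(next_expected, seq):
                hole_time[missing] = ts
            next_expected = seq + 1
        elif seq in hole_time:
            # late arrival fills a hole: count earlier, higher-seq packets
            packet_lag = sum(1 for s in seen if s > seq)
            lags[seq] = (packet_lag, ts - hole_time.pop(seq))
        seen.append(seq)
    return lags
```

For the arrival order 0, 1, 3, 4, 2 the late packet 2 has a packet lag of two (packets 3 and 4 arrived first) and a time lag equal to the gap between the arrival of packet 3 and the arrival of packet 2.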
5.2 Discussion of results

We have presented measurement results about the frequency and characteristics of out-of-sequence phenomena as measured in the Sprint IP backbone. We shall now explore some issues regarding the representativeness and completeness of our results.
5.2.1 Representativeness of sampled flows
In our previous analysis, we have only considered flows that are symmetric with respect to our measurement point, i.e., flows for which we see packets in both the data and the ACK directions at the measurement point. Note that our methodology does not require a flow to be symmetric along the entire end-end path, but only that we can observe data
                          CDN     Tier-1 ISP   Intra-POP   East Coast
Studied TCP Connections   4,752   520          4,837       1,561
All TCP Connections       4,767   4,399        10,657      8,106

Table 5.4. Number of originating ASes, for studied TCP connections and all TCP connections in the traces
and acknowledgment packets. As noted in Table 5.1, depending on the measurement set, anywhere from 9.2% up to 99% of the flows meet this criterion. An interesting question is whether, after removal of flows that are not symmetric at the measurement point, the remaining flows have characteristics that are representative of the entire set of TCP flows in the trace. We examine this question in three different ways. First, we compare the number of ASes from which symmetric flows originate with that of all flows in our traces. Second, we examine the distribution of path lengths (in router-level hops) for the two sets of TCP flows (all flows, and only the symmetric ones) in our traces. This provides one simple characteristic of the end-end paths traversed by all flows. Third, we compare the frequency of out-of-sequence phenomena between symmetric flows and all flows, in order to examine whether they experience similar conditions along their end-end paths. We infer the origin AS of the source and destination IP addresses of the TCP connections in our traces. The number of originating ASes in our traces is a measure of the geographical diversity of the end-end paths that we study, and Table 5.4 compares the number of ASes from which symmetric flows originate with that for all TCP flows. We find there can be a substantial difference between the number of originating ASes of the flows we study and the number of originating ASes computed for all TCP flows in a trace. For any given trace, this difference seems to be directly proportional to the percentage of symmetric flows in that trace⁵. Overall, combining all the traces, we find that the flows we examine originate in 7,664 unique ASes, while all TCP flows in our traces originate in 11,627 unique ASes.
⁵ The percentage of symmetric flows is in turn dependent on the nature of the link: access links such as “CDN” have a very high percentage of symmetric flows, while peering and backbone links have a substantially smaller fraction of symmetric flows.
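Identifying symmetric flows at the monitor amounts to checking that both directions of a connection are observed there. A minimal sketch, keyed here on endpoint pairs only (a real implementation would key on the full 5-tuple of addresses, ports and protocol):

```python
from collections import defaultdict

def symmetric_flows(observed):
    """Return the flows seen in both directions at the monitor.

    observed: iterable of (src, dst) endpoint pairs, one per packet.
    A flow is symmetric if packets are seen from a to b (data) and
    from b to a (ACKs).
    """
    directions = defaultdict(set)
    for src, dst in observed:
        # the unordered endpoint pair identifies the flow
        directions[frozenset((src, dst))].add((src, dst))
    return [flow for flow, dirs in directions.items() if len(dirs) == 2]
```

In this sketch a flow with packets from "a" to "b" and from "b" to "a" is kept, while a flow seen in only one direction (e.g., only "c" to "d") is discarded, mirroring the filtering applied to the traces.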
[Figure: distribution (left) and cumulative distribution (right) of the number of router hops from the senders to the measurement point, for all TCP connections and for the studied TCP connections]
Figure 5.2. Distribution of the path length (in router-level hops) from the end-hosts to our measurement point
Figure 5.3. Scatter plot of the average number of hops from measurement point, and the % of OoS packets, in our traces
Figure 5.2 shows the distribution of the path length for the connections in all the traces. We note that the distribution of path lengths for the connections we examine (i.e., the symmetric ones) and that of all TCP connections in the traces are somewhat dissimilar. Specifically, a larger fraction of the examined connections are close to the measurement point, as compared to all flows in the traces. In other words, the closer the measurement point is to one of the end points, the higher the chance that both the data and ACK paths of a TCP connection pass through the monitoring point. In order to address the possible impact of this bias in symmetric flows, we examine whether there is any correlation between the distance of a sender from the measurement point and the occurrence of out-of-sequence phenomena. Figure 5.3 is a scatter-plot of the
Figure 5.4. Percentage of Out-of-Sequence Packets in all TCP flows, and symmetric TCP flows
average distance of a sender from the measurement point and the mean percentage of out-of-sequence packets, in 15 different links. In this plot, we consider all flows in these links, and not just those which are symmetric with respect to the measurement point. We observe that there seems to be little correlation between the distance of the sender from the measurement point and the percentage of out-of-sequence packets. In a previous work, Padmanabhan et al. [41] have also noted the lack of correlation between router-level or AS hop count and end-end loss rate. Thus, we argue that though the flows we examine in our traces have end points which are, on average, closer to the measurement point than those of all flows in the traces, this does not impact (or bias) the magnitude of the out-of-sequence phenomena observed. A direct comparison can also be made between the percentage of out-of-sequence packets in all flows and in flows that are symmetric at the measurement point. Note that while we cannot identify the causes behind an out-of-sequence packet for a non-symmetric flow (since we lack acknowledgment information), we can determine the magnitude (if not the cause) of out-of-sequence packets for all flows and for symmetric flows, and then compare these two values.
Figure 5.4 presents the percentage of out-of-sequence packets in symmetric and in all TCP flows in our traces⁶. These numbers are from a larger set of 8 traces, which includes the 4 traces we have described earlier. Associated with each of the 8 traces are two links corresponding to the two directions; we consider each link direction separately, and thus show the percentage of out-of-sequence packets for 16 links. We note that for 7 of the 16 links examined, the relative difference in the percentage of out-of-sequence packets between symmetric flows and all flows is less than 10%, while 12 links show a difference of less than 20%. Although there can exist a discernible difference between the percentage of out-of-sequence packets in the two sets of flows (as in trace 9), we do not observe any trend indicating that symmetric flows have a consistently larger or smaller amount of out-of-sequence packets than the entire set of flows. To summarize, we have explored the representativeness of the studied flows in our traces in two dimensions: the number of ASes these flows originate from, and the distance of the end hosts from the measurement point. These metrics capture the characteristics of the end-end paths traversed by these flows. We observe that though symmetric flows originate from a smaller number of ASes than all flows, they cover a very substantial percentage (60%) of all ASes in the Internet. We also find that the end points of the symmetric flows tend to be, on average, closer to the measurement point than those of all flows. However, since there is little correlation between the occurrence of out-of-sequence phenomena and the length of a connection’s end-end path, this does not bias the magnitude of the out-of-sequence phenomena observed at the measurement point. We further verified this conjecture by comparing the percentage of out-of-sequence packets in symmetric flows with that in all flows in our traces, and found no evidence that symmetric flows consistently
⁶ In our calculations, we do not consider “truncated” flows, i.e., flows for which we did not observe the SYN or SYN-ACK packet. Such flows are those that were already active before the packet trace collection, or that experienced a re-routing event. We discard them because we cannot derive how many packets such flows transmitted before being captured by our measurement equipment.
under (or over)-estimate the magnitude of out-of-sequence phenomena as compared to all flows.

5.2.2 Measuring end-end reordering/duplication frequency

Our methodology classifies those out-of-sequence events that manifest themselves at the measurement point. The classification rules are based only on the observation of data packets from the TCP sender, while the acknowledgments from the receiver are used only to infer the state at the sender (e.g., fast retransmit/recovery state). This is sufficient to capture out-of-sequence packets due to packet drops either before or after the measurement point (except when an entire window of packets is lost before the measurement point), since the consequences of such events (packet retransmissions) are subsequently observed at the measurement point.
However, a closer look at acknowledgments could allow us to identify additional out-of-sequence events that would not manifest in any way in the sequence of data packets observed at the measurement point. A clear example is when two previously in-order data packets are reordered (or duplicated) after the measurement point, on their way to the receiver. In this case, the receiver immediately responds with a duplicate ACK (an ACK for the packet it was expecting to receive). Thus, looking at duplicate ACKs, one could infer additional reordering/duplication events. However, there are several issues with this approach:
• A duplicate ACK may be generated upon the receipt of a packet that our methodology has already processed and classified as out-of-sequence; we would have to eliminate from our analysis duplicate ACKs for packets that have already been classified as out-of-sequence.
• A receiver sends duplicate ACKs upon the receipt of unneeded retransmissions of a packet, indicating to the sender that it is expecting new data.
• Duplicate ACKs are also used to update the receiver advertised window.
Figure 5.5. Packet reordering between the measurement point and the receiver
• If the receiver follows a delayed-ACK policy, some out-of-sequence events will not trigger a duplicate ACK. For example, in Figure 5.5, packets with sequence numbers x, x+1 and x+2 are sent in a burst. When the receiver observes packet x it refrains from sending an ACK, since it adopts the delayed-ACK policy. Now packets x+1 and x+2 are reordered between the measurement point and the receiver, and this triggers an ACK from the receiver for packet x+1. However, this is not a duplicate ACK. In general, if packets are reordered after the measurement point, the resulting packet lag of the reordered packet is 1, and the receiver adopts a delayed-ACK policy, no duplicate ACKs will be observed at the measurement point for this reordering event.
In order to get a rough estimate of the out-of-sequence events (reordering and in-network duplication) that take place after the measurement point, we count all the duplicate ACKs and remove those that are clearly associated with the scenarios mentioned above. Table 5.5 lists the percentage of data packets inferred to be reordered or duplicated after the measurement point. Referring back to Table 5.3, we can sum the figures for reordering before and after the measurement point to get an estimate of the end-end reordering/duplication in these traces: 1.13%, 1.75%, 1.02% and 1.29%, respectively.
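The filtering step above can be sketched as simple set subtraction over duplicate-ACK records. The record keys and set names here are hypothetical, chosen for illustration only:

```python
def post_monitor_oos_estimate(dup_acks, already_classified,
                              unneeded_rexmit_acks, window_update_acks):
    """Rough count of reordering/duplication events after the monitor.

    dup_acks: set of duplicate-ACK record identifiers observed at the
    measurement point. The three subtracted sets correspond to the
    scenarios listed in the text: dup ACKs for packets already
    classified out-of-sequence, dup ACKs triggered by unneeded
    retransmissions, and dup ACKs that merely update the advertised
    window. The remainder suggests reordering/duplication downstream.
    """
    explained = already_classified | unneeded_rexmit_acks | window_update_acks
    return len(dup_acks - explained)
```

As the delayed-ACK bullet above shows, this remains an underestimate: reordering with a packet lag of 1 at a delayed-ACK receiver produces no duplicate ACK at all.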
                       CDN     Tier-1 ISP   Intra-POP   East Coast
Reordered/Duplicated   0.96%   0.79%        0.45%       0.63%

Table 5.5. Percentage of packets reordered or duplicated after the measurement point
                              East Coast         Intra-POP          Transcontinental
Link speed                    OC-48 (2.5 Gbps)   OC-48 (2.5 Gbps)   OC-48 (2.5 Gbps)
No. of links                  1                  1                  2
Unique source ASes            1,565              4,837              2,326
TCP connections               844K               6M                 4.9M
Percentage of all TCP connx   9.20%              25.43%             19.20%
TCP Data packets              18M                74M                110M

Table 5.6. Summary of the traces for TCP analysis
5.3 Characteristics of TCP Connections
We have now described results from the use of our classification techniques in quantifying the occurrence of packet retransmissions, reorderings and replications along the end-end paths of TCP connections going over the Sprint backbone network. We now turn to questions about the characteristics and behavior of the TCP senders themselves. We first look into the distribution of congestion window sizes and the variability in the RTTs of these TCP connections. We then quantify the prevalence of different TCP congestion control flavors. This is of interest in order to understand how widely deployed the improvements in TCP’s congestion control mechanism (such as NewReno) are. We then look into the question of whether a TCP sender is constrained by the network or by application behavior. From a network operator’s point of view, this helps determine the causes that limit a TCP sender’s throughput.
Table 5.6 presents a summary of the data sets we use for the analysis that follows. All the data sets are one hour long and contain a total of approximately 11M TCP connections. Overall, the data sets contain TCP connections originating in 5,819 unique ASes. This represents around 45.3% of the total number of allocated ASes at the time of the trace collection (we counted 12,848 ASes as of November 21st, 2002).
5.3.1
Congestion window
The first metric of interest is the maximum window that a TCP sender reaches during its connection lifetime. As mentioned earlier, the maximum window size is limited by several factors, including the receiver's advertised window.⁷ Figure 5.7 shows the empirical cumulative distribution function of the maximum window size during the lifetime of the connections. The five curves correspond to different thresholds for the size of the flows.⁸ The curve with the minimum threshold, five data packets, has the steepest slope, since it includes many small connections that do not have the opportunity to expand their window sizes. The median value of the maximum window for these flows is eight packets, while 80% of the flows have a maximum window size of less than 11. We observe that as the flow size threshold increases, the distributions become more similar, with a median maximum window size of approximately 10 to 12 packets. This could be because larger connections have a higher probability of either incurring a loss or being restricted by the receiver-advertised window. Additionally, the distributions show a set of spikes at values corresponding to maximum window sizes of 6, 12, or 16 packets. These values correspond to connections that are restricted by commonly-used advertised window values of 8 and 16 Kbytes, with an MSS of either 536 or 1460 bytes.

Although most of the senders have a relatively small maximum window size, most of the packets belong to connections with large window sizes. Figure 5.8 plots the cumulative distribution of packets that belong to a connection with a given maximum window size. Looking at the curve for flows with at least 5 data packets, we observe that although 80% of such flows have a maximum window size of less than 11 data packets, they carry only 45% of the packets. We also observe spikes in the distribution of Figure 5.8 similar to those in the previous figure, again corresponding to commonly used values for receiver-advertised windows.

⁷ As discussed in Section 3.4, we remove senders for which we are unsure of the receiver window from this analysis.

⁸ We have combined data from all three traces for this analysis, since the distributions of window size were similar across the traces. Please refer to [26] for a per-trace analysis.
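The spikes in the window distributions follow from simple arithmetic: a connection capped by an advertised window of W bytes can have at most ⌈W/MSS⌉ packets outstanding. A quick check (my own illustration, not code from the dissertation):

```python
import math

def max_window_pkts(adv_window_bytes, mss_bytes):
    # Packets outstanding when capped by the receiver-advertised window.
    return math.ceil(adv_window_bytes / mss_bytes)

# Common advertised windows of 8 and 16 KB with an MSS of 1460 or 536
# bytes reproduce the spikes at 6, 12 and 16 packets noted in the text
# (the fourth combination, 16 KB with MSS 536, gives 31 packets).
print(max_window_pkts(8 * 1024, 1460),   # 6
      max_window_pkts(16 * 1024, 1460),  # 12
      max_window_pkts(8 * 1024, 536))    # 16
```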
Figure 5.6. Cumulative fraction of senders as a function of the maximum window. (Two panels; top: senders with more than 5 packets, bottom: senders with more than 50 packets. x-axis: maximum sender window in packets, log scale; curves for Intra-POP, Transcontinental and East Coast.)
Figure 5.7. Cumulative fraction of senders as a function of the maximum window. (x-axis: maximum sender window, log scale; one curve per flow-size threshold: > 5, > 10, > 25, > 50 and > 100 packets.)
Figure 5.8. Cumulative fraction of packets as a function of the sender's maximum window. (x-axis: maximum sender window, log scale; one curve per flow-size threshold: > 5, > 10, > 25, > 50 and > 100 packets.)
Overall, the throughput of 4% of the connections is limited by the receiver-advertised window at some point in their lifetime. These connections account for about 40% of the packets. If we look at connections with at least 10, 25 and 50 data packets, we observe that 44%, 62% and 72% of such connections, respectively, are limited by the receiver window.

5.3.2 TCP flavors
As described in Section 3.2, congestion window evolution can depend on the flavor of TCP congestion control. Table 5.7 gives the number of connections and packets in the three data sets that conform to a specific flavor. The flavor of a particular TCP sender is defined to be that flavor whose estimated cwnd values resulted in the minimum number of observed violations, i.e., packets sent that exceeded the estimated available window. Senders with no window violations are classified as "indistinguishable", indicating that the specific flavor does not affect the connection throughput. In the case of parity between TCP Reno and NewReno violations, we count the sender as a Reno sender.

In Table 5.7 we consider only those senders that transmit more than 13 data packets. This way we only consider connections that have the opportunity to experience a fast retransmit. These senders account for about 8% of all senders but contribute almost 78% of data packets. Interestingly, the behavior of 97.05% of these senders does not allow us to distinguish a particular flavor of congestion control. This is because the congestion control flavors manifest differences only in the way in which they respond to the receipt of three duplicate acknowledgments. In our traces, this event is relatively rare: only 5% of the senders experience a triple duplicate ACK. Moreover, even after a fast retransmit, the three flavors will exhibit different behavior only in the presence of specific loss patterns. Overall, only 2.95% of the senders have the opportunity to exercise this flavor-manifesting behavior over their lifetimes.

Table 5.7 also shows the number of packets that belong to connections of different flavors. Using this metric, the fraction of packets that belong to the "indistinguishable" category drops to 67.5%. This is expected, given that large connections have a higher
                    TCP Senders       Data Packets
Senders ≥ 13 pkts   1.56M             139M
Indistinguishable   1.51M (97.05%)    94M (67.51%)
Tahoe               640 (0.04%)       340K (0.17%)
Reno                28K (1.84%)       18.5M (13.33%)
NewReno             17K (1.10%)       26M (19.17%)

Table 5.7. TCP flavors
Figure 5.9. Percentage of Reno/NewReno senders (above) and packets (below) as a function of the data packets to transmit. (x-axis: threshold in packets, log scale; curves for Indistinguishable, NewReno and Reno.)
probability of experiencing losses that allow us to classify the flow as a specific flavor of TCP. Also, the NewReno connections seem to carry a disproportionately large number of packets when compared to Reno senders. But this is expected as well, given that the difference between Reno and NewReno manifests itself only in the presence of multiple packet losses in the same window of data. The likelihood of such an event is higher for long connections with large windows. In order to confirm our conjectures, we plot in Figure 5.9 the percentage of senders and packets classified as Reno or NewReno as a function of the number of data packets sent. As expected, the number of senders that fall into the "indistinguishable" category decreases as the number of packets sent increases. Also, as the number of data packets sent increases, NewReno appears to be the more prominent TCP flavor, both in terms of senders and packets.
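The classification criterion can be summarized in a few lines of Python. This is an illustrative sketch only: the `violations` mapping is a hypothetical input, and computing it requires the cwnd-tracking machinery of Section 3.2.

```python
def classify_flavor(violations):
    """Pick the TCP flavor whose estimated cwnd evolution yields the
    fewest observed violations (packets sent beyond the estimated
    available window). A sender with no violations at all is
    'indistinguishable'; a tie is broken in favor of Reno, as the text
    specifies for Reno/NewReno parity. `violations` maps flavor name
    to violation count (a hypothetical input structure)."""
    if all(v == 0 for v in violations.values()):
        return "indistinguishable"
    # sort key: violation count first, then prefer Reno on ties
    return min(violations, key=lambda f: (violations[f], f != "Reno"))
```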
5.3.3 Greedy senders

A sender is defined as "greedy" if at all times the number of unacknowledged packets in the network equals the available window size. Under normal conditions, the measurement point would observe a full window of data being sent, a set of acknowledgments being returned, a new (larger) full window of data packets being sent, and so on. In this model of sender behavior, a "flight" is a set of packets sent "together", separated from subsequent flights by the ACKs from the receiver.

Using our estimate of the window available to the sender, and this notion of flights of packets, we propose a simple heuristic to identify not-greedy senders: if the number of unacknowledged packets (also defined as the flight size) at the end of a flight is less than the available window at the beginning of that flight, then the sender is not greedy. The measurement point identifies the beginning of a new flight (and the end of the current flight) as the first data packet observed after an ACK that acknowledges a data packet in the current flight.⁹

The basic assumption of this heuristic is that a set of packets sent in a "flight" by the sender is never interleaved with acknowledgments for packets in that flight, i.e., that all acknowledgments that arrive in the "middle" of a flight cover data packets belonging to previous flights. The validity of this assumption depends on the position of the measurement point with respect to the sender and receiver. Indeed, if the measurement point is very close to the receiver, each data packet will be immediately acknowledged (or every other data packet, in the presence of delayed acknowledgments). Hence, the measurement point will estimate a flight size of one or, at most, two packets and deem the sender not greedy.
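The flight-level test can be sketched as follows. This is a simplified illustration: segmenting packets into flights and estimating per-flight windows is assumed to have been done elsewhere, and the argument names are mine.

```python
def is_greedy(flight_sizes, windows_at_start):
    """A sender is flagged not-greedy if, at the end of any flight, the
    number of unacknowledged packets (the flight size) is less than the
    window available at the beginning of that flight; otherwise it is
    greedy. `flight_sizes` lists the observed size of each flight and
    `windows_at_start` the matching available windows (hypothetical,
    pre-computed inputs)."""
    return all(f >= w for f, w in zip(flight_sizes, windows_at_start))
```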
In order to identify those connections for which the measurement point is "too close" to the receiver, we define the "ACK-time" as the time between the observation of a data packet and the observation of the acknowledgment that covers that data packet. In our discussion,

⁹ In the presence of packet losses, the measurement point needs to wait for the end of the recovery phase.
Figure 5.10. Fraction of greedy senders based on the distance between measurement point and receiver. (x-axis: ACK-time / RTT, in four bins: less than 0.25, 0.25 to 0.50, 0.50 to 0.75, more than 0.75; bars for senders greedy on all flights and greedy for 75% of the flights.)
we use the ratio between the ACK-time and the RTT (called the ACK-time ratio) as an indication of the proximity of the measurement point to the receiver.

To summarize, we split the lifetime of a connection into "flights" and test, at the end of each flight, whether the number of packets sent equals the sender's window at the beginning of the flight. We only examine senders that transmit more than 4 flights (i.e., send more than 13 data packets). Nearly 54% of these senders are greedy. In order to test the impact of the proximity to the receivers on our classification, we group the senders into four categories depending on their ACK-time ratio (Figure 5.10). The figure shows that the fraction of greedy senders is largest in the category with an ACK-time ratio of at least 0.75 (the measurement point is far from the receivers). In this category, as much as 68% of all senders are greedy for all their flights, and nearly 79% of senders are greedy for 75% of the flights transmitted. We can consider the greediness of these senders to be a close approximation of the percentage of actual greedy senders.

However, it is important to know whether this is a representative set of senders. In order to address this question, we compare the distribution of the sizes of the connections with an ACK-time ratio greater than 0.75 with that of all other connections. Figure 5.11 shows the quantile-quantile plot of the log of the connection sizes. We observe an almost perfect linear fit for flow sizes up to 100 data packets (which corresponds to a little more than 90% of all flows in these two populations), and a linear-seeming trend for the higher percentiles. This lends confidence to our assumption that
Figure 5.11. qq-plot of flow size between flows with large ACK times, and all flows. (Axes: log10 of size in packets; x-axis: senders with ACK/RTT > 0.75, y-axis: all senders.)

Figure 5.12. Top: CDF of minimum RTT; Bottom: CDF of median RTT. (x-axis: RTT in msecs, log scale; curves for East Coast, Transcontinental and Intra-POP.)
the connections with large ACK-time ratio are representative of the total population of connections in our traces.

5.3.4 Round trip times

Figure 5.12 plots the minimum and median values of the RTTs of the connections in the three data sets. We only plot RTTs for those connections with at least 10 RTT samples (in both directions). We choose this threshold to discount the effects of single erroneous values, and also to have enough samples to examine the variability of the round trip time within the lifetime of a flow. We observe that for 50% of the examined senders the minimum value of the RTT is less than 150-200 msecs. Note, however, that the RTT can range from less than 10 msecs to as high as 10 secs. The distribution of the median value of the RTT over the lifetime of the flow
Figure 5.13. Variability of RTT. Top: ratio of 95th to 5th percentile; Bottom: difference between 95th and 5th percentiles (in msecs). (x-axis log scale; curves for East Coast, Transcontinental and Intra POP.)
is similar in shape to that of the minimum RTT; for 50% of the senders, this value is less than 300-400 msecs.

Next, in Figure 5.13 we look into the RTT variability within a connection. To this end, we consider the 95th percentile and the 5th percentile of the RTT samples. In the top plot of Figure 5.13 we look at the ratio between these two values. Quite consistently across the three data sets, for as much as 50-60% of all connections the 95th-percentile RTT is less than two times the 5th percentile, while for nearly 40% of all senders this ratio ranges between 2 and 6. In absolute terms, 50% of the connections exhibit an RTT variability of less than 200-300 ms (bottom plot in Figure 5.13). Similar observations about the variability in TCP round-trip times have also recently been reported in [3], using measurements taken at a university access link.¹⁰ In their work, the authors report that for 40% of the senders, the ratio between the 90th-percentile and the minimum RTT sample of a connection is less than 2, and for about 50% of the senders, this ratio ranges between 2 and 10.

¹⁰ Given that the measurement point is close to the sender in [3], the authors compute the RTT as the time difference between a data packet and its corresponding ACK.
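The variability metrics are straightforward to compute from a connection's RTT samples. A minimal sketch, using a simple nearest-rank percentile (my choice of estimator, not necessarily the one used in the dissertation):

```python
def rtt_variability(samples):
    """Return the 95th/5th-percentile ratio and difference for one
    connection's RTT samples; the text restricts this analysis to
    connections with at least 10 samples."""
    s = sorted(samples)
    def pct(p):
        # nearest-rank percentile over the sorted samples
        return s[max(0, min(len(s) - 1, round(p * (len(s) - 1))))]
    p95, p5 = pct(0.95), pct(0.05)
    return p95 / p5, p95 - p5
```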
5.4 Conclusions

In this chapter we have applied our techniques to infer performance characteristics of TCP senders and their end-end paths from passive measurements taken within a large Tier-1 backbone network. While our techniques have been validated through simulations and in a wide-area testbed, empirical validation alone is insufficient, since monitored paths can exhibit a wide range of network properties, many of which may occur only rarely. In the next chapter, we present a novel approach, combining model checking and formal reasoning, to exhaustively identify all possible events in the network for which the inference rules can produce incorrect results.
CHAPTER 6

VERIFICATION
As we have mentioned earlier, a difficult challenge for passive-measurement-based inference is that an inference technique must deal with highly variable network conditions. It is thus imperative to ascertain under what conditions a given technique will work correctly. We now introduce a new approach for the evaluation of passive-measurement-based inference techniques, based on formal analysis.

Towards this end, we revisit an earlier described inference problem: the identification of retransmitted and reordered packets within a packet stream, as observed by a monitoring point in the middle of a TCP connection's end-end path. This problem plays an important role in network troubleshooting and diagnosis, since packet retransmissions are a measure of the end-end congestion and loss rate in the path, while reorderings and replications may indicate the presence of faulty equipment, route flaps, or routing loops.

Our approach relies on model checking and reasoning about protocol and network behavior. The main challenge in model checking is dealing with the state space explosion problem. This problem occurs in systems with many components that can interact with each other, or in systems with data structures that can assume many different values. In our case, the problem surfaces because the communication protocol has input parameters (e.g., sequence numbers), each of which takes values from a very large set. We will use a model checker to discover whether the classification rules fail for some specific values (or a bounded range of values) of these parameters. Then, with these cases as a guide, and by reasoning about protocol and network behavior, we establish properties of the classification rules for all possible values of the input parameters.
Our contributions in this chapter can be summarized as follows. We consider classification rules for the detection and classification of packet retransmissions and reorderings by a passive monitor. We verify the correctness of these rules using the SPIN model checker [23], with the end-points implementing the Go-back-N protocol (GB(N)). For small values of N, we identify the set of events in the network for which the classification rules cannot detect packet retransmissions. By reasoning about the behavior of the GB(N) protocol, we then prove properties about the classification rules for any value of N. We then consider the case of the end-points implementing a TCP-like protocol, termed FastRetx(N), and again verify the correctness of the classification rules.

There exists a significant body of literature on the formal verification of communication protocols [47, 48], and specifically on the model checking approach and the use of tools such as SPIN [11, 2, 19]. However, to the best of our knowledge, our work is the first to use the tools of formal verification for the evaluation of measurement techniques.

This chapter is organized as follows. We begin with a discussion of related work in Section 2. We precisely define our verification problem and give a high-level road map of our approach in Section 3. Section 4 describes in detail how we implement the different components of the system and carry out the verification in SPIN. In Section 5, we introduce the GB(N) protocol and verify our inference rules using SPIN for the cases N = 2, 3, 4. We then provide a formal proof in Section 6 to extend these results to any value of N. Then, in Section 7, we look at a more complicated protocol, termed FastRetx(N), that has TCP-like loss recovery mechanisms, and verify the classification rules for this protocol. We then take a step back and discuss the implications of our observations for the different flavors of TCP in Section 8.
Finally, we conclude with a discussion of future directions in Section 9.
6.1 Related work

There exists a significant body of literature on the formal verification of communication protocols. The technique of model checking has also been extensively used to verify network protocols. In [39] and [38] the authors use the model checker CMC to identify bugs in the Linux TCP implementation and the AODV routing protocol. The SPIN model checker, in particular, has been widely used, e.g., in [2, 19], to verify protocol properties. For a further extended survey of formal verification applications using SPIN, the reader is referred to [23] and the proceedings of the SPIN workshops [1]. There also exist other works in the literature that, similar to our approach, use a combination of model checking and formal reasoning for protocol verification: e.g., [47, 48] take a formal reasoning approach to prove properties about the Transaction-TCP and TCP-SACK protocols, and [11] uses a combination of SPIN and formal reasoning to verify properties of distance vector routing protocols. Our work is the first to use the tools of formal verification for the evaluation of measurement techniques.

Verification of a measurement technique has to take into account the interactions between the protocol sender and receiver behavior, the channel behavior, and the inference technique itself. In subsequent sections we describe the framework with which we verify the classification rules at a passive measurement point.
6.2 Problem definition and approach

Our inference problem is as follows. Consider a data sender and a receiver communicating through a protocol that ensures reliable in-order delivery of messages. Messages can be dropped and reordered in the end-end path. A passive monitor is located in the path and observes all packets between the sender and the receiver. Figure 6.1 illustrates this scenario. The measurement point implements inference rules to identify all packet retransmissions and reorderings in the connection between the sender and the receiver. Our goal is to verify whether these rules can correctly identify all retransmitted and reordered packets that appear in the end-end path.
Figure 6.1. The network scenario: a passive monitor between the sender and the receiver; loss and reordering occur on the path.
We model the various components of the network, namely the protocol behavior at the sender and receiver, the communication channels (which induce packet loss and reordering) and the measurement point's inference rules, using a framework similar to the Communicating Finite State Machine (CFSM) framework that has previously been used [37] for protocol specification and for modelling a network. A CFSM models the nodes of a network as a set of finite state machines, together with a set of channels that represent the communication channels between any two nodes. Thus, it is a natural and simple formalism in which to specify our network model.

As a first step we implement this model in the SPIN model checker [23]. The SPIN model checking system has been widely used to verify communication protocols. The protocol and the inference heuristics are specified using the Promela language, and a protocol simulator can perform random or guided simulations. The model checker performs an exhaustive state-space search to verify that a given property holds under all possible simulations of the system. In our case, the model checker generates all possible packet loss and reordering events in the channels. It then checks which (if any) retransmissions and reorderings are not detected by the measurement point's inference rules.

Model checking has a number of advantages over traditional approaches based on simulation, testing, and deductive reasoning. In particular, model checking is automatic and usually quite fast. Also, if the design contains an error, model checking will produce a counterexample that can be used to pinpoint the source of the error.
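To make the channel daemons' role concrete, here is a Python analogue of the nondeterministic choice the model checker explores at each step; in the actual system these are Promela processes, so the function below is purely illustrative.

```python
def channel_daemon(channel, action):
    """Apply one channel-daemon action to a FIFO channel (a list),
    mirroring the three choices the model checker explores
    nondeterministically: 'pass' the head packet through untouched,
    'drop' it, or 'reorder' it with the next packet. Returns the new
    channel contents (a hypothetical representation)."""
    if not channel:
        return channel
    if action == "drop":
        return channel[1:]
    if action == "reorder" and len(channel) > 1:
        return [channel[1], channel[0]] + channel[2:]
    return list(channel)  # 'pass': no interference
```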
Figure 6.2. The network model: an application drives the sender (Ms), which is connected to the measurement point (Minf) and the receiver (Mr) through channel daemons (Mc1, Mc2, Mc3) via the channels Csc1, Cc1inf, Cinfc2, Cc2r, Crc3 and Cc3s.
However, as discussed earlier, the model checker is limited by the state explosion problem. Hence it can handle only some specific and simple values of the protocol’s input parameters. Using the cases generated by the model checker for which the inference rules are found to be incorrect, we extend the results for all values of the protocol’s input parameters through formal reasoning about the protocol behavior.
6.3 Specification and Verification in SPIN

In this section, we present a detailed specification of the state machines of the system and the steps leading to the verification in SPIN. The system consists of the following components:

Sender and Receiver. These two state machines (Ms and Mr) implement the transport protocol that governs the communication between the end-hosts. In [28] we provide the Promela code for the various transport protocols that we investigate.

Channel daemons. Mc1, Mc2 and Mc3 represent the channel daemons between the sender and the measurement point, the measurement point and the receiver, and the receiver and the sender, respectively. These daemons create loss and reordering events in the communication channels between the two end points. As described in the state machine description in [28], these daemons snoop on the channel and, if a packet is present, take one of three actions: (i) let the packet go through with no interference; (ii)
remove and drop the packet from the channel; (iii) remove the packet, change its ordering with the subsequent packet, and put them back in the channel.

Measurement point. Minf represents the measurement point daemon. This daemon passively observes packets passing between the protocol sender and receiver, and encodes the rules for detecting and distinguishing between retransmissions and reorderings. The classification rules assume certain properties of protocol behavior, namely that packets carry monotonically increasing sequence numbers, and that transmission is reliable, i.e., lost packets are retransmitted until received successfully. The inference rules also assume that the timescale at which reordering events occur in the network is smaller than the round trip propagation delay of the connection.

Any data packet observed at the measurement point is referred to by the notation (x, t), where x is the sequence number of the packet and t is the time at which the packet was observed. A packet is defined to be out-of-sequence if it has a sequence number x less than or equal to that of a previously observed packet. When the measurement point observes an out-of-sequence packet, it classifies it as a retransmission or a reordering based on the following rules:

• classify.retx: Let (x, t_orig) refer to the packet with sequence number x sent by the sender and observed at time t_orig. If, after some time, we observe an out-of-sequence packet (x, t_oos) with the same sequence number x and t_oos > t_orig, then (x, t_oos) is a retransmission.

• classify.reord: Suppose the measurement point observes an out-of-sequence packet (x, t_oos), and let t_inseq (t_inseq < t_oos) be the earliest time at which a packet with a sequence number greater than x is observed. We then look at the timelag t_oos − t_inseq:

– if t_oos − t_inseq ≥ RTO, (x, t_oos) is a retransmission

– if t_oos − t_inseq < RTT, (x, t_oos) is a reordering
Here RTO is the value of the sender's retransmission timer (which we assume the measurement point knows) and RTT is the measurement point's estimate of the round trip delay of the end-end path. We also introduce Csc1, Cc1inf, Cinfc2, Cc2r, Crc3 and Cc3s as the channels between the various components. A representation of the system is shown in Figure 6.2.

We must now choose the communication protocol used by the sender and the receiver. As a first step, we study the inference techniques with the end points implementing the Go-back-N (GB(N)) protocol. We choose GB(N) because it is the simplest window-based protocol, and a tractable first step in analyzing more complicated window-based protocols such as TCP. Later, we extend our analysis to a more complicated protocol which, while retaining features of GB(N) such as a fixed window and a go-back-N timeout mechanism, also incorporates TCP-like mechanisms such as loss recovery through fast retransmit. Details of the two protocols are presented in subsequent sections.

Our goal is to determine whether all retransmitted and reordered packets that reach the measurement point are detected as out-of-sequence by the measurement point, and then classified correctly as either a retransmission or a reordering. In order to perform this check with SPIN, we introduce a type field in the header of any packet transmitted by the sender. This field can take one of three values: ⟨Norm, Retx, Reord⟩. When a packet is transmitted by the sender for the first time, its type field is set to Norm. If this packet is dropped by the channel daemon process, then when a packet with the same sequence number is retransmitted by the sender, its type is set to Retx. Similarly, if the channel daemon reorders the position of this packet with a packet sent later in time by the sender, it rewrites the type field to Reord. The model checker can generate all network events that would cause packet retransmissions and reorderings.
When the measurement point observes a packet, it checks whether the packet is out-of-sequence, i.e., whether it has a sequence number less than or equal to that of a previously observed packet. If so, the measurement point invokes the rules classify.retx and classify.reord and classifies the packet as a retransmission or a reordering. Finally, it compares its decision with the type field in the packet; if the two do not match, we have a case in which a packet was misclassified by the measurement point.
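The rules classify.retx and classify.reord can be expressed compactly. The Python sketch below is my paraphrase of the rules: the `seen` table and the 'unclassified' return for a timelag falling between RTT and RTO (a case the rules leave unspecified) are my own additions.

```python
def classify_oos(x, t_oos, seen, rtt, rto):
    """Classify an out-of-sequence packet (x, t_oos). `seen` maps
    sequence numbers already observed at the measurement point to the
    time of their first observation (hypothetical bookkeeping)."""
    if x in seen:
        return "retransmission"      # classify.retx: same seq seen before
    # earliest time a packet with a sequence number greater than x was seen
    t_inseq = min(t for s, t in seen.items() if s > x)
    lag = t_oos - t_inseq
    if lag >= rto:
        return "retransmission"      # classify.reord: timelag >= RTO
    if lag < rtt:
        return "reordering"          # classify.reord: timelag < RTT
    return "unclassified"            # timelag in [RTT, RTO): unspecified
```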
6.3.1 Discussion of model assumptions

In this section we have described our system model in detail. As stated, our primary goal is to verify whether the measurement point can detect and correctly classify all packet retransmissions and reorderings. We have made several assumptions in our model to simplify the verification process. We now discuss how (and if) each of these assumptions impacts the completeness of our study.

Firstly, we have excluded the occurrence of packet replications in the end-end path. While our intention is to allow all possible events that result in out-of-sequence packets, in order to comprehensively test the classification rules, we exclude packet replications because it makes verification simpler, and because several measurement studies [26, 43] have indicated that packet replications are an extremely rare phenomenon in the Internet.

Our classification rules determine whether an out-of-sequence packet is a retransmission or a reordered packet by comparing its timelag with the connection's RTT and RTO. We assume that the measurement point always has an accurate estimate of the connection's RTT and RTO (the reader is referred to [27, ?] for techniques that provide estimates of, e.g., a TCP connection's RTT/RTO from a passive measurement point). We also impose restrictions on the time scale of packet reordering and retransmission events. It is clear that our classification rules may give incorrect results if the measurement point has an incorrect estimate of the sender's RTT and RTO, or if the time scales of these events do not conform to our assumptions.

However, our goal in this verification exercise is not to verify the assumptions around which the classification rules were designed. Instead, we aim to investigate whether there
exist network scenarios that would cause the classification rules to give incorrect results even when all the assumptions hold. As with all verification efforts, the results of the verification process hold only when the stated assumptions hold true.

Finally, we restrict reordering events to occur only between successive packets. We make this simplifying assumption since the only aspect of packet reordering that determines the decision of the classification rules is the timelag of the reordered packet. Thus, reordering with a subsequent packet and reordering across multiple packets are equivalent events, as long as the timelag of the reordered packet satisfies our assumptions about the time scale of reordering events.
6.4 Verifying inference heuristics for GB(N)

In this section we consider the GB(N) protocol and prove its correctness even in the case of in-network reordering of packets. We then show the correctness of the measurement point's out-of-sequence inference technique using SPIN.
6.4.1 Correctness of GB(N) in the presence of reordering

The GB(N) protocol, in brief, operates as follows. The sender can have as many as N outstanding packets in the channel at any point in time. The receiver has a buffer of one, and hence only accepts packets that are in sequence. When a packet is dropped, the sender recovers from the loss only by using a timeout mechanism: if no acknowledgement is received before the timeout expires, the sender retransmits all packets that are as yet unacknowledged. More details of the GB(N) protocol can be found in [49, 31], and descriptions of the GB(N) sender and receiver state machines are presented in [28].

GB(N) uses a sequence number variable carried by both data and ACK packets. The range of values taken by this variable impacts the model checking process – a large range leads to a state space explosion when the model checker performs an exhaustive state space search. In order to get around this problem, we use a version of the GB(N) protocol that
Figure 6.3. Timelag of a reordering event: the packet with sequence number x is reordered with the subsequently sent sequence number x + 1; the timelag is t3 − t2.
uses a small bounded range of sequence numbers. However, this requires us to ensure that the protocol behaves correctly, in the presence of packet loss and reordering, despite this constraint. The protocol behaves correctly if the receiver never releases a packet out of the correct order to the higher layer. More formally, let {p1, p2, ..., pn} be packets released by the sender, with their indices indicating the order in which they are sent, and let {x1, x2, ..., xn} be their respective sequence numbers. A protocol behaves correctly if the receiver (by examining the sequence numbers of the received packets) always delivers these packets to the higher layer in the order in which they were sent. It has been noted earlier [10] that the GB(N) protocol behaves correctly if it adopts a sequence number range of size m, where m > N. However, the proof of this property assumes that packets are delivered in sequence by the communication channel between the sender and the receiver. This assumption does not hold in our context, since the channel daemons can reorder packets along the end-end path. We now establish conditions under which the GB(N) protocol behaves correctly with a bounded range of sequence numbers in the presence of reordering in the end-end path. We first define the notion of the timelag of a reordering event. Let a packet with sequence number x be released by the sender at time t1. Let x + 1, x + 2 be the sequence numbers of packets released by the sender at points in time later than t1. Now suppose packet x is reordered, i.e., it arrives at the receiver later than some packet with sequence number > x. Also let t2 be the time at which the first of any packets with sequence number > x reaches
the receiver, and t3 the time when x finally arrives at the receiver. Then the timelag of this reordering event is t3 − t2 . This is illustrated in Figure 6.3.
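The timelag defined above can be computed directly from a trace of receiver-side arrivals. The following sketch (our own illustration; the record format is assumed, not taken from the measurement infrastructure) applies the definition to a list of (sequence number, arrival time) pairs:

```python
def reordering_timelag(arrivals, x):
    """arrivals: (seq, time) pairs in receiver arrival order.
    Returns the timelag t3 - t2 of packet x's reordering event,
    or None if x was not reordered."""
    t2 = None  # arrival time of the first packet with seq > x seen before x
    for seq, t in arrivals:
        if seq == x:
            # x has arrived (time t3); it was reordered only if a
            # later-sequence packet beat it to the receiver
            return (t - t2) if t2 is not None else None
        if seq > x and t2 is None:
            t2 = t
    return None

# The scenario of Figure 6.3: x = 10 arrives after x + 1
trace = [(11, 10), (10, 14), (12, 15)]
print(reordering_timelag(trace, 10))  # 4 (t3 = 14, t2 = 10)
```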
Theorem 6.1 The correctness of the GB(N) protocol is preserved if the sender uses a bounded range of sequence numbers m, where m > 2N, and the timelag of any reordering event is always less than the minimum two-way propagation delay D.

Proof: Suppose a packet, p, with sequence number (S mod m) is released by the sender and reordered in the channel (i.e., it is exchanged with a packet sent later in time by the sender). Now suppose the maximum sequence number observed at the receiver before packet p eventually reaches the receiver is ((S + x) mod m). First, consider the case when x < 2N. Since m > 2N, it is easy to see that packet p could never be erroneously accepted by the receiver, since there could not have been any sequence number wrap-around between (S mod m) and ((S + x) mod m). Now consider the case when x ≥ 2N, i.e., ((S + 2N) mod m) has been observed at the receiver. This would imply that the packet with sequence number ((S + N) mod m) has also been observed by the receiver, since sequence number ((S + 2N) mod m) could have been sent only after the sender received an ACK for ((S + N) mod m). Now, the interval between when sequence numbers ((S + N) mod m) and ((S + 2N) mod m) arrive at the receiver must be ≥ D. This implies that the reordering timelag of packet p is also greater than D, since the packet with sequence number ((S + N) mod m) was transmitted later than (S mod m). However, we have imposed the restriction on the channel that no reordering event occurs with a timelag larger than D. Thus, when a reordered packet with sequence number (S mod m) arrives at the receiver, the maximum sequence number observed must be less than ((S + 2N) mod m), and so the protocol functions correctly. □
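The wrap-around step of the proof rests on a simple arithmetic fact: when m > 2N, any 2N consecutive sequence numbers map to pairwise-distinct residues mod m, so a reordered packet with number S cannot be confused with any of S + 1, ..., S + 2N − 1. A brute-force check of this fact (illustrative only; not part of the SPIN model):

```python
def residues_distinct(m, span):
    # True if `span` consecutive sequence numbers are pairwise distinct mod m
    return all((a - b) % m != 0
               for a in range(span) for b in range(span) if a != b)

N = 3
print(residues_distinct(2 * N + 1, 2 * N))  # True: m = 2N + 1 suffices
print(residues_distinct(N + 1, 2 * N))      # False: m > N alone does not
```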
6.4.2 Detecting retransmitted and reordered packets

As suggested by Theorem 6.1, we adopt 2N + 1 as the range of sequence numbers used by the GB(N) specification in SPIN and, to ensure the correctness of the protocol, impose the constraint that reordering timelags are less than the round-trip propagation delay of the connection. Apart from these restrictions to guarantee the correctness of GB(N), we make the following additional assumptions about network and sender behavior throughout the rest of this chapter:
Figure 6.4. Cases in which retransmitted and reordered packets are undetected at the measurement point, for GB(3), m = 7.
• The channel daemons do not cause any packet replications. A replicated packet would also manifest itself as an out-of-sequence packet at the measurement point and must be distinguished from retransmissions and reorderings for correct classification. However, measurement studies [25] have shown that packet replications are extremely rare in the Internet, and hence we exclude their occurrence from the model.

• The channel daemon between the measurement point and the receiver does not reorder any packets. A packet reordered after the measurement point will not be out-of-sequence at the measurement point; hence the measurement point cannot detect such a reordering event, and it too is excluded from the model.

• The sender always has data to send, and the retransmission timer of the sender is
not triggered before it has sent all packets allowed by its window. We make these
assumptions to exclude trivial cases in which the measurement point cannot detect packet retransmissions, such as when the last packet of the session is lost before the measurement point and is then retransmitted, or when a packet is lost and retransmitted before the sender could transmit any more packets. Removal of these cases also simplifies the verification by SPIN. Furthermore, we initially assume that there are no losses or reorderings of ACK packets and that a packet is never retransmitted because of a delay in the arrival of its ACK (we refer to such retransmissions as unneeded retransmissions). These last two assumptions allow us to significantly simplify the verification in SPIN, but later we will relax them and verify the correctness of the classification rules when ACKs can be lost or delayed. Our goal is to verify the following two properties of the classification rules using the model checker with the GB(N) protocol: 1. the completeness of the inference heuristics, i.e., whether all retransmitted and reordered packets that reach the measurement point are identified as out-of-sequence packets; 2. the correctness of the classification rules, classify.retx and classify.reord, i.e., that the measurement point never misclassifies an out-of-sequence packet, given the assumptions we make about sender and network behavior. We used the model checker for a GB(N) protocol with N ≤ 4 (see [28] for the Promela code used in this work). The SPIN model checker identified the following cases under which the measurement point could not detect an out-of-sequence packet. Let there be a packet with sequence number x, transmitted by the sender at time t′ and lost before the measurement point. A subsequent retransmission with the same sequence number x, transmitted by the sender at time t″, is not detected to be out-of-sequence at the measurement point when any of the events {E1(x, t″), E2(x, t″), E3(x, t″)} occurs (see Figure 6.4):
Event E1(x, t″): The packet with sequence number x − N + 1 was also lost before the initial transmission of sequence number x was lost. Event E2(x, t″): The packet with sequence number x − N + 1 was reordered in sequence with an earlier-sent packet. Both E1(x, t″) and E2(x, t″) involve a packet with sequence number x lost before the measurement point, and the packet with sequence number x − N + 1 dropped (or reordered) in the path. As mentioned earlier, a GB(N) receiver accepts only those packets which arrive in order. If the packet with sequence number x − N + 1 is lost (E1), none of the subsequent packets that reach the receiver are acknowledged. Also, if x − N + 1 does not arrive in order (E2), it is dropped by the receiver, and this has the same impact as when x − N + 1 is lost in the path. At this point, the sender has N unacknowledged packets in the channel (sequence numbers x − N + 1, ..., x), and the receiver will not send any new acknowledgement, forcing the sender to wait for the timeout to expire. Thus, the retransmission of packet x will not be detected as an out-of-sequence packet by the measurement point, given that no packet with a sequence number larger than x has ever been sent. Event E3(x, t″): All of the packets with sequence numbers y ∈ {x + 1, ..., x + N − 1} that the sender may have sent between the first transmission at t′ and t″ were also dropped. Event E3(x, t″) describes the case when a packet with sequence number x is lost before the measurement point, and all subsequent packets sent with sequence numbers greater than x are also dropped before the measurement point. Again, it is easy to see that, when the retransmission of sequence number x arrives at the monitor, it will not be considered to be out-of-sequence, as shown in Figure 6.4.
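The completeness question above concerns the detection rule itself, which flags a data packet as out-of-sequence whenever its sequence number does not exceed the largest sequence number already observed on the connection. A minimal sketch of this rule (the names are ours; the real monitor also records timestamps and ACKs) makes the E1/E2-style blind spot concrete:

```python
def flag_out_of_sequence(seqs):
    """Yield (seq, is_oos) for a stream of sequence numbers observed
    at the measurement point."""
    highest = None
    for s in seqs:
        oos = highest is not None and s <= highest
        if not oos:
            highest = s  # track the largest sequence number seen so far
        yield s, oos

# Blind spot: the original 3 is lost *before* the monitor, so the monitor
# sees 1, 2, then the retransmitted 3 -- which looks perfectly in-sequence.
print([s for s, oos in flag_out_of_sequence([1, 2, 3]) if oos])     # []
# Had any larger sequence number slipped past first, 3 would be flagged.
print([s for s, oos in flag_out_of_sequence([1, 2, 4, 3]) if oos])  # [3]
```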
6.4.3 ACK losses

So far, we have not considered the case of acknowledgement loss in the channel. We now relax this assumption by adding a channel daemon between the receiver and the sender.
Figure 6.5. Case E4(x, t″), GB(3): loss of packets and ACKs prevents the window from advancing; no new packets are sent after 3, and the retransmission of 3 is not detected as OoS.
Upon the arrival of an ACK, the daemon either lets the packet through or drops it. With this addition, we again submit our model to SPIN. SPIN now exposes a new event for which the measurement point cannot detect a retransmitted packet as being out-of-sequence (Figure 6.5): Event E4(x, t″): A packet with sequence number x, transmitted by the sender at time t′, is lost before the measurement point. For each of the packets with sequence numbers x − N + 1, ..., x − 1, sent prior to t′, one of the following events occurs: • the packet is lost in the path; • the packet reaches the receiver and, if it generates an ACK that cumulatively acknowledges sequence number x − N + 1, this ACK is lost. A subsequent retransmission of x, sent at time t″, is not detected to be out-of-sequence by the measurement point.
The impact of lost ACKs is to prevent the sender's window from freeing up, and this can restrict the transmission of new packets. For the sender's window to move beyond a sequence number x, the sender must receive at least one new ACK for the packets with sequence numbers {x − N + 1, x − N + 2, ..., x − 1} sent prior to an earlier version of the packet with sequence number x. We have already shown that this cannot occur if events E1(x, t″) and E2(x, t″) occur. However, even if these two events do not occur, but all the N − 1 packets sent prior to sequence number x are either lost, or they generate an ACK that acknowledges sequence number x − N + 1 and this ACK is lost, then it is easy to see that
no new ACKs will reach the sender after the initial transmission of x, and its window will not move beyond sequence number x.
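The window-freeze argument behind E4 can be restated with a toy model of cumulative ACKs (entirely our own illustration): the sender's right edge is base + N − 1, and the base advances only when a cumulative ACK covering it arrives.

```python
def right_edge_after(acks_received, x, N):
    """acks_received: cumulative ACK numbers that actually reached the
    sender. Starting from a window whose right edge is x, return the
    largest sequence number the sender may send."""
    base = x - N + 1                 # oldest unacknowledged sequence number
    for a in sorted(acks_received):
        if a > base:
            base = a                 # a cumulative ACK slides the window
    return base + N - 1              # right edge = base + N - 1

N = 3
print(right_edge_after(set(), x=10, N=N))  # 10: all ACKs lost, window frozen at x
print(right_edge_after({9}, x=10, N=N))    # 11: one surviving ACK frees the window
```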
6.4.4 Unneeded Retransmissions
We now remove the assumption that acknowledgements are never delayed enough to force the sender to retransmit a packet that has already reached the receiver. Thus, we now allow the sender to retransmit a packet while the initial transmission of that sequence number (and its corresponding ACK) are still in the channel. In order to verify this modified network model, we need to assume that there are no losses on the acknowledgement path, given that SPIN could not complete the verification (due to state explosion) even for small values of N. We then run the model checker for GB(N) with N = {2, 3} and identify a new event in which a retransmitted packet is not detected by the measurement point (see Figure 6.6). Event E5(x, t″): Let a packet with sequence number x, transmitted by the sender at time t′, be lost before the measurement point. Let y be the sequence number of the first packet sent by the sender after t′ such that y > x. Consider a subsequent retransmission with the same sequence number x, transmitted by the sender at time t″. This retransmitted packet is not detected to be out-of-sequence if it is reordered with sequence number y before the measurement point. As illustrated in Figure 6.6, the packet with sequence number x = 2 will be mistaken by the measurement point for the original transmission and hence not deemed out-of-sequence.
6.4.5 Classifying out-of-sequence packets

We now proceed to the second part of our verification, which is to show that when a measurement point observes an out-of-sequence packet, the decision it makes (whether the packet is a retransmission or a reordering) is consistent with what actually happened in the network. As we have described in the earlier section, the classification rules rely on
Figure 6.6. Case E5(x, t″), GB(3): The packet with sequence number x = 2 is lost before the measurement point. The packet with sequence number y = 4 is the first packet with sequence number > 2 sent by the sender that reaches the measurement point. The retransmission of 2 is reordered with sequence number 4 and is not detected as OoS at the measurement point.
the different timelags of packet retransmissions and reorderings to distinguish between the two (see Section 6.3). The measurement point makes its decision based on the time interval between when it observes an out-of-sequence packet x and when it observes packet x′, where x′ is the first packet with a sequence number greater than x observed at the measurement point. The magnitude of this interval determines whether the OoS packet is a reordering or a retransmission. However, the model checker has no built-in notion of time, and there is no mechanism to quantify the time between any ordered sequence of events. Thus, in order to get around this problem, we impose the property that when a reordering or retransmission occurs, the timelags conform to the assumptions (i.e., the reordering timelag is less than the RTT and the retransmission timelag is greater than the connection's RTO). As a result of this assumption, the model checker reports only one case in which the measurement point's decision does not match the type field carried by the packet. This happens when a retransmitted packet is itself reordered with other packets. In this case the resulting out-of-sequence packet is simply classified to be a retransmission, since the timelag of this packet is still greater than
the RTO of the sender. Note that in this case the packet would belong at the same time to two states (retransmitted and reordered), which the rules in Section 6.3 explicitly do not allow.
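The timelag rule described above can be sketched as a single threshold test (a simplification of the full rules of Section 6.3, which also use other signals; the function name and the "ambiguous" case are ours):

```python
def classify_oos(lag, rtt, rto):
    """Classify an out-of-sequence packet from the lag since the first
    larger-sequence packet was observed. rtt and rto are assumed to be
    estimated elsewhere by the measurement point."""
    if lag < rtt:
        return "reordering"       # classify.reord: lag below the round-trip time
    if lag > rto:
        return "retransmission"   # classify.retx: lag above the retransmission timeout
    return "ambiguous"            # between RTT and RTO, this sketch abstains

print(classify_oos(0.01, rtt=0.08, rto=0.5))  # reordering
print(classify_oos(0.90, rtt=0.08, rto=0.5))  # retransmission
```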
6.5 Generalization to any N

The events described so far were discovered by SPIN for small values of N. The next question is whether these are the only events that result in a retransmitted or reordered packet going undetected at the measurement point, for any value of N in the GB(N) protocol. By reasoning about the behavior of the protocol and our network model, we can prove that these events are the only ones that would cause a retransmitted or reordered packet to pass undetected by the measurement point. Before proceeding, let us define some notation: • A(x, t, t″) is true if some packet with sequence number x, sent by the sender at time t″, arrives at the measurement point at a time t, and this packet is either a retransmission or a reordering. • P(x, t, t″) is true if at some point in time before t the measurement point also observed a packet with a sequence number x′ such that x ≤ x′, i.e., it is true if x is recognized as out-of-sequence by the measurement point. • Q(x, t″) is equal to ¬(E1(x, t″) ∨ E2(x, t″) ∨ E3(x, t″) ∨ E4(x, t″) ∨ E5(x, t″)), i.e., it is true if none of the events E1–E5 has occurred. We can then state the following. Theorem 6.2 Suppose a retransmitted or reordered packet sent with sequence number x at time t″ using the GB(N) protocol, for any N, arrives at the measurement point at time t. If Q(x, t″) is true, then P(x, t, t″) holds; that is,
A(x, t, t″) ∧ Q(x, t″) → P(x, t, t″)   (6.1)
Proof: Our proof is by contradiction. We initially assume that: 1. A(x, t, t″) is true, i.e., a retransmitted or reordered packet reached the measurement point at time t. We shall refer to this packet as p. 2. P(x, t, t″) is not true, i.e., the measurement point does not consider packet p out-of-sequence. 3. Q(x, t″) is true, i.e., none of the events E1–E5 has occurred. We will show that if 1) and 2) hold, then 3) cannot be true, and one of the events E1–E5 must have occurred. If P(x, t, t″) is false, the measurement point has not seen any packet with sequence number x′ such that x ≤ x′. This implies that p has to be a retransmission, since if it had been a reordered packet then a packet with a sequence number greater than x would have appeared at the measurement point before t. Now, the only ways in which the measurement point could fail to observe a packet with sequence number greater than or equal to x between the original transmission of a packet with sequence number x and p are the following: • The sender could not send any packet with a sequence number x′ greater than x since the original transmission of sequence number x. Since we assume the sender always has data to send, this can happen only because the sender has not seen an acknowledgment for any of the N − 1 packets sent previous to x. In order for this to occur, one of the following events must have happened: – the packet with sequence number x″ = x − N + 1 was lost in the channel; – the packet with sequence number x″ = x − N + 1 was reordered with an earlier-sent packet in the channel; – or, for all packets with sequence numbers {x − N + 1, x − N + 2, ..., x − 1}, either the packet is lost, or, if it triggers an ACK that acknowledges sequence number x − N + 1, then this ACK is lost.
Any of these events would result in the sender having N unacknowledged packets in the network, stopping it from sending more packets, and would freeze the advancement of the right edge of the sender window beyond x (as illustrated in Figure 6.4), so no new packets with sequence number > x would be transmitted. However, these events correspond to E1(x, t″), E2(x, t″) and E4(x, t″), which we assume have not occurred. • The sender could potentially send packets with sequence numbers x + 1, x + 2, ..., x + N − 1 before it retransmitted x, and these were all lost before the measurement point could observe them. This corresponds to event E3(x, t″). • One of the packets with sequence numbers x + 1, x + 2, ..., x + N − 1 that the sender could potentially send before retransmitting x made it to the measurement point. Let p′ be the first of these packets. If p′ is reordered in position with p by the channel daemon, then the retransmitted packet p would not be considered to be OoS by the measurement point. This is event E5(x, t″). Hence, we have shown that with GB(N) a retransmitted or reordered packet is not detected at the measurement point only if one of the events E1–E5 occurs, for any value of N. □ To summarize, using SPIN together with formal reasoning we have uncovered all cases in which the measurement point cannot detect packet retransmissions and reorderings as out-of-sequence packets.
6.6 Extensions to GB(N)

We now consider extensions to GB(N) that more closely capture the behavior of TCP. We name this extended protocol FastRetx(N). Just as in GB(N), the FastRetx(N) sender has a fixed window and can have N packets outstanding at any point of time. However, there are two crucial differences. First, the FastRetx(N) receiver has a buffer of
Figure 6.7. Recovery through Fast Retransmit in FastRetx(4), dupacks = 2: the retransmission of 4 is detected as OoS.
size N, and it also accepts and stores packets that do not arrive in order. The receipt of such packets triggers an acknowledgement that specifies only the next expected packet, i.e., ACKs are cumulative. The second difference lies in how the sender detects and recovers from packet loss. Say a packet is dropped by the channel. As subsequent packets arrive at the receiver, they are buffered and trigger duplicate ACKs for the lost packet. When the sender receives a certain threshold (referred to as the dupack threshold) of such duplicate ACKs, it retransmits the packet indicated to be lost by these duplicate ACKs. If this retransmission subsequently arrives at the receiver, the receiver responds with a new ACK for the maximum sequence number it has received in sequence, and the sender continues with its normal transmission. In some cases, however, if several packets are lost within a window, then the required number of duplicate ACKs may not be generated by the receiver, or the ACKs may be lost before reaching the sender. In this case, the sender times out and, just as in GB(N), retransmits the entire window of packets starting from the first unacknowledged packet. A Promela implementation of FastRetx(N) can be found in [28]. We have implemented a model of the FastRetx(N) protocol in SPIN. We verified this model for N = 4 and the duplicate ACK threshold dupack = 2. In our network model we
Figure 6.8. Event E7(x, t″). The packet with sequence number 5 is lost before the measurement point and, due to a successive fast retransmit event, is retransmitted before the sender could send any packets with sequence number > 5. The retransmission of 5 is not detected to be OoS.
allow losses before and after the measurement point, reorderings before the measurement point, and ACK losses between the receiver and the sender. We examined the cases under which retransmissions and reorderings are not detected as out-of-sequence packets at the measurement point; our observations are summarized below: 1. SPIN again uncovered scenarios equivalent to events E3(x, t″) and E5(x, t″) (described for the GB(N) protocol) that result in a retransmitted packet not being detected as out-of-sequence at the measurement point. 2. In GB(N), for any packet with sequence number x, if the first packet in the window with respect to this packet (namely, the one with sequence number x − N + 1) is lost or reordered, then the sender's window does not move beyond sequence number x. Thus, if the packet with sequence number x is lost before the measurement point, then its subsequent retransmission is not detected to be out-of-sequence; this constitutes events E1(x, t″) and E2(x, t″) for the GB(N) protocol. In the FastRetx(N)
protocol, since an out-of-order packet is buffered and eventually acknowledged, we observe that the reordering of sequence number x − N + 1 does not impact the advancement of the sender's window beyond x. Hence, there is no event equivalent to E2(x, t″) in FastRetx(N). Moreover, we found that even if the packet with sequence number x − N + 1 is dropped, if the loss of this packet is detected through the fast retransmit mechanism, the sender's window can still move beyond sequence number x. We explain this with the example illustrated in Figure 6.7. As shown in the figure, the packet with sequence number 1 is dropped, and sequence number 4 is lost before the measurement point. When sequence numbers 2 and 3 arrive at the receiver, they trigger duplicate ACKs for sequence number 1. This results in a fast retransmit of sequence number 1, and subsequently a new ACK for 4. The sender now responds with two new packets with sequence numbers 5 and 6 (i.e., its window has now advanced beyond 4). When these packets are received, since 4 has still not been received, they again trigger duplicate ACKs, and sequence number 4 is finally retransmitted. However, 4 is now detected to be out-of-sequence (because the measurement point has seen sequence numbers > 4), unlike in GB(N). To summarize: suppose a packet with sequence number x, sent at time t′, is lost before the measurement point, and the packet with sequence number x − N + 1 is also dropped. Only if sequence number x − N + 1 is not successfully retransmitted (and acknowledged) using the fast retransmit mechanism will the sender's window fail to move beyond sequence number x; when this happens, the retransmission of x will not be detected as out-of-sequence at the measurement point. We shall refer to this event as E6(x, t″).
This is a consequence of the successive fast retransmit event [20] that occurs as a result of the receiver sending multiple duplicate ACKs upon receiving packets that it has already buffered. As illustrated in Figure 6.8, sequence number 1 is lost, and retransmitted by the sender after receiving duplicate ACKs generated by 91
the arrival of sequence numbers 2, 3 and 4 at the receiver. However, the retransmitted packet 1 is dropped again, triggering a timeout event and a subsequent retransmission of all sequence numbers 1–4 by the sender. Upon receiving these packets, the receiver responds with an ACK for sequence number 5 (since it has already received and buffered sequence numbers 2–4). The sender transmits 5 after receiving the first of these new ACKs; however, this packet is dropped before the measurement point. The following three ACKs for 5 are then interpreted as duplicate ACKs by the sender, and it again retransmits 5. This retransmission is not detected to be out-of-sequence by the measurement point, since the sender did not send any new packets after the initial transmission of 5. We refer to this event as E7(x, t″). We observed these events for N = 4 for the FastRetx(N) protocol. We now show that these are the only events that result in a retransmitted or reordered packet not being identified as out-of-sequence at the measurement point, for any value of N. Theorem 6.3 Suppose a retransmitted or reordered packet sent with sequence number x at time t″ using the FastRetx(N) protocol, for any N, arrives at the measurement point at time t. If none of the events {E3(x, t″), E5(x, t″), E6(x, t″), E7(x, t″)} has occurred, then P(x, t, t″) holds. Proof: Once again, our proof is by contradiction: we initially assume that none of the events {E3(x, t″), E5(x, t″), E6(x, t″), E7(x, t″)} has occurred, and that a retransmitted or reordered packet p has arrived at the measurement point, but P(x, t, t″) is false. We show that this leads to a contradiction: one of the events in the above set must have occurred. If P(x, t, t″) is false, the measurement point has not seen any packet with sequence number x′ such that x ≤ x′.
This implies that p has to be a retransmission, since if it had been a reordered packet then a packet with a sequence number greater than x would have appeared at the measurement point before t. Now, the only ways in which the measurement point could fail to observe a packet with sequence number greater than or equal to x between the
original transmission of a packet with sequence number x and this retransmission p are the following: • The sender could not send any packets with a sequence number x′ greater than x since the original transmission of sequence number x. Since we assume the sender always has data to send, this can happen only if the sender has not seen an acknowledgment for any of the N − 1 packets sent previous to x. In order for this to occur, one of the following events must have occurred: – the first packet of the window with respect to sequence number x, i.e., the packet with sequence number x″ = x − N + 1, was lost in the channel, and the sender could not recover from the loss using the fast retransmit mechanism (if sequence number x − N + 1 had been successfully retransmitted using the fast retransmit mechanism, the sender's window would have moved beyond x before x was retransmitted); – sequence number x − N + 1 was not lost, but the sum of the number of ACK losses (duplicate or new) and packet losses, amongst packets with sequence numbers {x − N + 1, x − N + 2, ..., x − 1}, equals N − 1. This would result in no new ACKs reaching the sender before it retransmits x. Either of these events would freeze the advancement of the right edge of the sender window beyond x, and no new packets with sequence number greater than x would be transmitted. However, these events correspond to E5(x, t″) and E6(x, t″), which we assume have not occurred. • The sender could potentially send packets with sequence numbers x + 1, x + 2, ..., x + N − 1 before it retransmitted x, and these were all lost before the measurement point could observe them; this corresponds to event E3(x, t″).
• The sender retransmitted the packet with sequence number x before it could transmit any packet with sequence number > x. Because of Assumption 3 (stated in Section 6.4), this retransmission of x cannot be triggered by a timeout, since the sender always has data to send and must be able to send new packets in the period between the initial transmission of x and the triggering of the retransmission timer. Hence, x must have been retransmitted due to the arrival of duplicate ACKs. Also, the duplicate ACKs must have been triggered by packets sent prior to the initial transmission of x, since the sender did not send any new packets before it retransmitted x. If packets sent prior to sequence number x trigger ACKs for x at the receiver, this must be because the sequence numbers of these packets have already been received, and the ACKs for x are an indication that the receiver expects to see a new sequence number x. This is a case of successive fast retransmit, and the retransmission of x triggered by this event will not be detected as out-of-sequence. This corresponds to event E7(x, t″). Hence, we have shown that a retransmitted or reordered packet is not detected at the measurement point only if one of the above-mentioned events occurs, for any value of N in the FastRetx(N) protocol. □
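The duplicate-ACK counter that drives fast retransmit (and the successive fast retransmit behind E7) can be sketched as follows; this is a toy sender-side model under our assumptions, not the Promela specification:

```python
DUPACK_THRESHOLD = 2  # dupack = 2, as in the verified model

def fast_retransmits(acks):
    """acks: cumulative ACK numbers in arrival order at the sender.
    Return the sequence numbers retransmitted via fast retransmit."""
    retx = []
    last_ack, dup_count = None, 0
    for a in acks:
        if a == last_ack:
            dup_count += 1
            if dup_count == DUPACK_THRESHOLD:
                retx.append(a)           # retransmit the requested packet
        else:
            last_ack, dup_count = a, 0   # a new ACK resets the counter
    return retx

# Figure 6.7: packet 1 is lost; arrivals of 2 and 3 each echo an ACK for 1
print(fast_retransmits([1, 1, 1]))  # [1]
# E7-style: duplicate ACKs for an already-delivered packet also trip the rule
print(fast_retransmits([5, 5, 5]))  # [5]
```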
6.7 Implications for TCP
We have verified the classification rules with the end-points implementing simple protocols such as GB(N) and FastRetx(N). We now discuss the implications of our observations when the end-points implement the TCP protocol. We focus on FastRetx(N), since by design it more closely follows TCP behavior, and apply what we have learned from the verification of the classification rules for this protocol to two flavors of TCP, namely TCP Reno and NewReno. We first note that with TCP Reno, when events E3(x, t″), E4(x, t″), E5(x, t″), E6(x, t″) occur, a retransmitted packet will not be detected by the measurement point. Figure 6.9
Figure 6.9. Undetected retransmissions in TCP Reno
illustrates these cases, except for E7(x, t″), for TCP Reno (the sequence of events leading to E7 is intricate, and we exclude the example due to space limitations; it is presented in the extended version of this paper). We also observe that each of these events involves a timeout event at the sender. This implies that the sender behavior described in these cases is independent of the TCP flavor, since all TCP congestion control flavors behave in exactly the same manner subsequent to a timeout event. Thus, the occurrence of the events E3(x, t″), E4(x, t″), E5(x, t″), E6(x, t″) would lead to the non-detection of a retransmitted packet even for the TCP NewReno protocol. If the TCP end-points use the SACK option, then by examining the SACK blocks sent by the receiver, a TCP sender can infer exactly which packets are lost in the channel. A TCP sender that uses the SACK option will never misinterpret duplicate ACKs (for a new packet) as indicators of packet loss, and hence can never undergo a successive fast retransmit. Thus, event E7(x, t″) cannot occur if the end-points use TCP SACK. A TCP sender has a dynamically changing window (namely, the additive-increase, multiplicative-decrease congestion avoidance scheme). This is the primary difference that distinguishes a TCP sender from a FastRetx(N) sender. However, as we have proven in Sections 6.5 and 6.6 about the behavior of the GB(N) and FastRetx(N) protocols, the value of the sender window has no impact on the measurement point's ability to detect packet re-
95
transmissions and reorderings. Thus we believe, given that F astRetx(N ) captures TCP’s loss recovery mechanism, events E3(x, t0 ), E4(x, t00 ), E5(x, t00 ), E6(x, t00 ), (and E7(x, t00 ) if the end-points do not use SACK) are the only events for which the classification rules do not detect retransmitted or reordered packets. In future work, we plan to implement the differences between F astRetx(N ) and TCP protocols, and carry out a formal verification of TCP to confirm this conjecture.
6.8 Inferring all retransmission events
Through a combination of model checking and formal reasoning about protocol behavior, we have unearthed a set of events (E1 - E5 for GB(N), and E3 - E7 for the FastRetx(N) protocol) in which a retransmitted packet is not detected to be out-of-sequence at the measurement point. We now discuss whether some or all of these missed retransmissions can in fact be detected by the measurement point using additional information about the behavior of these protocols. As an example, we first describe how the measurement point can use knowledge of the sender's window to infer some of the previously missed retransmission events of GB(N) and FastRetx(N) senders. As mentioned before, GB(N) and FastRetx(N) senders have a constant window of size N. Also, since we assume the sender always has data to send (and transmits a packet as soon as its window allows), the sender must have N unacknowledged packets whenever the retransmission timer goes off. Hence, a timeout event will trigger the retransmission of exactly N packets in the GB(N) and FastRetx(N) protocols. This characteristic of the GB(N) and FastRetx(N) protocols can be used to aid the detection of retransmitted packets. When the measurement point detects the first retransmitted packet (i.e., the first packet classified as a retransmission after an in-sequence packet), it knows this packet must be followed by N − 1 additional retransmissions. Thus, even if any of the subsequent N − 1 packets are not out-of-sequence, it can correctly identify these packets as retransmissions. Using this additional information, the measurement point can detect retransmitted packets that were not deemed out-of-sequence in events {E1(x, t′′), E3(x, t′′), E4(x, t′′)} in the GB(N) protocol and events {E3(x, t′′), E4(x, t′′), E6(x, t′′)} in the FastRetx(N) protocol. Obviously, if all N packets are lost before the measurement point and subsequently retransmitted, this additional information is of no use, since even the first of these retransmitted packets cannot be detected as an out-of-sequence packet. Barring this special case, all retransmitted packets in E2(x, t′′) can then also be detected in both the GB(N) and FastRetx(N) protocols. Apart from knowledge of the sender window, the relative arrival times at the measurement point of packets within the same window can also be used to detect missed retransmissions. In event E5(x, t′′) (in both the GB(N) and FastRetx(N) protocols) a retransmitted packet is not detected to be out-of-sequence, since it is reordered in sequence with a packet with a sequence number greater than x. In other words, at the measurement point the retransmitted packet with sequence number x appears in sequence after the packet with sequence number x − 1. However, in this case the interval between the arrivals of the packets with sequence numbers x − 1 and x will be greater than the RTT of the connection. Since it is not possible for two packets in the same window to be separated by such an interval, the measurement point can infer that the packet with sequence number x is a retransmission. TCP differs from GB(N) and FastRetx(N) in that the sender has a dynamic window. However, if we assume the TCP sender is greedy and that the measurement point's estimate of the sender's window is accurate, a similar principle can be applied to TCP to detect packet retransmissions that would otherwise not be detected as out-of-sequence. At present, our techniques rely on only one (forward) pass over the packets of the TCP flow.
However, if the measurement point detects that the sender's behavior is not in accordance with the measurement point's estimate of the sender window, and if it could backtrack in time to account for this difference with an alternate sequence of events, then it could potentially infer better estimates of the TCP sender's variables and more accurately track the state of the TCP sender and events along the end-end path. One example is when an entire window of packets is lost before the measurement point. Since the measurement point does not detect this loss event, it will not reduce its estimate of the sender's window, resulting in an over-estimate of the sender's cwnd. However, this would manifest itself in the sender not behaving greedily with respect to the measurement point's estimate of the window. Upon detecting this, if the measurement point could backtrack on its decision and reset the value of the sender's window, it could both detect the loss event and obtain an accurate estimate of the sender's window. Another example is when the measurement point erroneously classifies a reordered packet as a retransmission, and as a result cuts down its estimate of the sender's cwnd. This error would again become apparent when the sender transmits more packets than predicted by the measurement point's estimate of its window. Once again, if the measurement point could backtrack and take an alternate course of action, it could rectify the error in the classification decision.
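The window-based and timing-based rules of this section can be sketched as follows (an illustrative Python sketch, not the thesis implementation; N is the fixed sender window of the GB(N)/FastRetx(N) protocols, and rtt is the connection's estimated round-trip time):

```python
def classify_with_window(packets, N, rtt):
    """Augment out-of-sequence classification with the fixed-window rule:
    after the first detected retransmission, the next N-1 packets of the
    window must also be retransmissions, even if they arrive in sequence.
    `packets` is a list of (seqno, arrival_time) pairs as observed at the
    measurement point of a greedy sender."""
    labels = []
    pending_retx = 0     # retransmissions still expected in this window
    max_seq = None       # highest sequence number seen so far
    last_time = None     # arrival time of the previous packet
    for seq, t in packets:
        if pending_retx > 0:
            # remainder of a timeout burst: retransmission even if in sequence
            labels.append("retransmission")
            pending_retx -= 1
        elif max_seq is not None and seq <= max_seq:
            # out-of-sequence: first retransmission of the window;
            # exactly N-1 further retransmissions must follow
            labels.append("retransmission")
            pending_retx = N - 1
        elif last_time is not None and t - last_time > rtt:
            # two packets of the same window cannot be separated by more
            # than one RTT, so this in-sequence packet is a retransmission
            # (the E5 case)
            labels.append("retransmission")
        else:
            labels.append("in-sequence")
        max_seq = seq if max_seq is None else max(max_seq, seq)
        last_time = t
    return labels
```

For example, with N = 3 a lost window followed by its timeout-driven retransmission yields three in-sequence labels and then three retransmission labels, even though the last two retransmitted packets arrive in sequence.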
6.9 Conclusions
In this chapter, we have proposed a novel approach, combining model checking techniques and formal reasoning, to verify our techniques for detecting and classifying packet retransmissions and reorderings by a passive monitor in an end-end path. Using the SPIN model checker and by reasoning about network and protocol behavior, we unearth all possible cases in which such packets go undetected at the passive monitor. A natural next step for this work would be to use the insights gained through the formal verification of the inference rules to design new techniques that can capture all packet retransmissions and reorderings. One possible approach would be to derive inferences based on the most likely set of events in the network, using ideas from the body of work on passive testing [33, 32] of protocols. While we have considered a specific passive-measurement-based inference problem here, we believe our approach can be applied to other measurement scenarios with similar challenges and constraints.
CHAPTER 7 CONCLUSIONS AND FUTURE WORK
In this thesis we have presented an approach to infer performance measures of the end-end path of a TCP connection, and to track critical variables of a TCP sender, using passively collected measurements from a single point of observation in the end-end path. The approach is important in that it allows inferences to be made about a large number of end-end paths and end hosts, and does so with a minimal amount of instrumentation of, and control over, the existing infrastructure. Inferences made using our approach can be used to diagnose the cause and location of performance problems in the end-end path. We have focused on TCP connections because TCP is by far the dominant protocol currently deployed in the Internet, and because the congestion and flow control mechanisms in TCP allow us to detect, distinguish and understand the properties of the end-end path and the factors affecting sender behavior. Our techniques can infer end-end path performance measures (extent of packet loss and reordering, round-trip delay), characteristics of the TCP sender (the sender window, congestion control flavor) and the factors that affect the rate of the TCP sender (limited by the application, end-host buffers, congestion, etc.). We have empirically validated and then applied our techniques on traces collected from a large Tier-1 backbone network. Based on these measurements we have made the following observations: • About 4% of packets generated by TCP connections are out-of-sequence, most of which are due to retransmissions in response to packet loss;
• Packet reordering affects about 1-1.5% of all data packets, but has little impact on the quality of the connection as perceived by the end users; • Other network anomalies, such as packet duplication, occur rarely in the Internet; • The sender's throughput is often limited by a lack of data to send, rather than by network congestion; • For the few cases in which the TCP flavor is distinguishable, we find that NewReno is the dominant congestion control algorithm implemented; • Connections experience considerable RTT variation over their lifetime. The ratio between the 95th and 5th percentiles of the RTT is less than 3 for approximately 80-85% of the connections. Finally, we propose a novel approach, combining model checking techniques and formal reasoning, to verify the correctness of our proposed inference rules. We believe our formal-verification-based approach has application in checking the correctness of other inference techniques in measurement scenarios with similar challenges and constraints.
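The RTT-variability measure used above (the ratio of the 95th to the 5th percentile of a connection's RTT samples) can be computed per connection as follows (a minimal Python sketch using a simple nearest-rank percentile; the sample values in the usage note are invented):

```python
def percentile(samples, p):
    """Nearest-rank p-th percentile of a non-empty list of samples."""
    s = sorted(samples)
    k = min(len(s) - 1, int(p * len(s) / 100))
    return s[k]

def rtt_variation_ratio(rtt_samples):
    """Ratio of the 95th to the 5th percentile of a connection's RTT
    samples; a ratio near 1 indicates a stable RTT."""
    return percentile(rtt_samples, 95) / percentile(rtt_samples, 5)
```

For instance, a connection whose RTT samples are mostly 100 ms with occasional 250 ms spikes has a ratio of 2.5, and would fall within the "less than 3" population reported above.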
7.1 Future directions
In our opinion, this thesis represents a fundamental step towards addressing a wide range of research questions concerning the nature of Internet traffic and the properties of phenomena within the Internet. We now outline some possible research directions and unanswered questions that have emerged from this work: • Identifying specific internal network elements with performance problems: In our work we have developed passive measurement techniques to estimate end-end loss and delay. The next step would be to develop techniques that infer the performance characteristics of individual path segments along the end-end path. One approach would be to correlate the performance of TCP connections with shared segments along their end-end paths (using ideas pioneered in the MINC [15] project) to identify the location of packet losses. This would also help answer another important question: whether connections experience congestion at one or at multiple points in the network. • Identifying the cause behind losses: Is a loss within a network due to congestion, or due to routing changes, link failures and other misconfigurations? We believe congestive loss is indicated by some packets getting through even when loss is high. Thus, if we were to see a series of packets all to the same destination with most getting lost and only a few going through, we would still declare it to be congestive loss. On the other hand, if we observe an outage of significant length, in which every packet going towards a destination is lost, this indicates a non-congestive loss event such as a routing failure or a down link. To apply this idea, we could first identify, by combining information from multiple measurement points, packet losses that occur within a network. We could then characterize the losses into the two categories described above, and further validate our conjecture by correlating loss events with link outages and route-change events within the network. • Passive testing for more accurate inferences: As described earlier, in order to track the values of the TCP sender's variables, the measurement point creates a replica of the sender's finite state machine, based on the arrival of data packets, their corresponding ACKs, and its inference of loss events. The goal of the measurement point is for the transitions of its replica state machine to be in lock step with those of the sender's state machine. However, as we have discussed earlier, this is not always possible: events in the network can cause the sender's state machine and the measurement point's replica to fall out of sync with each other, leading to incorrect inferences of the sender's variables.
In order to correct these errors, a prospective direction would be to adapt ideas from the body of work on passive testing [33, 32] of protocols. Passive testing is the process of detecting faults, errors, or the current state of a system under test (such as a network protocol) by passively observing its input/output behavior. Using this approach, one could test the accuracy of the inferred values of TCP sender variables, and develop algorithms to determine the most likely current state of a TCP sender, or the most likely values of its variables. • Online implementation of inference techniques: The techniques described in this thesis have been implemented to process input traces off-line. An interesting problem would be to explore whether these techniques can be adapted to process input data in a streaming fashion, at line rate. The first challenge would be to evaluate whether any implementation of our techniques would be fast enough to keep up with both the average line rate and any intermittent traffic bursts at a higher rate. In general, we have found that the current (off-line) implementation takes less time to process a trace than the actual duration of the trace. For example, a two-hour trace collected over a Gigabit Ethernet link (with an average of 2 million packets observed per second) was processed in under an hour on a PC with a 2 GHz processor and 1 GB of RAM. Another challenge would be the memory requirements of such an online implementation. Recall that, in order to make its inferences, the measurement point recreates the state machine of the TCP sender of each active TCP flow. The number of such active flows can be on the order of tens of thousands, depending on the nature of the monitored link (e.g., on a Gigabit Ethernet link, we observed approximately 100,000 active flows at all times). An interesting question is how to minimize the amount of state required for each such active flow, and how to maximize the efficiency with which this state is managed.
It would also be of interest in this scenario to study the trade-off between implementing modifications of our techniques that require fewer resources and the resulting accuracy of their estimates.
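One way the per-flow memory concern might be prototyped is with a bounded flow table that evicts the least-recently-updated flow when a capacity limit is reached (an illustrative Python sketch; the capacity, key format and per-flow fields are assumptions for illustration, not the implementation used in this thesis):

```python
from collections import OrderedDict

class FlowTable:
    """Bounded table of per-flow sender-replica state, keyed by the TCP
    4-tuple. When the table is full, the least-recently-updated flow is
    evicted to cap the memory used by the online measurement point."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.flows = OrderedDict()   # key -> per-flow state dict

    def update(self, key):
        """Return the state for `key`, creating (and possibly evicting
        another flow) if it is new; marks the flow most-recently used."""
        state = self.flows.pop(key, None)
        if state is None:
            # hypothetical per-flow fields; the real replica tracks more
            state = {"cwnd": 1, "ssthresh": None, "state": "DEFAULT"}
            if len(self.flows) >= self.capacity:
                self.flows.popitem(last=False)   # evict LRU flow
        self.flows[key] = state                  # re-insert as most recent
        return state
```

With roughly 100,000 concurrent flows, even a few hundred bytes of state per flow keeps the table in the tens of megabytes, which suggests the approach is feasible memory-wise; the harder question is the eviction policy's effect on inference accuracy.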
APPENDIX A PSEUDO-CODE FOR INFERRING TCP VARIABLES AND CONGESTION CONTROL FLAVOR
procedure updateCwnd

Inputs:
  Argument      Input type                 Remarks
  packetType    RTO | Data ACK | Dup ACK   packet types which cause this procedure to be invoked
  num_dupacks   Integer                    number of duplicate ACKs
  ack_seqno     Sequence number            sequence number of last ACKed data packet
  next_seqno    Sequence number            sequence number of the next expected data packet
  ocwnd         Integer                    value of cwnd before it deflates, used by procedure flavorTest
  numACKed      Integer                    number of packets cumulatively acked by a new ACK, used only by NewReno

Variables:
  Variable name  Type                      Remarks
  cwnd           Integer                   sender's congestion window estimate
  ssthresh       Integer                   sender's slow start threshold estimate
  recover        Sequence number           for NewReno, sequence number of last data packet observed before entering fast recovery
  state          DEFAULT                   Tahoe states
                 DEFAULT | FAST_RECOVERY   Reno and NewReno states
  MSS            Bytes                     global variable, Maximum Segment Size for this connection
  awnd           Integer                   another global variable, the receiver's advertised window

Outputs:
  cwnd, ssthresh, ocwnd    updated values of cwnd, ssthresh and ocwnd

Input Format (for Tahoe, Reno): <packetType, num_dupacks, ack_seqno, next_seqno, ocwnd>
Input Format (for NewReno):     <packetType, num_dupacks, ack_seqno, next_seqno, ocwnd, numACKed>
Figure A.1. Variables required to keep track of the cwnd of the three TCP flavors
procedure incr_cwnd
Inputs: cwnd, ssthresh
Outputs: cwnd

if (cwnd < ssthresh)
    cwnd++;
else
    cwnd += 1/cwnd;
Figure A.2. Function used to update cwnd during slow-start and congestion avoidance
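The update rule of Figure A.2 translates directly into executable form (an illustrative Python transcription; as in the pseudocode, cwnd and ssthresh are measured in segments):

```python
def incr_cwnd(cwnd, ssthresh):
    """Grow cwnd by one segment per ACK during slow start, and by
    1/cwnd per ACK (about one segment per RTT) during congestion
    avoidance, mirroring Figure A.2."""
    if cwnd < ssthresh:
        return cwnd + 1          # slow start: exponential growth
    return cwnd + 1.0 / cwnd     # congestion avoidance: additive increase
```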
State     Input                        Action
DEFAULT   New ACK                      incr_cwnd(cwnd, ssthresh);
DEFAULT   Dup ACK (num_dupacks == 3)   ssthresh = max(min(awnd, cwnd)/2, 2);
                                       if (flightsize < min(cwnd, awnd)) ocwnd = min(cwnd, awnd);
                                       else ocwnd = 0;
                                       cwnd = 1;
DEFAULT   RTO                          ssthresh = max(min(awnd, cwnd)/2, 2);
                                       cwnd = 1; ocwnd = 0;
Figure A.3. Pseudocode to track congestion window for Tahoe
State           Input                        Action
DEFAULT         New ACK                      incr_cwnd(cwnd, ssthresh); cmp_ocwnd(cwnd, ocwnd);
DEFAULT         Dup ACK (num_dupacks == 3)   ssthresh = max(min(awnd, cwnd)/2, 2);
                                             if (flightsize < min(cwnd, awnd)) ocwnd = min(cwnd, awnd);
                                             else ocwnd = 0;
                                             cwnd = cwnd/2 + 3;
                                             state = FAST_RECOVERY;
DEFAULT         RTO                          ssthresh = max(min(awnd, cwnd)/2, 2);
                                             cwnd = 1; ocwnd = 0;
FAST_RECOVERY   New ACK                      if (flightsize < min(cwnd, awnd) and ocwnd == 0) ocwnd = min(cwnd, awnd);
                                             else ocwnd = 0;
                                             cwnd = ssthresh; num_dupacks = 0; state = DEFAULT;
FAST_RECOVERY   Dup ACK                      cwnd++; cmp_ocwnd(cwnd, ocwnd);
FAST_RECOVERY   RTO                          ssthresh = max(min(awnd, cwnd)/2, 2);
                                             cwnd = 1; ocwnd = 0; num_dupacks = 0; state = DEFAULT;
Figure A.4. Pseudocode to track congestion window for Reno
State           Input                        Action
DEFAULT         New ACK                      incr_cwnd(cwnd, ssthresh); cmp_ocwnd(cwnd, ocwnd);
DEFAULT         Dup ACK (num_dupacks == 3)   ssthresh = max(min(awnd, cwnd)/2, 2);
                                             if (flightsize < min(cwnd, awnd)) ocwnd = min(cwnd, awnd);
                                             cwnd = cwnd/2 + 3;
                                             recover = next_seqno − 1;
                                             state = FAST_RECOVERY;
DEFAULT         RTO                          ssthresh = max(min(awnd, cwnd)/2, 2);
                                             cwnd = 1; ocwnd = 0;
FAST_RECOVERY   New ACK                      if (flightsize < min(cwnd, awnd)) ocwnd = min(cwnd, awnd);
                                             else ocwnd = 0;
                                             num_dupacks = 0;
                                             if (ack_seqno > recover)
                                                 state = DEFAULT; recover = 0; cwnd = ssthresh;
                                             else
                                                 cwnd = cwnd − numACKed + 1;
FAST_RECOVERY   Dup ACK                      cwnd++; cmp_ocwnd(cwnd, ocwnd);
FAST_RECOVERY   RTO                          ssthresh = max(min(awnd, cwnd)/2, 2);
                                             cwnd = 1; ocwnd = 0; recover = 0;
                                             num_dupacks = 0; state = DEFAULT;
Figure A.5. Pseudocode to track congestion window of NewReno
procedure cmp_ocwnd
Inputs: cwnd, ocwnd
Outputs: ocwnd

if (cwnd > ocwnd)
    ocwnd = 0;
Figure A.6. Function used to reset ocwnd
procedure flavorTest

Inputs:
  Argument         Input type                Remarks
  flavor           TAHOE | RENO | NEWRENO    flavor being tested
  cwnd             Integer                   estimate of cwnd for this flavor
  ocwnd            Integer                   value of cwnd before it deflated
  num_dupacks      Integer                   number of duplicate ACKs
  ack_seqno        Sequence number           last sequence number ACKed
  cur_seqno        Sequence number           sequence number of current data packet
  flightsize       Integer                   the number of packets currently unacknowledged
  inFastrecovery   Integer                   flag which denotes if the state machine for this flavor is in the fast recovery state

Variables:
  Name      Type      Remarks
  isFalse   Integer   flag which indicates whether this packet was allowed by this flavor
  MSS       Bytes     Maximum Segment Size for this connection

Outputs: isFalse

Input Format: <flavor, cwnd, ocwnd, num_dupacks, ack_seqno, cur_seqno, flightsize, inFastrecovery>

isFalse = 0;
if (flightsize > max(cwnd, ocwnd))
    if (cur_seqno > ack_seqno)
        isFalse = 1;
    if (cur_seqno == ack_seqno)
        if (flavor != NEWRENO)
            if (num_dupacks < 3)
                isFalse = 1;
        elsif (inFastrecovery != 1)
            isFalse = 1;
Figure A.7. Testing TCP flavors
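The test of Figure A.7 can be transcribed as follows (an illustrative Python transcription; flavor names are passed as strings, and the return value corresponds to isFalse, i.e. True means the observed packet could not have been sent under this flavor's window estimate):

```python
def flavor_test(flavor, cwnd, ocwnd, num_dupacks, ack_seqno, cur_seqno,
                flightsize, in_fastrecovery):
    """Return True if the observed data packet is inconsistent with the
    window estimate maintained for `flavor` (isFalse in Figure A.7).
    cwnd, ocwnd and flightsize are in segments."""
    is_false = False
    if flightsize > max(cwnd, ocwnd):
        if cur_seqno > ack_seqno:
            # new data beyond what the estimated window permits
            is_false = True
        elif cur_seqno == ack_seqno:
            # an apparent retransmission of the last-ACKed packet
            if flavor != "NEWRENO":
                if num_dupacks < 3:
                    is_false = True
            elif not in_fastrecovery:
                is_false = True
    return is_false
```

Running the test for each candidate flavor against every observed data packet eliminates the flavors whose window estimates the sender's behavior contradicts.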
procedure rttEstimate

Inputs:
  Argument      Input type                       Remarks
  packetType    New Data | Retx Data | Dup ACK   packet types which cause this procedure to be invoked
  num_dupacks   Integer                          number of duplicate ACKs
  uwnd          Integer                          current estimate of the sender's usable window, i.e. min(awnd, cwnd)
  time          time                             the current time
  cur_seqno     Sequence number                  sequence number of a New Data packet
  ack_seqno     Sequence number                  most recent ACK

Variables:
  Name          Type               Remarks
  state         DEFAULT | FROZEN   whether RTT estimation is currently frozen or on
  startTime     time               time at which this estimate for the RTT started
  startSeqno    Sequence number    sequence number of data packet that starts the RTT sample
  sampleSeqno   Sequence number    sequence number of data packet that completes the RTT sample
  MSS           Bytes              Maximum Segment Size for this connection

Outputs:
  rtt   current RTT sample

State     Input       Action
FROZEN    New Data    startSeqno = cur_seqno; sampleSeqno = NOT_SET;
                      startTime = time; state = DEFAULT;
DEFAULT   New Data    if (ack_seqno > startSeqno)
                          sampleSeqno = startSeqno + uwnd * MSS;
DEFAULT   Retx Data   state = FROZEN;
DEFAULT   Dup ACK     state = FROZEN;
DEFAULT   New Data    if (cur_seqno == sampleSeqno)
                          rtt = time − startTime;
                          startSeqno = cur_seqno; sampleSeqno = NOT_SET;
                          startTime = time;
Table A.1. Pseudo-code describing the RTT estimation scheme
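The RTT estimation scheme of Table A.1 can be sketched as a small state machine (an illustrative Python transcription; the method names, and the treatment of FROZEN as discarding the current sample, are assumptions drawn from the table rather than the thesis code):

```python
class RttEstimator:
    """Running RTT sampler per Table A.1: time the gap between a data
    packet and the packet one usable window ahead of it, which its ACK
    clocks out. Retransmissions and duplicate ACKs freeze (discard)
    the sample in progress."""
    NOT_SET = None

    def __init__(self, mss):
        self.mss = mss
        self.state = "DEFAULT"
        self.start_seqno = self.NOT_SET
        self.sample_seqno = self.NOT_SET
        self.start_time = None
        self.rtt = None          # most recent completed RTT sample

    def on_new_data(self, cur_seqno, now):
        if self.state == "FROZEN" or self.start_seqno is self.NOT_SET:
            self._restart(cur_seqno, now)        # (re)start a sample
        elif cur_seqno == self.sample_seqno:
            self.rtt = now - self.start_time     # sample completed
            self._restart(cur_seqno, now)

    def on_ack(self, ack_seqno, uwnd):
        # once the starting packet is ACKed, the packet uwnd*MSS bytes
        # ahead of it closes this RTT sample
        if (self.state == "DEFAULT"
                and self.start_seqno is not self.NOT_SET
                and self.sample_seqno is self.NOT_SET
                and ack_seqno > self.start_seqno):
            self.sample_seqno = self.start_seqno + uwnd * self.mss

    def on_retx_or_dupack(self):
        self.state = "FROZEN"    # discard the sample in progress

    def _restart(self, cur_seqno, now):
        self.state = "DEFAULT"
        self.start_seqno = cur_seqno
        self.sample_seqno = self.NOT_SET
        self.start_time = now
```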
APPENDIX B PROMELA CODE FOR GB(N) PROTOCOL

#define WND      3   /* value of sender window */
#define MAXSEQ   7   /* maximum sequence number value used by sender */
#define SEQRANGE 3   /* range of sequence numbers in bits */
#define WNDRANGE 2   /* range of window values in bits */

/* modulo increment of sequence numbers */
#define inc(x) x = (x + 1) % (MAXSEQ)

/* packet types */
mtype = {data, retx, reord};

/* struct that defines fields carried by data packets traversing the channels */
typedef mdataPkt {
  byte seqno;    /* seqno */
  mtype type;    /* one of the 3 types defined above */
  byte tstamp;   /* timestamp carried by data packet */
};

typedef ackPkt {
  byte ackno;    /* the acks carry only sequence numbers */
};

/* sender to channel daemon 1 */
chan qsc1 = [WND] of {mdataPkt};
/* daemon 1 to measurement point channel */
chan qc1m = [WND] of {mdataPkt};
/* measurement point to receiver channel */
chan qmr = [WND] of {mdataPkt};
/* receiver to sender channel */
chan qrs = [WND] of {ackPkt};
/*
 * inline function to check if "curseq" is smaller
 * than "lastseq" in modulo space;
 * flag ism is set to 1 if it is.
 */
inline isSmaller(curseq, lastseq, ism, i) {
  ism = 1;
  i = 1;
  do
  :: (i <= WND) ->
     if
     :: (curseq == ((lastseq + i) % (MAXSEQ))) -> ism = 0; break;
     :: (curseq != ((lastseq + i) % (MAXSEQ))) -> i++;
     fi;
  :: else -> break;
  od;
}

inline windowBefore(curseq, seq) {
  if
  :: (seq >= WND) -> curseq = seq - WND;
  :: else -> curseq = MAXSEQ - (WND - seq);
  fi;
}
active proctype sender() {
  unsigned NextFrame:SEQRANGE; /* next seq number to send */
  unsigned AckExp:SEQRANGE;    /* the current ACK expected by sender,
                                  marks beginning of sender's window */
  unsigned snd_nbuf:WNDRANGE;  /* # of used buffers */
  unsigned s:SEQRANGE;
  xr qrs;
  xs qsc1;

  do
  /* if pkts inflight < WND, send new pkts */
  :: atomic {
       snd_nbuf < WND ->
       /* send packet with new seq. number, type "data" and timestamp 0 */
       qsc1!NextFrame,data,0;
       snd_nbuf++;
       inc(NextFrame);
     }
  /* recv new ACK: slide window up to the cumulatively acked seqno s */
  :: atomic {
       qrs ? s ->
       do
       :: (AckExp != ((s + 1) % MAXSEQ)) -> snd_nbuf--; inc(AckExp);
       :: else -> break;
       od;
     }
  /* sender retransmits packets upon a timeout */
  :: timeout ->
     atomic {
       NextFrame = AckExp;
       s = 1;
       do
       :: (s <= snd_nbuf) ->
          /* send packet with type "retx" and timestamp 2 */
          qsc1!NextFrame,retx,2;
          inc(NextFrame);
          s++;
       :: else -> break;
       od;
     }
  od;
}
active proctype cdaemon1() { mdataPkt pkt, pkt2; bool firsttime; xr qsc1; xs qc1m;
do :: atomic { /* chk if pkt in channel */ qsc1 ? pkt -> /* either let it go */
do
/* either let it go */
:: qc1m ! pkt; break;
/* or drop it */
:: break;
/* or reorder the packets */
:: (len(qsc1) >= 1) ->
   { qsc1?pkt2;
     if
     :: (pkt.type != retx) -> pkt.type = reord; pkt.tstamp = 1;
     :: else -> skip;
     fi;
     /* send pkts out in reverse order */
     qc1m ! pkt2;
     qc1m ! pkt;
     break;
   }
od;
}
od;
}

active proctype mpoint() {
  unsigned i:SEQRANGE, max:SEQRANGE;
  bool ism, firsttime = 1;
  mdataPkt pkt;
  byte Observed[MAXSEQ];
  xr qc1m;
  xs qmr;
  do
  :: atomic {
       qc1m ? pkt ->
       /* chk if observed packet is OoS */
       if
       :: (firsttime != 1) -> isSmaller(pkt.seqno, max, ism, i);
       :: (firsttime == 1) -> d_step {ism = 0;}
       fi;
       if
       /* is OoS */
       :: (ism == 1) ->
          /* based on the packet's timestamp,
           * confirm if it is reord or retx */
          assert(pkt.type == retx || pkt.type == reord);
          if
          :: (pkt.tstamp - Observed[pkt.seqno] < 2) -> assert(pkt.type == reord);
          :: (pkt.tstamp - Observed[pkt.seqno] >= 2) -> assert(pkt.type == retx);
          fi;
       /* Normal pkt in sequence */
       :: (ism == 0) ->
          assert(pkt.type == data);
          inc(max);
          /* normal packet in sequence, store its tstamp */
          do
          :: (max == pkt.seqno) -> { Observed[pkt.seqno] = pkt.tstamp; break; }
          :: else -> { Observed[pkt.seqno] = pkt.tstamp; inc(max); }
          od;
          max = pkt.seqno;
       fi;
       windowBefore(i, pkt.seqno);
       Observed[i] = 0;
       /*
        * this section of code emulates the channel
        * daemon sitting between the measurement point
        * and the receiver
        */
       do
       /* either let the packet go */
       :: qmr ! pkt; break;
       /* or drop it */
       :: break;
       od;
       if
       :: (firsttime == 0) -> skip;
       :: (firsttime == 1) -> firsttime = 0;
       fi;
     }
  od;
}
active proctype receiver() {
  mdataPkt pkt;
  unsigned FrameExp:SEQRANGE, ack2send:SEQRANGE;
  bool dupack = 0;
  xr qmr;
  xs qrs;
  do
  :: atomic {
       qmr ? pkt ->
       dupack = 0;
       if
       /* if expected pkt, send ack */
       :: pkt.seqno == FrameExp -> ack2send = FrameExp; inc(FrameExp);
       :: else ->
          if
          :: (FrameExp == 0) -> ack2send = (MAXSEQ - 1);
          :: else -> ack2send = FrameExp - 1;
          fi;
          dupack = 1;
       fi;
       /*
        * this section of code emulates the channel
        * daemon sitting between the receiver and the
        * sender.
        */
       do
       /* either let it go */
       :: qrs ! ack2send; break;
       /* or drop it */
       :: break;
       od;
     }
  od;
}
APPENDIX C PROMELA CODE FOR FASTRETX(N) PROTOCOL
#define WND      4   /* value of sender window */
#define MAXSEQ   9   /* maximum sequence number value used by sender */
#define SEQRANGE 4   /* range of sequence numbers in bits */
#define WNDRANGE 3   /* range of window values in bits */

/* modulo increment of sequence numbers */
#define inc(x) x = (x + 1) % (MAXSEQ)

/* packet types */
mtype = {data, oos};

/* struct that defines fields carried by data packets traversing the channels */
typedef mdataPkt {
  byte seqno;   /* seqno */
  mtype type;   /* one of the 2 types defined above */
};

/*
 * in this code we have abstracted the network model further
 * than what is described in the paper. Instead of having separate
 * channel daemons between the sender and the measurement point,
 * the measurement point and the receiver, and the receiver and the
 * sender, we compress the functionality of these daemons into the
 * sender, measurement point and receiver daemons respectively.
 * This consequently results in fewer message channels used, and thus
 * prevents a state explosion from occurring for this model
 */

/* sender to measurement point channel */
chan qsm = [WND] of {mdataPkt};

/* measurement point to receiver channel */
chan qmr = [WND] of {mdataPkt};

/* receiver to sender channel */
chan qrs = [WND] of {byte};
/*
 * inline macro to check if "curseq" is smaller than "lastseq" in modulo
 * space; flag ism is set to 1 if it is.
 */
#define isSmaller(curseq, lastseq, ism) \
  ism = ((curseq != (lastseq + 1) % MAXSEQ) && \
         (curseq != (lastseq + 2) % MAXSEQ) && \
         (curseq != (lastseq + 3) % MAXSEQ) && \
         (curseq != (lastseq + 4) % MAXSEQ))

/* channel daemon between sender and measurement point */
inline cdaemon1(pkt, beginwnd, usedbuf, dupAcks, firsttime, NextFrame, fastretx) {
  do
  /* either let it go */
  :: qsm ! pkt; break;
  /* or drop it */
  :: break;
  /* or reorder packets */
  :: (!fastretx) ->
     atomic {
       if
       :: (usedbuf < WND) ->
          { inc(NextFrame);
            qsm!NextFrame, pkt.type;
            usedbuf++;
            pkt.type = oos;
            qsm!pkt; }
       :: else -> qsm!pkt;
       fi;
     }
     break;
  od;
}

active proctype sender() {
  unsigned NextFrame:SEQRANGE, s:SEQRANGE, lastAck:SEQRANGE,
           snd_nbuf:WNDRANGE, temp:SEQRANGE, sndhigh:SEQRANGE,
           dupAcks:WNDRANGE;
  unsigned AckExp:SEQRANGE;
  mdataPkt pkt;
  bool firsttime = 1, firstack = 1, fastretx = 0, ism;
  xr qrs;
  xs qsm;
  do
  /* if pkts inflight < WND, send new pkts */
  :: atomic {
       snd_nbuf < WND ->
       pkt.seqno = NextFrame;
       pkt.type = data;
       snd_nbuf++;
       cdaemon1(pkt, temp, snd_nbuf, dupAcks, firsttime, NextFrame, 0);
       inc(NextFrame);
     }
  /* recv ACK */
  :: qrs ? s ->
     atomic {
       do
       /* duplicate ACK: count it, fast-retransmit on the 3rd */
       :: (lastAck == s && !firstack) ->
          { isSmaller(NextFrame, ((s + 1) % MAXSEQ), ism);
            dupAcks = (!ism -> (dupAcks + 1) : dupAcks);
            if
            :: (dupAcks == 2) ->
               { pkt.seqno = (s + 1) % MAXSEQ;
                 pkt.type = oos;
                 fastretx = 1;
                 cdaemon1(pkt, temp, snd_nbuf, dupAcks, firsttime, NextFrame, fastretx);
                 fastretx = 0; }
            :: else -> skip;
            fi;
            break; }
       /* new cumulative ACK: slide window up to s */
       :: (lastAck != s || firstack) ->
          d_step {
            do
            :: (AckExp != ((s + 1) % MAXSEQ)) -> snd_nbuf--; inc(AckExp);
            :: else -> break;
            od;
            lastAck = s;
            firstack = 0;
            dupAcks = 0;
          }
          break;
       od;
     }
  /* sender retransmits packets upon a timeout */
  :: timeout ->
     atomic {
       NextFrame = AckExp;
       dupAcks = 0;
       s = 1;
       do
       :: (s <= snd_nbuf) ->
          pkt.seqno = NextFrame;
          pkt.type = oos;
          s++;
          cdaemon1(pkt, temp, s, dupAcks, firsttime, NextFrame, 0);
          inc(NextFrame);
       :: else -> break;
       od;
     }
  od;
}
active proctype mpoint() {
  unsigned max:SEQRANGE;
  bool ism, firsttime = 1;
  mdataPkt pkt;
  xr qsm;
  xs qmr;
  do
  :: atomic {
       qsm ? pkt ->
       /* chk if observed packet is OoS */
       if
       :: (firsttime != 1) -> isSmaller(pkt.seqno, max, ism);
       :: (firsttime == 1) -> {ism = 0;}
       fi;
       if
       /* is OoS */
       :: (ism == 1) -> assert(pkt.type == oos);
       /* Normal pkt in sequence */
       :: (ism == 0) -> assert(pkt.type == data); max = pkt.seqno;
       fi;
       /* send it to receiver */
       do
       /* either let it go */
       :: qmr ! pkt; printf("\nINTC2: send: %d", pkt.seqno); break;
       /* or drop it */
       :: break;
       od;
       firsttime = (firsttime -> 0 : firsttime);
     }
  od;
}
active proctype receiver() {
  unsigned rcvseq:SEQRANGE;
  mtype type;
  unsigned FrameExp:SEQRANGE, i:SEQRANGE, ack2send:SEQRANGE;
  bool InBuffer[MAXSEQ], ism;
  xr qmr;
  xs qrs;
  do
  :: atomic {
       qmr ? rcvseq,type ->
       i = 0;
       if
       /* if expected pkt, send ack; deliver any buffered in-sequence pkts */
       :: (rcvseq == FrameExp) ->
          atomic {
            i = 1;
            do
            :: (i < WND) ->
               { if
                 :: (InBuffer[(FrameExp + i) % MAXSEQ] == 1) ->
                    InBuffer[(FrameExp + i) % MAXSEQ] = 0;
                    i++;
                 :: else -> break;
                 fi; }
            :: else -> break;
            od;
            FrameExp = (FrameExp + (i - 1)) % MAXSEQ;
            ack2send = FrameExp;
            inc(FrameExp);
          }
       /* send duplicate ACK */
       :: else ->
          d_step {
            isSmaller(FrameExp, rcvseq, ism);
            InBuffer[rcvseq] = (ism -> 1 : InBuffer[rcvseq]);
            ack2send = ((FrameExp == 0) -> (MAXSEQ - 1) : (FrameExp - 1));
          }
       fi;
       /* send out ack */
       do
       :: qrs ! ack2send; printf("\nINTC3: send: %d", ack2send); break;
       :: break;
       od;
     }
  od;
}
BIBLIOGRAPHY
[1] http://spinroot.com/spin/whatispin.html#d.
[2] Verification and improvement of the sliding window protocol. Lecture Notes in Computer Science 2619 (2003), 113-127.
[3] Aikat, J., Kaur, J., Smith, F.D., and Jeffay, K. Variability in TCP round-trip times. In Proceedings of the Internet Measurements Workshop (Oct. 2003).
[4] Akella, Aditya, Seshan, Srinivasan, and Shaikh, Anees. An empirical evaluation of wide-area internet bottlenecks. In Proceedings of ACM Internet Measurements Conference (Miami).
[5] Allman, Mark, Floyd, Sally, and Partridge, Craig. Increasing TCP's initial window. RFC 2414, Sept. 1998.
[6] Allman, Mark, Paxson, Vern, and Stevens, W. Richard. TCP congestion control. RFC 2581, Apr. 1999.
[7] Bellardo, John, and Savage, Stefan. Measuring packet reordering. In Proceedings of the Internet Measurements Workshop (San Francisco, Nov. 2002).
[8] Benko, Peter, and Veres, Andras. A passive method for estimating end-to-end TCP packet loss. In Proceedings of IEEE Globecom (Taipei, Nov. 2002).
[9] Bennett, J.C.R., Partridge, C., and Shectman, N. Packet reordering is not pathological network behavior. IEEE/ACM Transactions on Networking 7, 6 (Dec. 1999).
[10] Bertsekas, D., and Gallagher, R. Data Networks, 2nd ed. Prentice Hall, 1991.
[11] Bhargavan, Karthikeyan, Obradovic, Davor, and Gunter, Carl A. Formal verification of standards for distance vector routing protocols. J. ACM 49, 4 (2002), 538-576.
[12] Bolot, J.C. End-to-end packet delay and loss behavior in the internet. In Proceedings of ACM Sigcomm (1993).
[13] Borella, M., Swider, D., Uludag, S., and Brewster, G. Internet packet loss: Measurement and implications for end-to-end QoS. In International Conference on Parallel Processing (Aug. 1998).
[14] Brakmo, Lawrence, and Peterson, Larry. TCP Vegas: End to end congestion avoidance on a global internet. IEEE Journal on Selected Areas in Communications 13, 8 (Oct. 1995), 1465-1480.
[15] Cáceres, R., Duffield, N., Towsley, D., and Horowitz, J. Multicast-based inference of network-internal loss characteristics. IEEE Transactions on Information Theory 45, 7 (Nov. 1999), 2462-2480.
[16] Casetti, C., Gerla, M., Mascolo, S., Sanadidi, M. Y., and Wang, R. TCP Westwood: Bandwidth estimation for enhanced transport over wireless links. In Proceedings of ACM Mobicom (July 2001), pp. 287-297.
[17] Coates, M. J., and Nowak, R. Network loss inference using unicast end-to-end measurement. In Proceedings of ITC Conference on IP Traffic, Modelling and Management (Monterey, 2000).
[18] Coates, M. J., and Nowak, R. Network delay distribution inference using unicast end-to-end measurement. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (Salt Lake City, May 2001).
[19] Dong, Yifei, Du, Xiaoqun, Holzmann, Gerard J., and Smolka, Scott A. Fighting livelock in the GNU i-protocol: a case study in explicit-state model checking. STTT 4, 4 (2003), 505-528.
[20] Floyd, Sally. TCP and successive fast retransmits. Unpublished note, 1995.
[21] Floyd, Sally, and Henderson, Thomas. The NewReno modification to TCP's fast recovery algorithm. RFC 2582, Apr. 1999.
[22] Fraleigh, C., Moon, S., Lyles, B., Cotton, C., Khan, M., Moll, D., Rockell, R., Seely, T., and Diot, C. Packet-level traffic measurements from the Sprint IP backbone. IEEE Network (2003).
[23] Holzmann, G.J. The SPIN Model Checker, 1st ed. Addison-Wesley, Jan. 2003.
[24] Jacobson, V., Braden, R., and Borman, D. TCP extensions for high performance. RFC 1323, May 1992.
[25] Jaiswal, S., Iannaccone, G., Diot, Christophe, Kurose, J., and Towsley, D. Measurement and classification of out-of-sequence packets in a tier-1 IP backbone. In Proceedings of Infocom (San Francisco, 2003).
[26] Jaiswal, Sharad, Iannaccone, Gianluca, Diot, Christophe, Kurose, Jim, and Towsley, Don. Inferring TCP connection characteristics through passive measurements (extended version). Tech. Rep. RR03-ATL-070121, Sprint ATL, July 2003.
[27] Jaiswal, Sharad, Iannaccone, Gianluca, Diot, Christophe, Kurose, Jim, and Towsley, Don. Inferring TCP connection characteristics through passive measurements. In Proceedings of Infocom (Hong Kong, 2004).
[28] Jaiswal, Sharad, Iannaccone, Gianluca, Kurose, Jim, and Towsley, Don. Formal verification of passive measurement techniques. Tech. Rep. 05-XX, Computer Science Dept., UMass Amherst, May 2005.
[29] Jiang, Hao, and Dovrolis, Constantinos. Passive estimation of TCP round-trip times. ACM Computer Communication Review 32, 3 (July 2002).
[30] Kleinrock, L., and Naylor, W. On measured behavior of the ARPA network. In Proceedings of National Computer Conference (1974).
[31] Kurose, James F., and Ross, Keith W. Computer Networking: A Top-Down Approach Featuring the Internet. Pearson Benjamin Cummings, 2004.
[32] Lee, D., Chen, D., Hao, R., Miller, R.E., Wu, J., and Yin, X. A formal approach for passive testing of protocol portions. In IEEE International Conference on Network Protocols (Nov. 2002).
[33] Lee, D., and Yannakakis, M. Principles and methods of testing finite state machines - a survey. Proceedings of the IEEE 84 (Aug. 1996).
[34] LoPresti, F., Duffield, N. G., Horowitz, J., and Towsley, D. Multicast based inference of network internal characteristics - delay distributions. vol. 27.
[35] Martin, H.S., McGregor, A.J., and Cleary, J.G. Analysis of internet delay times. In Proceedings of Passive and Active Measurement Workshop (PAM) (2000).
[36] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A. TCP selective acknowledgement options. RFC 2018, Oct. 1996.
[37] Miller, R. Passive testing of networks using a CFSM specification. In Proceedings of the IEEE International Performance, Computing and Communications Conference (Feb. 1998).
[38] Musuvathi, Madanlal, and Engler, Dawson R. Model checking large network protocol implementations. In NSDI (2004), pp. 155-168.
[39] Musuvathi, Madanlal, Park, David Y. W., Chou, Andy, Engler, Dawson R., and Dill, David L. CMC: A pragmatic approach to model checking real code. In OSDI (2002).
[40] Padhye, Jitendra, and Floyd, Sally. On inferring TCP behavior. In Proceedings of ACM Sigcomm (Aug. 2001).
[41] Padmanabhan, Venkata N., Qiu, Lili, and Wang, Helen J. Server based inference of Internet link lossiness. In Proceedings of IEEE Infocom (San Francisco, Apr. 2003).
[42] Paxson, Vern.
[29] Jiang, Hao, and Dovrolis, Constantinos. Passive estimation of TCP round-trip times. ACM Computer Communication Review 32, 3 (July 2002). [30] Kleinrock, L., and Naylor, W. On measured behavior of the ARPA network. In Proceedings of National Computer Conference (1974). [31] Kurose, James F., and Ross, Keith W. Computer Networking: A Top-Down Approach Featuring the Internet. Pearson Benjamin Cummings, 2004. [32] Lee, D., Chen, D., Hao, R., Miller, R.E., Wu, J., and Yin, X. A formal approach for passive testing of protocol portions. In IEEE International Conference on Network Protocols (Nov. 2002). [33] Lee, D., and Yannakkis, M. Principles and methods of testing finite state machines a survey. Proceedings of the IEEE 84 (Aug. 1996). [34] LoPresti, F., Duffield, N. G., Horowitz, J., and Towsley, D. Multicast based inference of network internal characteristics - delay distributions. vol. 27. [35] Martin, H.S., McGregor, A.J., and Cleary, J.G. Analysis of internet delay times. In Proceedings of Passive and Active Measurment Workshop (PAM) (2000). [36] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A. TCP selective acknowledgement options. RFC 2018, Oct. 1996. [37] Miller, R. Passive testing of networks using a CFSM specification. In Proceedings of the IEEE International Performance, Computing and Communications Conference (Feb. 1998). [38] Musuvathi, Madanlal, and Engler, Dawson R. Model checking large network protocol implementations. In NSDI (2004), pp. 155–168. [39] Musuvathi, Madanlal, Park, David Y. W., Chou, Andy, Engler, Dawson R., and Dill, David L. Cmc: A pragmatic approach to model checking real code. In OSDI (2002). [40] Padhye, Jitendra, and Floyd, Sally. On inferring TCP behavior. In Proceedings of ACM Sigcomm (Aug. 2001). [41] Padmanabhan, Venkata N., Qiu, Lili, and Wang, Helen J. Server based inference of Internet link lossiness. In Proceedings of IEEE Infocom (San Francisco, Apr. 2003). [42] Paxson, Vern. 
Automated packet trace analysis of TCP implementations. [43] Paxson, Vern. End-to-end internet packet dynamics. IEEE/ACM Transactions on Networking 7, 3 (June 1999), 277–292. [44] Ramakrishnan, K.K., Floyd, Sally, and Black, David. The addition of explicit congestion notification (ECN) to IP. RFC 3168, Sept. 2001.
[45] Rizzo, Luigi. Dummynet: a simple approach to the evaluation of network protocols. ACM Computer Communication Review 27, 1 (Jan. 1997).
[46] Savage, Stefan. Sting: a TCP-based network measurement tool. In Proceedings of the 1999 USENIX Symposium on Internet Technologies and Systems (Boulder, Oct. 1999).
[47] Smith, Mark. Formal verification of communication protocols. In Formal Description Techniques IX: Theory, Applications, and Tools (FORTE/PSTV '96): Joint International Conference on Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification (1996).
[48] Smith, Mark A., and Ramakrishnan, K. K. Formal specification and verification of safety and performance of TCP selective acknowledgment. IEEE/ACM Transactions on Networking 10, 2 (2002), 193–207.
[49] Tanenbaum, Andrew S. Computer Networks. Prentice-Hall, 1988.
[50] Thompson, Kevin, Miller, Gregory J., and Wilder, Rick. Wide-area Internet traffic patterns and characteristics. IEEE Network Magazine (Nov. 1997).
[51] Yajnik, M., Moon, S. B., Kurose, J., and Towsley, D. Measurement and modeling of the temporal dependence in packet loss. In Proceedings of IEEE Infocom (Mar. 1999).
[52] Zhang, Yin, Breslau, Lee, Paxson, Vern, and Shenker, Scott. On the characteristics and origins of Internet flow rates. In Proceedings of ACM Sigcomm (Aug. 2002).