#### AN ABSTRACT OF THE DISSERTATION OF

Rajesh Inti for the degree of <u>Doctor of Philosophy</u> in <u>Electrical and Computer Engineering</u> presented on <u>November 28, 2011</u>. Title: Highly Digital Power Efficient Techniques for Serial Links.

Abstract approved: \_

#### Pavan Kumar Hanumolu

Low-power, high-speed serial transceivers are employed in a wide range of applications, including chip-to-chip, backplane, and optical interconnects. Apart from handling a wide range of data rates, these transceivers should have low power consumption (mW/Gbps) and be fully integrated. This work discusses enabling techniques for implementing such transceivers. Specifically, three designs are presented: (1) a 0.5-4 Gbps serial link that uses current recycling to reduce power dissipation, (2) a 0.5-2.5 Gbps reference-less clock and data recovery circuit that uses a novel frequency detector to achieve unlimited acquisition range, and (3) a 2-4 Gbps low-power receiver architecture capable of resolving multiple signalling formats with a simplified XOR-based phase-rotating PLL. All three circuit topologies are highly digital and aim to address the requirements of wide operating range and low power dissipation while being fully integrated. Measured results obtained from the prototypes illustrate the effectiveness of the proposed design techniques.

<sup>©</sup>Copyright by Rajesh Inti, November 28, 2011. All Rights Reserved

### Highly Digital Power Efficient Techniques for Serial Links

by

Rajesh Inti

### A DISSERTATION

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Presented November 28, 2011

Commencement June 2012

Doctor of Philosophy dissertation of <u>Rajesh Inti</u> presented on November 28, 2011

APPROVED:

Major Professor, representing Electrical and Computer Engineering

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

Rajesh Inti, Author

#### ACKNOWLEDGMENTS

It is a pleasure to thank all the professors, colleagues and friends who made my ambition to pursue a doctoral degree a very rewarding journey, with its ebbs and flows. It is difficult to adequately express my gratitude to my advisor, Prof. Pavan Hanumolu. He has been a constant source of inspiration and enthusiasm. His tremendous effort to explain things clearly and vividly made working with him the best part of my PhD journey. Throughout my work at OSU, he provided encouragement, sound advice, good teaching, good company, and lots of good ideas.

I would like to extend my sincerest gratitude to Prof. Un-Ku Moon, Prof. Karti Mayaram and Prof. Gabor Temes for serving on my thesis committee. Their constructive comments and questions have always helped provide a different perspective on understanding circuit topologies. I would also like to thank Prof. William Warnes for taking time out of his schedule to serve as the graduate council representative in my Ph.D. program.

I would like to thank the many people who have made this journey extremely enjoyable by providing a stimulating and fun environment: Amr Elshazly, Sachin Rao, Wenjing Yin, Brian Young, Seokmin Jung, Qadeer Khan, Mrunmay Talegaonkar, Reddy Karthikeyan, Tushar Uttarwar, Abhijith Arakali, Sarvesh Bang, Bangda Yang, Brian Drost, Romesh Nandawana and Saurabh Saxena from Prof. Hanumolu's group; Sasidhar, Hariprasath Venkatram, Manideep Gande, Tawfiq Musah, Sunwoo Kwon, Nima Maghari and Ho-Young Lee from Prof. Moon's group; Ankur Guha Roy and Wai Leng Cheong from Prof. Mayaram's group; and Kangmin Hu and Jacob Postman from Prof. Chiang's group. Friday evening get-togethers with Gopi, Ravi, Srikar, Sasidhar, Jana and Murali form a huge chunk of my great days in Corvallis! I am also greatly indebted for all the help I have received from Ganesh Balamurugan (Intel), Thomas Toifl (IBM), Marcel Kossel (IBM), Christian Menolfi (IBM) and Srikanth Gondi (Silicon Image) during various phases of my PhD. I am grateful to Ferne for assisting me in many different ways with all administrative work.

My wife, Priya, patiently listened to all my ramblings about circuits (and never complained!). She continues to amaze me with her ability to adapt to my extremely fluctuating work hours. Lastly, and most importantly, I wish to thank my parents (Narendra and Padma), and my brother (Mahesh). They receive my deepest gratitude for all their endless love, encouragement and moral support through all these years. To them, I dedicate this thesis.

### TABLE OF CONTENTS

| Section | Title |
|---------|-------|
| 1 | INTRODUCTION |
| 1.1 | Motivation |
| 1.2 | Thesis Organization |
| 2 | SIGNALLING AND CLOCKING TECHNIQUES FOR HIGH SPEED SERIAL TRANSCEIVERS |
| 2.1 | Review of High Speed Serial Links |
| 2.2 | Clocking |
| 2.3 | Signalling |
| 2.4 | Clock Recovery |
| 2.5 | Trends in High Speed Serial Links |
| 2.5.1 | Technology limitations on achievable data rates |
| 2.5.2 | Clocking Schemes |
| 2.5.3 | Signalling Format |
| 2.5.4 | Clock Recovery |
| 3 | SERIAL TRANSCEIVER USING CURRENT-RECYCLING |
| 3.1 | Introduction |
| 3.2 | Low power techniques for serial transceivers |
| 3.2.1 | Switched Regulation |
| 3.3 | Proposed Current Recycling Approach |
| 3.4 | Proposed System Architecture |
| 3.5 | Implementation Details |
| 3.5.1 | Transmit PLL |
| 3.5.2 | Low Voltage Digitally Controlled Oscillator (DCO) |
| 3.5.3 | Integral Path DAC |
| 3.5.4 | Voltage-mode Transmitter |
| 3.5.5 | Receiver Front-end |
| 3.5.6 | CDR: FLL and PLL loops |
| 3.6 | Measurement Results |
| 4 | REFERENCE-LESS CLOCK AND DATA RECOVERY |
| 4.1 | Introduction |
| 4.2 | Conventional Clock and Data Recovery Circuit Architectures |
| 4.3 | Proposed Reference-less Half-rate CDR Architecture |
| 4.4 | Proposed Frequency Detector |
| 4.5 | Improving tolerance to input duty-cycle error |
| 4.6 | Circuit Design |
| 4.6.1 | Digitally Controlled Oscillator (DCO) |
| 4.6.2 | Linear Digital-to-Delay Conversion |
| 4.7 | Measured Results |
| 5 | MULTIMODE SOURCE-SYNCHRONOUS RECEIVER |
| 5.1 | Introduction |
| 5.2 | Proposed Architecture |
| 5.3 | Implementation Details |
| 5.3.1 | Reconfigurable slicer bank: Threshold and clocking |
| 5.3.2 | Symbol-rate phase recovery |
| 5.3.3 | Low power phase rotating PLL |
| 5.4 | Measured results |
| 6 | CONCLUSION |
| 6.1 | Conclusions |
| 6.2 | Future work |
|   | BIBLIOGRAPHY |

### LIST OF FIGURES

| Figure | Caption |
|--------|---------|
| 2.1 | NRZ/RZ data waveforms and PSD's for a random pattern |
| 2.2 | Building blocks of serial link transceiver |
| 2.3 | Sampling of incoming data eye by phase locking using $\mathrm{CK}_{\mathrm{RX,EDGE}}$ |
| 2.4 | (a) Full (b) Half (c) Quarter rate clock and data waveforms |
| 2.5 | Duty cycle distortion (DCD) in half-rate transmitter |
| 2.6 | Classification of link based on clocking scheme employed |
| 2.7 | Simplified model for signalling in a serial link |
| 2.8 | PAM4 signalling scheme |
| 2.9 | Comparison of eye-diagrams for NRZ and PAM4 signalling schemes |
| 2.10 | Simultaneous bidirectional (SBD) signalling scheme |
| 2.11 | Implementation of Simultaneous bidirectional (SBD) signalling |
| 2.12 | Comparison of eye-diagrams for NRZ and SBD signalling schemes |
| 2.13 | NRZ to Duobinary encoding |
| 2.14 | Comparison of eye-diagrams for NRZ and Duobinary signalling schemes |
| 2.15 | VCO based CDR |
| 2.16 | Phase interpolation based CDR |
| 2.17 | Phase interpolation |
| 2.18 | Predicted and literature data for FO4 delay at different process lengths |
| 2.19 | Achieved bitrates at different process nodes |
| 2.20 | FOM (mW/Gbps) vs process length |
| 2.21 | FOM vs year reported in literature |
| 2.22 | Data signalling rate plotted as a function of maximum on-chip clock frequency |
| 2.23 | FOM in full/half/quarter rate links |
| 2.24 | FOM in mesochronous/plesiochronous receiver based serial links |
| 2.25 | FOM versus channel attenuation for various signalling schemes |
| 2.26 | FOM vs process length for VCO based and PI based serial links |
| 3.1 | Switched Regulator generating supply voltage for transceiver |
| 3.2 | Linear Regulator generating $\alpha V_{DD}$ |
| 3.3 | Current recycling between stacked logic circuits |
| 3.4 | Current recycling between stacked *unbalanced* circuits in a PLL [1] |
| 3.5 | Conventional plesiochronous transceiver with *similar* blocks highlighted |
| 3.6 | Proposed transceiver with current recycling among *clock generation* circuits |
| 3.7 | Proposed transceiver architecture with current recycling |
| 3.8 | Transmitter implementation in the proposed transceiver |
| 3.9 | Receiver implementation in the proposed transceiver |
| 3.10 | Transmit PLL implementation |
| 3.11 | Low voltage DCO implementation |
| 3.12 | Integral path implementation |
| 3.13 | Transmitter implementation |
| 3.14 | Impedance locked loop that ensures the output termination is $50\Omega$ |
| 3.15 | Receiver front-end implementation |
| 3.16 | CDR FLL Implementation |
| 3.17 | CDR PLL Implementation |
| 3.18 | Die Micrograph |
| 3.19 | Measured power dissipation in the *stacked* clock generation circuits |
| 3.20 | Measured efficiency of dc-dc conversion due to current recycling |
| 3.21 | Measured transmit PLL jitter at 1.6GHz |
| 3.22 | Measured transmit PLL phase noise at 1.6GHz |
| 3.23 | Measured PLL jitter generation across entire operating range |
| 3.24 | Measured transmitter eye diagrams (3.2Gbps) at different swing levels |
| 3.25 | Measured BER at the receiver input |
| 3.26 | Measured recovered clock jitter and half-rate data stream |
| 3.27 | Measured power dissipation of the serial transceiver at different data-rates |
| 4.1 | Block diagram of a conventional CDR using a reference clock |
| 4.2 | Block diagram of a reference-less CDR |
| 4.3 | Proposed CDR architecture |
| 4.4 | Digital loop filter implementation |
| 4.5 | RFD transfer characteristic |
| 4.6 | Transition density in alternating and random data patterns |
| 4.7 | 10-bit counter output clocked with alternating data |
| 4.8 | 10-bit counter output clocked with random data |
| 4.9 | MSB of the 10-bit counter output |
| 4.10 | Implementation of the high-speed 10-bit counter |
| 4.11 | PSD at the output of the first three divide-by-2 stages |
| 4.12 | PSD after ten divide-by-2 stages |
| 4.13 | Block diagram of the FLL and the schematic of the frequency detector |
| 4.14 | Origin of duty-cycle error: Half-rate transmitter with and without clock duty-cycle error |
| 4.15 | Data/Edge sampling clocks before and after calibration in the presence of data duty-cycle error |
| 4.16 | Required phase shift on sampling clocks I,Q,Ib,Qb with input data duty-cycle error = $\alpha$ UI |
| 4.17 | Possible locking scenarios in a half-rate CDR |
| 4.18 | Re-configured CDR to reduce the number of lock points |
| 4.19 | Data duty-cycle estimator |
| 4.20 | Simulation results demonstrating that slope of accumulation of $(E-L)_{EVEN}$ is indicative of the sign of the input data duty-cycle error when $(E,L)_{EVEN}$ corresponds to a (a) *always early* and (b) *always late* case |
| 4.21 | Convergence of the clock-phase calibration algorithm |
| 4.22 | Digitally controlled oscillator (DCO) |
| 4.23 | $\Delta\Sigma$ DAC in both FLL and PLL paths |
| 4.24 | FLL controlled tuning curve of the DCO |
| 4.25 | Digital integral path (in PLL) controlled tuning curve of the DCO |
| 4.26 | Conventional digital-to-delay converter (DDC) |
| 4.27 | Linearized digital-to-delay converter (DDC) |
| 4.28 | Simulated DDC characteristic |
| 4.29 | Schematic of all the 4 DDCs used to phase shift I/Q/Ib/Qb |
| 4.30 | Die micrograph |
| 4.31 | Measured PSD's of first three divide-by-2 stages with 2Gbps random data |
| 4.32 | Measured FLL offset with different input PRBS pattern |
| 4.33 | FLL/PLL clock spectrums with PRBS10 and PRBS15 input patterns |
| 4.34 | Edge sampling clock phases I/Ib before and after calibration with 20% duty-cycle error |
| 4.35 | Recovered clock jitter with a PRBS10 input pattern |
| 4.36 | Random NRZ data stream "1100010" generated from impulse stream p(t) |
| 4.37 | PSD of the first two divider stages for a 2 Gbps random input data |
| 5.1 | Architecture of multi-mode receiver |
| 5.2 | Signal eye diagrams and slicer threshold and clocking arrangement in each mode |
| 5.3 | Symbol rate phase recovery |
| 5.4 | Low power phase rotating PLL |
| 5.5 | Die Micrograph |
| 5.6 | PLL jitter and phase noise at 1.6GHz |
| 5.7 | 64 discrete phase steps in the phase rotating PLL |
| 5.8 | Measured receive eye diagrams at 3.2Gbps while transmitting on the medium and long FR4 traces |
| 5.9 | Measured receive eye diagrams at 3.2Gbps while transmitting on the medium and long FR4 traces |

### LIST OF TABLES

| Table | Title |
|-------|-------|
| 3.1 | Performance Comparison |
| 4.1 | Digital CDR Performance Summary |
| 4.2 | Performance Comparison |
| 5.1 | Multimode receiver performance summary |

# HIGHLY DIGITAL POWER EFFICIENT TECHNIQUES FOR SERIAL LINKS

### **CHAPTER 1. INTRODUCTION**

### 1.1 Motivation

Aggressive scaling of CMOS technology has resulted in sustained growth in integrated circuit density and speed. CMOS scaling, accompanied by increased circuit complexity, has led to extremely fast and capable computing systems that demand high off-chip data bandwidth. High-speed serial communication techniques are widely adopted to address this bandwidth requirement in applications such as CPU-memory and CPU-peripheral interfaces, network interfaces, backplanes, and optical links. The achievable data rate of a serial transceiver is limited either by the transistor feature size in a given technology or by the channel bandwidth. Although equalization techniques for dealing with band-limited channels are well established, achieving high data rates and a low bit error rate (BER) within a tight power budget ( $<5 \,\mathrm{mW/Gbps}$ ) continues to be a significant challenge.

The main focus of the dissertation is to explore design techniques that leverage digital implementation of conventional analog circuits to address the requirements of high speed serial transceivers capable of operating over a wide range of data rates with a stringent power budget (mW/Gbps).
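The mW/Gbps power budget quoted above can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the example power and data-rate values are hypothetical and are not measurements from this work. It relies on the fact that 1 mW/Gbps corresponds to 1 pJ of energy per transmitted bit.

```python
def efficiency_mw_per_gbps(power_mw, data_rate_gbps):
    """Power efficiency in mW/Gbps (numerically equal to energy per bit in pJ/bit)."""
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


def meets_budget(power_mw, data_rate_gbps, budget_mw_per_gbps=5.0):
    """True if the link is below the given mW/Gbps budget (5 mW/Gbps by default)."""
    return efficiency_mw_per_gbps(power_mw, data_rate_gbps) < budget_mw_per_gbps


if __name__ == "__main__":
    # Hypothetical operating points, chosen only to illustrate the metric.
    for power_mw, rate_gbps in [(10.0, 4.0), (15.0, 2.0)]:
        fom = efficiency_mw_per_gbps(power_mw, rate_gbps)
        print(f"{power_mw} mW at {rate_gbps} Gbps -> {fom} mW/Gbps, "
              f"within budget: {meets_budget(power_mw, rate_gbps)}")
```

Because the metric normalizes power to throughput, a link that dissipates the same power at a lower data rate scores worse, which is why wide-range transceivers need power that scales with the data rate.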

### 1.2 Thesis Organization

The thesis is organized as follows:

Chapter 2 provides a review of signalling and clocking techniques widely adopted in serial transceivers. Design trends are analyzed using survey data.

Chapter 3 presents a transceiver design that uses *current recycling* to reduce power dissipation.

Chapter 4 presents a clock and data recovery circuit that uses a *reference-less frequency detector* with unlimited frequency acquisition range. Extracting a reference tone from the incoming random data obviates the need for an external crystal oscillator, thereby reducing external component count and system cost.

Chapter 5 presents a source-synchronous receiver capable of resolving multiple signalling formats. The receiver has a reconfigurable slicer bank that is optimized to resolve NRZ, RZ, Duobinary and PAM4 signalling formats. Phase locking in the receiver is achieved using a very low power XOR-based phase-rotating PLL.

Finally, the dissertation is concluded in Chapter 6 by providing a summary of the contributions and directions for further research.

# CHAPTER 2. SIGNALLING AND CLOCKING TECHNIQUES FOR HIGH SPEED SERIAL TRANSCEIVERS

This chapter is organized as follows. Section 2.1 provides a review of non-return to zero (NRZ) data signalling and the building blocks of a high-speed serial transceiver. Serial links[2, 3, 4, 5] are broadly classified based on (a) the clocking scheme used, (b) the signalling technique employed, and (c) the clock recovery method at the receiver; Sections 2.2, 2.3, and 2.4 discuss these in turn. Section 2.5 presents survey data, examining the technology limitations on achievable data rates and the power dissipation trends in serial link transceivers. Current design trends in each of the serial-link subtasks are also examined while analyzing the survey data.

#### 2.1 Review of High Speed Serial Links

Non-return to zero (NRZ) and return to zero (RZ) are commonly used formats for binary data transmission. Fig. 2.1 shows the transmitted NRZ and RZ waveforms for a known bit pattern "1001". Every transmitted bit occupies $T_b$ seconds, or one unit interval (1 UI). Examining the power spectral density (PSD) of a long random binary sequence comprising 0/1's with equal transition density, it can be shown[3] that the spectra of the NRZ and RZ data are

$$S_{\rm NRZ}(f) = T_{\rm b} \left[ \frac{\sin(\pi f T_{\rm b})}{\pi f T_{\rm b}} \right]^2, \ S_{\rm RZ}(f) = \frac{T_{\rm b}}{2} \left[ \frac{\sin(0.5\pi f T_{\rm b})}{0.5\pi f T_{\rm b}} \right]^2 \tag{2.1}$$






Figure 2.1: NRZ/RZ data waveforms and PSD's for a random pattern

Closely examining the NRZ data spectrum, it can be seen that the first spectral null is at  $f=\frac{1}{T_b}$ , whereas the RZ data has it at  $f=\frac{2}{T_b}$ . The larger spread in the PSD of RZ data requires a channel with larger bandwidth, which makes NRZ the preferred format for binary data transmission.
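The location of the first spectral null can be checked numerically. The following is a minimal sketch that evaluates the two expressions of Eq. (2.1) and scans for the first frequency at which each vanishes; the function names and the scan step are illustrative assumptions and are not part of the original analysis.

```python
import math

def sinc_sq(x):
    """Return (sin(pi*x) / (pi*x))**2, handling the x -> 0 limit."""
    if abs(x) < 1e-12:
        return 1.0
    return (math.sin(math.pi * x) / (math.pi * x)) ** 2

def psd_nrz(f, tb):
    """S_NRZ(f) as written in Eq. (2.1)."""
    return tb * sinc_sq(f * tb)

def psd_rz(f, tb):
    """S_RZ(f) as written in Eq. (2.1)."""
    return 0.5 * tb * sinc_sq(0.5 * f * tb)

def first_null(psd, tb, step=1e-3, tol=1e-9):
    """Scan f in steps of step/tb and return the first f > 0 where the PSD vanishes."""
    k = 1
    while True:
        f = k * step / tb
        if psd(f, tb) < tol * tb:
            return f
        k += 1

# Nulls expressed in units of 1/T_b (hypothetical T_b of 1 second for readability).
nrz_null = first_null(psd_nrz, 1.0)
rz_null = first_null(psd_rz, 1.0)
```

Under these assumptions the scan returns a null at $1/T_b$ for NRZ and at $2/T_b$ for RZ, consistent with the observation above.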



Figure 2.2: Building blocks of serial link transceiver

Fig.2.2 shows the basic building blocks of a serial link transceiver. To mitigate the effect of channel attenuation at higher frequencies, equalizers can be used at the transmit (Tx) and/or receive (Rx) end. The frequency response of the equalizer is high-pass in nature, thereby nullifying the low-pass characteristic of the channel[5].

The transmitter (Tx) consists of four important blocks: the transmit phase locked loop (TxPLL), serializer, transmit equalizer, and transmit driver. The TxPLL generates a high frequency on-chip clock using an external crystal reference. The serializer multiplexes the data word input into a serial stream using the TxPLL clock output. The transmit driver, implemented using either current mode logic (CML)[6] or voltage mode (VM)[7] circuits, outputs the serialized data onto the channel. The static power dissipation associated with CML based drivers has been effectively addressed using VM drivers, but at the cost of a more complex equalization scheme in the VM domain. The low output impedance of the driver helps in achieving high channel bandwidth, but at the penalty of high power dissipation in the transmit driver to achieve voltage swings on the order of hundreds of mV. To avoid reflections due to the transmission line behavior of the channel, a 50  $\Omega$  termination resistor is used at the receive end. The receiver (Rx) consists of three basic blocks: the receive equalizer, clock recovery unit and data samplers (also called *slicers*). The clock recovery unit and the data samplers together are referred to as the clock and data recovery (CDR) unit. The samplers are typically implemented as high speed regenerative latches[8].

For maximum voltage and timing margins in the data sampler at the Rx end, the sampling clock  $CK_{RX,DATA}$  should be at the center of the eye[9], as shown in Fig.2.3. The clock recovery unit aligns to the center of the data eye by aligning  $CK_{RX,EDGE}$  to the data edge and using the 180° phase-shifted clock, i.e.  $CK_{RX,DATA}$ , for sampling the data. Since two samples corresponding to each bit



Figure 2.3: Sampling of incoming data eye by phase locking using  $CK_{RX,EDGE}$ 

(data sample (DS) and edge sample (ES)) are being used, it is often called the 2X oversampling technique[10]. The higher the oversampling ratio, the better the BER performance, but at the cost of additional power dissipation in the samplers[11]. Based on system requirements, deserialization might be used at the receive end. The sections that follow provide a review of the classification of links based on clocking, signalling, equalization and clock recovery.

#### 2.2 Clocking

Based on the relative switching rates of data (D) and clock (CK), the majority of links are either full, half, or quarter rate architectures. Choosing a multi-rate (half, quarter, or lower) architecture reduces the maximum on-chip clock frequency[12], but at the cost of generating multiple clock phases to multiplex data at the transmit end and de-multiplex data at the receive end. Fig.2.4 illustrates the data and clock switching relationship in the most commonly used schemes. The primary issue limiting the use of multi-rate clocking schemes is *duty-cycle distortion* (DCD). In



Figure 2.4: (a) Full (b) Half (c) Quarter rate clock and data waveforms

multi-rate architectures, transmit data inherits the duty-cycle of the transmit clock as shown in Fig.2.5 (which in practice is not exactly 50%), mandating the use of duty cycle correction circuits[13], [14]. It can be shown that the transfer characteristic of the Alexander phase detector will have a dead zone in the presence of data DCD.

Another classification of serial links is based on the generation of the receiver clock [15]. The main categories are shown in Fig.2.6. If the link has an extra channel to transmit the clock to the receiver, it is called a forwarded clock architecture or "source synchronous" type. If the transmitter and receiver generate their respective clocks from different local crystal oscillators, it is called an embedded clock architecture or "plesiochronous" type. Another category of links is based on "reference-less" clocking, where the receiver derives the sampling clock by use



Figure 2.5: Duty cycle distortion (DCD) in half-rate transmitter



Figure 2.6: Classification of link based on clocking scheme employed

of frequency detectors for random data[16, 17, 18]. The reference-less architecture is employed in serial link implementations that cannot afford either an extra channel for clock forwarding or a crystal oscillator at the receive end.

The most important distinction between source synchronous and plesiochronous schemes is that plesiochronous receivers must account for small differences between the transmit and receive frequencies, caused by mismatch in the crystal oscillator references at the two ends (typically measured in *ppm*, parts per million). The two most important advantages of source synchronous clocking[15] are that (a) transmit jitter on the data (D) and clock (CK) is highly correlated, which improves the *jitter tolerance* of the receiver, and (b) CDR complexity is reduced when frequency modulation schemes (e.g. spread spectrum clocking) are employed at the transmit end.
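To give a feel for what a ppm-level frequency mismatch means for a plesiochronous receiver, the sketch below computes how quickly an uncorrected sampling clock would drift relative to the incoming data. The helper names and the example offsets are hypothetical and serve only as an illustration.

```python
def drift_ui(ppm_offset, n_bits):
    """Accumulated phase drift, in UI, after n_bits unit intervals
    when the receive clock is off by ppm_offset parts per million."""
    return n_bits * ppm_offset * 1e-6

def ui_until_slip(ppm_offset):
    """Number of unit intervals after which the uncorrected drift reaches one full UI."""
    if ppm_offset <= 0:
        raise ValueError("ppm_offset must be positive")
    return 1e6 / ppm_offset

# Example with an assumed 100 ppm mismatch: one full UI of drift every 10,000 bits.
bits_per_slip = ui_until_slip(100)
```

Even a small mismatch therefore accumulates into a full bit slip within a few thousand bits, which is why the plesiochronous CDR must continuously track the frequency difference, whereas a forwarded clock removes it by construction.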

#### 2.3 Signalling

The most commonly used signalling schemes are (a) binary (also called PAM2 or NRZ), (b) 4-level pulse amplitude modulation (PAM4), (c) simultaneous bidirectional signalling (SBD), and (d) Duobinary signalling. Binary is the regular NRZ format or, equivalently, 2-level PAM (1 or 0 transmission). Understanding the different signalling schemes is made easier by using the highly simplified view of a serial transceiver shown in Fig.2.7. The transmitter can be visualized as a 1-bit digital to analog converter (DAC) and the receiver as a 1-bit analog to digital converter (ADC).

The PAM4 signalling scheme[19] shown in Fig.2.8 relies on a 2-bit DAC/ADC approach, which allows 2 bits of information to be coded into 4 discrete levels. PAM4 reduces the signalling frequency by a factor of 2, since the 2 UI required for transmission of 2



Figure 2.7: Simplified model for signalling in a serial link



Figure 2.8: PAM4 signalling scheme

bits in NRZ is only 1 UI in the PAM4 scheme. Reduced signalling frequency implies operating the serial link in a lower loss region of the channel. From the eye diagrams in Fig.2.9, it can be seen that, compared to regular NRZ signalling, there is a 2X increase in the eye width (timing margin) and a reduction of the eye height (voltage margin) to one-third, i.e., a  $20\log_{10}(3) = 9.54$  dB decrease. An important observation from the PAM4 eye diagram is that, unlike NRZ, there are multiple edge-sampler "lock" points. A technique to ensure phase lock for maximum timing margin has been addressed in [20].
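The margin trade-off quoted above follows from simple level counting. As a minimal sketch, assuming equally spaced levels within a fixed peak-to-peak swing (an idealization that ignores the non-adjacent transitions discussed later), the eye width and eye height relative to NRZ can be computed as follows; the function name is illustrative.

```python
import math

def pam_margins(levels):
    """Return (eye_width_ratio, eye_height_ratio, voltage_penalty_db) of an
    M-level PAM scheme relative to NRZ at the same bit rate and peak swing."""
    if levels < 2:
        raise ValueError("at least two levels are required")
    bits_per_symbol = math.log2(levels)
    eye_width_ratio = bits_per_symbol          # symbol period in NRZ bit periods
    eye_height_ratio = 1.0 / (levels - 1)      # swing is split into (M - 1) eyes
    voltage_penalty_db = 20.0 * math.log10(levels - 1)
    return eye_width_ratio, eye_height_ratio, voltage_penalty_db

width, height, penalty_db = pam_margins(4)     # PAM4: 2X width, 1/3 height, ~9.54 dB
```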



Figure 2.9: Comparison of eye-diagrams for NRZ and PAM4 signalling schemes

Simultaneous bidirectional signalling (SBD) relies on increasing the throughput of a channel by making it fully bidirectional[21]. Assuming  $D_{A,TX}$  and  $D_{B,TX}$ 



Figure 2.10: Simultaneous bidirectional (SBD) signalling scheme

can take values of either '1' or '0', the voltage on the channel is a function of the transmit and receive bits,  $D_{A,TX}$  and  $D_{B,TX}$ , as shown in Fig.2.10. Considering the transmitter on either end of an SBD link as a 1-bit current DAC, there are 3 possible voltage levels on the channel. With prior knowledge of the transmit bit  $D_{A,TX}$ , the receive bit  $D_{B,TX}$  on the transmit end can be obtained using a comparator with switched thresholds as shown in Fig.2.11. The eye diagrams of the NRZ and SBD signalling schemes are compared in Fig. 2.12. SBD has a small reduction in the timing margin and a  $20\log_{10}(2) \approx 6 \text{ dB}$  decrease in the voltage margin, but offers twice the throughput on the same channel compared to NRZ.
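The switched-threshold decision can be illustrated with a small behavioral sketch. It assumes an idealized, normalized channel on which the line level is the average of the two transmitted bits (giving the three levels 0, 0.5 and 1) and thresholds placed midway between adjacent levels; these values are assumptions made for illustration and are not taken from the implementation in Fig. 2.11.

```python
def line_level(d_a, d_b):
    """Normalized channel voltage when ends A and B drive bits d_a and d_b."""
    return 0.5 * (d_a + d_b)

def recover_remote_bit(level, d_local):
    """Recover the far-end bit using a threshold selected by the local transmit bit."""
    threshold = 0.75 if d_local else 0.25
    return 1 if level > threshold else 0

# Exhaustive check over the four bit combinations, as seen from end A.
decoded = {(a, b): recover_remote_bit(line_level(a, b), a)
           for a in (0, 1) for b in (0, 1)}
```

Because the local bit is known, each decision reduces to a two-level comparison with half the NRZ eye height, which corresponds to the 6 dB voltage margin penalty noted above.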

Duobinary signalling relies on encoding the transmit data into a 3-level stream such that the PSD of the data has its first spectral null at  $\frac{1}{2T_b}$ , thereby reducing the bandwidth requirement of the channel by half. An example of an NRZ bit



Figure 2.11: Implementation of Simultaneous bidirectional (SBD) signalling



Figure 2.12: Comparison of eye-diagrams for NRZ and SBD signalling schemes

stream converted into duobinary is shown in Fig. 2.13. The eye diagram comparison between NRZ and duobinary signalling schemes is shown in Fig. 2.14. There is no change in timing margin and a 6 dB decrease in voltage margin. Comparing with Fig. 2.9 and Fig. 2.12, it can be seen that the multiple lock points problem is not present in duobinary signalling.
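For readers unfamiliar with the encoding, the following sketch shows one common textbook formulation of duobinary signalling: a modulo-2 precoder followed by a sum of the current and previous precoded bits. It is an assumed variant for illustration and may differ in detail from the encoding drawn in Fig. 2.13.

```python
def duobinary_encode(bits):
    """Precode with p[n] = d[n] XOR p[n-1], then emit y[n] = p[n] + p[n-1] in {0, 1, 2}."""
    prev = 0                       # assumed initial precoder state
    levels = []
    for d in bits:
        p = d ^ prev
        levels.append(p + prev)
        prev = p
    return levels

def duobinary_decode(levels):
    """With precoding, the middle level maps to '1' and the outer levels map to '0'."""
    return [lv % 2 for lv in levels]

data = [1, 0, 0, 1, 1, 0, 1]
tx_levels = duobinary_encode(data)
rx_bits = duobinary_decode(tx_levels)
```

In this formulation the output never moves directly between the two outer levels in consecutive symbols, which is the property that compresses the spectrum relative to NRZ.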



Figure 2.13: NRZ to Duobinary encoding

#### 2.4 Clock Recovery

The two widely used approaches for clock recovery, voltage controlled oscillator (VCO) based and phase interpolation based, are shown in Fig. 2.15 and Fig. 2.16, respectively. In the VCO based approach, the phase of a single clock output  $CK_{RX}$  is adjusted to the center of the incoming data eye. The VCO based approach has two loops: one sets the frequency to the incoming data rate using a frequency detector (FD), and the other adjusts the sampling clock phase to the center of the eye using a phase detector (PD). The lock detect signal is used to transition from the frequency acquisition loop to the phase loop.

In the phase interpolation based approach [22], a finite state machine (FSM) is used to select the optimal phase from the multiple phases generated by a phase



Figure 2.14: Comparison of eye-diagrams for NRZ and Duobinary signalling schemes



Figure 2.15: VCO based CDR



Figure 2.16: Phase interpolation based CDR

locked loop (PLL) or delay locked loop (DLL). The phase interpolation based approach suffers from quantization error between phases. Phase interpolator and feedback phase selection architectures that reduce the effect of this quantization error have been reported in [23, 24, 25, 26].

The basic principle behind phase interpolation (PI) [23] is demonstrated in Fig. 2.17. Addition of two sine waves that are 90° out of phase, defined as  $Xsin(\omega t)$  and  $Ycos(\omega t)$ , yields  $Rsin(\omega t+\phi)$ , where  $R=\sqrt{X^2+Y^2}$  and  $\phi=\arctan(Y/X)$ . It can be seen that the phase of the resultant sine wave is a function of the amplitudes of the two inputs. Implementation of phase interpolators relies on amplitude DACs (digital to analog converters) which control the amplitudes X and Y. Phase shifts in the range  $[0,\frac{\pi}{2}]$  can be obtained, with the end points reached by making either Y or X = 0. The range of the PI can be extended using multiple phases, e.g.  $Xsin(\omega t)$ ,  $Xsin(\omega t + \pi) = -Xsin(\omega t)$ ,  $Ycos(\omega t)$ ,  $Ycos(\omega t + \pi)=-Ycos(\omega t)$ , resulting in an effective phase output range of  $\pm\pi$ . A differential two delay cell voltage controlled oscillator (VCO) or a single ended four stage VCO with weak cross coupling between even stages provides all 4 required phases [27, 28].
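The interpolation identity can be verified numerically. The sketch below, with illustrative function names, computes the amplitude and phase of the sum of the two weighted quadrature components and allows negative weights, which corresponds to selecting the inverted phases mentioned above.

```python
import math

def interpolate(x, y):
    """Return (R, phi) such that x*sin(wt) + y*cos(wt) == R*sin(wt + phi)."""
    return math.hypot(x, y), math.atan2(y, x)

def weighted_sum(x, y, wt):
    """Direct evaluation of the two weighted quadrature components."""
    return x * math.sin(wt) + y * math.cos(wt)

# Equal weights place the output midway between the two input phases (phi = pi/4).
r, phi = interpolate(1.0, 1.0)
```

Using `atan2` rather than a plain arctangent keeps the result valid in all four quadrants, so the same expression covers the full $\pm\pi$ output range.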

Since the clock waveforms are non-sinusoidal, *slew rate* (measured in V/ps) plays a very important role in realizing the PI. Another issue with realizing phase interpolators is linearity. Since the output phase depends on an *arctan* function, the gain with respect to each of the amplitude DACs is given by

$$\frac{\partial \phi}{\partial X_{dac}} = \frac{-Y_{dac}}{X_{dac}^2 + Y_{dac}^2}$$
$$\frac{\partial \phi}{\partial Y_{dac}} = \frac{X_{dac}}{X_{dac}^2 + Y_{dac}^2}$$

It can be observed from the above equations that the small signal gain is not constant but is a highly non-linear function of the amplitude DAC settings.
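The effect of this non-linearity on the phase step size can be sketched as follows. The example assumes a hypothetical interpolator in which the two DAC weights vary linearly and complementarily, $X = N-k$ and $Y = k$ for code $k = 0 \ldots N$; this weighting is an assumption chosen for illustration, not a description of a specific design.

```python
import math

def pi_gain(x, y):
    """Small-signal gains (d_phi/dX, d_phi/dY) for phi = arctan(Y / X)."""
    denom = x * x + y * y
    return -y / denom, x / denom

def pi_phase(k, n_steps):
    """Output phase for code k with complementary linear weights X = N - k, Y = k."""
    return math.atan2(k, n_steps - k)

def phase_steps(n_steps):
    """Phase increment between consecutive codes across the [0, pi/2] quadrant."""
    return [pi_phase(k + 1, n_steps) - pi_phase(k, n_steps) for k in range(n_steps)]

steps = phase_steps(16)
step_ratio = max(steps) / min(steps)   # greater than 1: steps are not uniform
```

With linear weights the steps near the middle of the quadrant are noticeably larger than those near its ends, which is one way the non-constant gain shows up as phase quantization error.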



Figure 2.17: Phase interpolation

#### 2.5 Trends in High Speed Serial Links

A comprehensive survey of literature data from ISSCC and VLSI Symposium was used to study trends in the design of high speed serial links and stand alone transmitters/receivers.

#### 2.5.1 Technology limitations on achievable data rates

Technology limitation on the achievable data rate can be obtained using the fanout-of-4 (FO4) inverter delay metric. The reason behind using FO4 delay as a metric to characterize any technology is that the delay of most digital gates, when normalized to the FO4 inverter delay (in that technology), tends to stay constant irrespective of the technology. A common rule of thumb to estimate the FO4 delay of a given process is

$$\text{FO4 delay [ps]} = 0.5 \times \text{Process length [nm]}$$

This rule of thumb is used to estimate the FO4 delay in most CMOS processes [2, 29], and its validity has been verified using data from literature as shown in Fig. 2.18.



Figure 2.18: Predicted and literature data for FO4 delay at different process lengths

Using bit periods shorter than the FO4 delay tends to lead to pulse amplitude closure. Using the survey data, the achieved bitrates in Gbps (for complete links, stand-alone transmitters and receivers) are plotted against the process length in Fig. 2.19. It should be noted that most of the designs in a given technology are bounded above by the theoretical "1 FO4" bitrate curve and below by the "32 FO4" bitrate curve.
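As a worked example of these bounds, the sketch below applies the FO4 rule of thumb and converts a bit period of a given number of FO4 delays into a bitrate. The function names are illustrative, and the 90 nm node is used only as an example.

```python
def fo4_delay_ps(process_nm):
    """Rule-of-thumb FO4 delay: 0.5 ps per nm of process length."""
    return 0.5 * process_nm

def bitrate_gbps(process_nm, n_fo4):
    """Bitrate when one bit period equals n_fo4 FO4 delays (1000 ps per ns)."""
    return 1000.0 / (n_fo4 * fo4_delay_ps(process_nm))

# 90 nm: FO4 = 45 ps, so the bounds are about 22.2 Gbps (1 FO4) and 0.69 Gbps (32 FO4).
upper_90nm = bitrate_gbps(90, 1)
lower_90nm = bitrate_gbps(90, 32)
```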



Figure 2.19: Achieved bitrates at different process nodes

A detailed study showed that the designs achieving bitrates beyond the bound defined by 1 FO4 were (a) mostly stand-alone receivers [30, 31, 32, 33, 34], all of which were multi-rate architectures, and (b) PAM4 based transmitters [19, 35], which have an effective 2 UI eye width compared to the NRZ signalling scheme. Only one of the links [36] was based on a full-rate transmitter; it used R+L loaded CML drivers (for clock routing and the transmitter output) to achieve high bandwidth, which also suggests that the highest achievable speed in a link is limited by the transmitter bandwidth. Most receiver implementations rely on a multi-rate approach to relax the speed requirement of the slicers.

Scaling of CMOS technology has relied on the "constant-field" approach (which holds true down to 90 nm gate lengths) and it can be shown that the dynamic power-delay product (PDP) of an inverter shows a cubic relation to the channel length



Figure 2.20: FOM (mW/Gbps) vs process length



Figure 2.21: FOM vs year reported in literature

L [37, 38, 4]. PDP is a measure of the energy spent per transition. Similar to the power delay product of the inverter, serial links adopt a figure of merit (FOM) which measures the energy consumption per bit transition (measured in mW/Gbps or pJ/bit).

$$\text{FOM} = \frac{\text{Power consumption [mW]}}{\text{Bitrate [Gbps]}}$$

Fig. 2.20 shows the FOM (complete links only) vs process length. It can be seen that power dissipation in serial links has been dominated by dynamic power dissipation, because the average power dissipation across each process node fits the cubic relation well. From the FOM vs year plot in Fig. 2.21, it is observed that there is approximately a 2X decrease in average FOM every 2 years, i.e., a compounded decrease of  $\approx 30\%$  every year. The FOM of the best designs is decreasing by approximately 25% every year. The best designs (with lower FOM) are shown with larger markers. Designs achieving FOM < 10 mW/Gbps are [39, 28, 14, 40, 41, 42, 43, 44]. The major novelties of these designs were (a) use of a low power voltage mode driver, (b) software based CDR where edge samplers are operated at very low frequency, (c) passive equalization using R+L loading and (d) transmit swing scaling with data rate.
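The relation between the two-year halving trend and the quoted yearly figure is a short piece of arithmetic, sketched below together with the FOM definition. The function names and the example power and bitrate values are hypothetical.

```python
def fom_mw_per_gbps(power_mw, bitrate_gbps):
    """Figure of merit; 1 mW/Gbps is numerically equal to 1 pJ/bit."""
    return power_mw / bitrate_gbps

def annual_reduction(factor, years):
    """Compounded yearly fractional decrease when the FOM drops by `factor` over `years`."""
    return 1.0 - factor ** (-1.0 / years)

yearly = annual_reduction(2.0, 2.0)        # 1 - 1/sqrt(2), roughly 29%
example_fom = fom_mw_per_gbps(20.0, 4.0)   # hypothetical 20 mW link at 4 Gbps
```

A 2X decrease over two years therefore corresponds to a compounded decrease of about 29% per year, in line with the approximate 30% figure stated above.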

#### 2.5.2 Clocking Schemes

To find the predominantly used clocking architecture, the maximum bitrate was plotted vs the maximum on-chip clock frequency. As shown in Fig. 2.22, the majority of the designs reported are full rate. An important observation from this graph is that at high data rates the tendency has been to use half/quarter rate architectures: the *density* of half/quarter rate markers is dominant at higher data rates. This can be explained by the fact that the frequency of a full-rate clock is twice that of a half-rate clock, so on-chip clock generation and distribution tend to become an issue at higher data rates in both the transmitter and receiver.

Using Fig. 2.23, it was deduced that the power efficient designs (shown in large markers) in a given technology are multi-rate (mostly half-rate). The only design exception in the plot which is full rate based and has the best reported power FOM in a given process (0.25  $\mu$ m process) is [45]. The power efficiency of this design can be attributed to the use of simultaneous bidirectional signalling technique, in a 244 cell I/O chip.



Figure 2.22: Data signalling rate plotted as a function of maximum on-chip clock frequency

Another classification in clocking distinguishes between forwarded clock (mesochronous) and embedded clock (plesiochronous) schemes. As shown in Fig. 2.24, the majority of the designs reported in literature are mesochronous, and most of the technology's



Figure 2.23: FOM in full/half/quarter rate links

best FOM designs are also mesochronous links (shown in large markers). The logical explanation for the lower FOM is that the CDR unit is relieved of the frequency detection task and has only phase detection to accomplish.

#### 2.5.3 Signalling Format

Widely used signalling schemes include NRZ, PAM4 and SBD. From Fig. 2.25, it can be seen that the majority of the links use the NRZ signalling scheme in the presence of high channel attenuation. PAM4 and SBD have been restricted to channels with attenuation lower than 12 dB due to the reduced timing and voltage margins inherent to these signalling schemes and the need for a complicated equalization scheme. Though previous literature states that "benefits of PAM4 over NRZ can be



Figure 2.24: FOM in mesochronous/plesiochronous receiver based serial links

reaped only when channel loss changes by more than 9.5 dB due to 50% frequency reduction in PAM4 signalling [46]", the viewpoint has been contradicted with the argument that "PAM4 signaling is 3X more sensitive to un-compensated ISI and crosstalk than NRZ as the peak signal to error threshold ratio is 3 times higher in PAM4 than NRZ [47]".

In other words, BER performance of a serial link being a weighted function of the voltage and timing Q-functions, 3X decrease in voltage margin and 2X increase (practically, it is around 1.67X due to non-adjacent transitions as seen in Fig. 2.9) in timing margin will not yield similar BER performance compared to NRZ, assuming input referred voltage noise of the data samplers in NRZ and PAM4 is the same. The penalty paid for achieving similar performance is increased power dissipation in the PAM4 data samplers.



Figure 2.25: FOM versus channel attenuation for various signalling schemes

#### 2.5.4 Clock Recovery

Fig. 2.26 shows the FOM of serial links using VCO based and phase interpolation (PI) based clock recovery schemes. The majority of designs reported in literature are phase interpolation based. The best FOM links in a given technology also happen to be phase interpolation based. This can be explained by the fact that most PI based schemes have one global clock generator at the receiver end, so the clock generation power is amortized across multiple receivers on the same chip. The same approach is not possible in VCO based links because each receiver needs a dedicated VCO inside the loop for clock and data recovery.



Figure 2.26: FOM vs process length for VCO based and PI based serial links

# CHAPTER 3. SERIAL TRANSCEIVER USING CURRENT-RECYCLING

#### 3.1 Introduction

Serial link transceivers are widely employed in all high-speed digital communication systems. These transceivers must fulfill the requirements of low power consumption while operating over a wide frequency range. Plesiochronous mode of operation is desirable as it offers flexibility in the choice of local crystal oscillators. In addition to the above features, a fully integrated solution with on-chip loop filters for both the PLL and CDR helps reduce external component count. The focus of our work is to reduce power dissipation without compromising other desirable features. With this goal, a brief review of low power techniques reported in prior literature is presented first.

#### 3.2 Low power techniques for serial transceivers

#### 3.2.1 Switched Regulation

Based on the frequency of operation of the serial transceiver, switched regulation has been used to generate the optimal supply voltage  $V_{REG}$  in this prior design [48]. As shown in Fig. 3.1, negative feedback in the switched regulator loop forces the reference ring oscillator to oscillate at a frequency equal to  $F_{REF}$ . This



Figure 3.1: Switched Regulator generating supply voltage for transceiver

implies that the oscillator is biased at a voltage at which the serial transceiver can be operated. Though adaptive supply regulation is an efficient technique for dc-dc conversion, it needs external inductors and capacitors, which makes a fully integrated solution impractical. Furthermore, a wide operating data-rate range translates to a wide voltage range at the regulator output. This implies an increased design effort to ensure that all the transceiver blocks operate over a wide supply voltage range. Linear regulation is an attractive alternative that minimizes the need for external components.

#### 3.2.2 Linear Regulation

One of the simplest ways to reduce the power of any switching circuit is to operate it at a lower supply voltage. A low-dropout regulator is shown in Fig. 3.2. It can be used to generate the lower supply voltage, denoted as  $\alpha V_{DD}$. In this figure, we can see that the two logic circuits operating at a supply voltage of  $\alpha V_{DD}$  draw



Figure 3.2: Linear Regulator generating  $\alpha V_{DD}$ 

currents  $I_1$  and  $I_2$  respectively. The total current  $I_1+I_2$  has to flow through the PMOS device which leads to power loss, denoted as  $P_{loss}$ . The efficiency of this conversion process can be calculated as the ratio of the power delivered to the total power in the circuit. This is given by

$$\text{Efficiency [\%]} = \frac{P_{delivered}}{P_{delivered} + P_{loss}} = \frac{\alpha V_{DD}(I_1 + I_2)}{\alpha V_{DD}(I_1 + I_2) + (1 - \alpha) V_{DD}(I_1 + I_2)} = \alpha$$

In this topology, the efficiency turns out to be  $\alpha$, which is the fraction of the supply voltage at which we are operating. The efficiency of linear regulation is therefore poor precisely when the supply voltage is lowered substantially to reduce power dissipation. To overcome the reduced efficiency of linear regulation and the need for external inductors and capacitors in switched regulation, current recycling was proposed as a solution [49].



Figure 3.3: Current recycling between stacked logic circuits

Fig. 3.3 shows current recycling between two stacked switching logic circuits. Current  $I_1$  drawn by the top switching circuit is recycled in the bottom half. The bottom circuit needs a current  $I_2$  and the *push-pull regulator* either sources or sinks the difference between currents  $I_1$  and  $I_2$. Since the regulator now only needs to account for the difference current, this technique has a very high efficiency. In the case where both the currents are exactly equal, the conversion process is 100% efficient.
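The efficiency advantage of stacking can be illustrated with an idealized model. The sketch below compares the linear regulator result derived above with a simple stacked model in which the supply must deliver the larger of the two currents and the push-pull regulator dissipates the difference; regulator quiescent current and all other losses are ignored. This model and its function names are assumptions made for illustration and do not come from measurements.

```python
def ldo_efficiency(alpha):
    """Linear regulator efficiency when the load runs at alpha * VDD."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha

def stacked_efficiency(i1, i2, alpha=0.5):
    """Idealized efficiency of two stacked blocks with the mid rail at alpha * VDD.

    The top block draws i1 across (1 - alpha) * VDD and the bottom block draws
    i2 across alpha * VDD. The supply current is assumed to be max(i1, i2),
    since the regulator sources or sinks only the difference current.
    """
    delivered = (1.0 - alpha) * i1 + alpha * i2   # normalized to VDD
    drawn = max(i1, i2)                           # normalized to VDD
    return delivered / drawn

matched = stacked_efficiency(1.0, 1.0)      # equal currents: no regulator loss
mismatched = stacked_efficiency(1.0, 0.8)   # 20% current mismatch
linear = ldo_efficiency(0.5)                # same half-supply operating point
```

Under these assumptions a 20% current mismatch still yields about 90% efficiency, compared with 50% for a linear regulator generating the same half-supply rail, which illustrates why good current matching between the stacked blocks matters.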

In a prior design [1], the current recycling technique was used in a fractional-N frequency synthesizer. As shown in Fig. 3.4, the LC VCO and the fractional-N divider are stacked. While power reduction was demonstrated, matching the currents of two completely different blocks is challenging. In our work, to guarantee excellent current matching under all conditions, we propose to stack two identical



Figure 3.4: Current recycling between stacked *unbalanced* circuits in a PLL [1]

blocks in a serial link transceiver and demonstrate high power efficiency.

#### 3.3 Proposed Current Recycling Approach

Fig. 3.5 shows the implementation of a conventional plesiochronous serial transceiver. The transmitter portion is shown at the top and the receiver at the bottom. In the highlighted clocking circuits, it should be observed that on the transmitter side a PLL generates the high frequency clock and on the receiver side an identical loop generates the recovered clock. Because they use identical blocks and operate at identical frequencies, we expect them to consume nearly the same power. This makes them the most suitable candidates for current recycling. Because clocking circuits consume a large portion of the link power, this choice of stacking greatly helps in improving power efficiency.

Fig. 3.6 illustrates the different voltage domains in which each of the blocks of our transceiver work. Current drawn by the receive frequency locking loop denoted as RxFLL is recycled in the transmit PLL (TxPLL). As a result, the



Figure 3.5: Conventional plesiochronous transceiver with *similar* blocks highlighted



Figure 3.6: Proposed transceiver with current recycling among *clock generation* circuits

TxPLL operates between the 0 and 0.6V rails and the RxFLL operates between the 0.6 and 1.2V rails. The rest of the transmitter and receiver circuits operate between the 0 and 1.2V rails.

#### 3.4 Proposed System Architecture



Figure 3.7: Proposed transceiver architecture with current recycling

The proposed transceiver architecture is shown in Fig. 3.7. Highly digital circuits are used to achieve low voltage operation and overcome the large area requirements of the loop filter capacitors in traditional analog implementations. The implementation details will be presented in the forthcoming section.

The highlighted part in Fig. 3.8 indicates the transmitter portion of the serial



Figure 3.8: Transmitter implementation in the proposed transceiver

link. Shown at the bottom is a digital PLL operating between the 0 and 0.6V rails. It is a ring oscillator based PLL and generates a 50% duty cycle clock at half the data rate. The clock outputs are level shifted to 1.2V swings and provided to the data pattern generator. The pattern generator is capable of generating PRBS7 and PRBS31 patterns in addition to an alternating data sequence. The half rate pattern generator output is multiplexed to full rate and a low-power voltage mode driver transmits the full-rate data onto the channel.
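For reference, a PRBS7 pattern can be modelled in software with a 7-bit linear feedback shift register. The sketch below assumes the commonly used feedback from the two most significant register stages (often written as the polynomial $x^7 + x^6 + 1$); the polynomial, the seed and the function name are assumptions, since the details of the on-chip generator are not stated here.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS7 sequence from a 7-bit Fibonacci LFSR."""
    if seed & 0x7F == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    state = seed & 0x7F
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of stages 7 and 6
        state = ((state << 1) | new_bit) & 0x7F       # shift the new bit in
        bits.append(new_bit)
    return bits

pattern = prbs7(2 * 127)   # two full periods of the 127-bit sequence
```

The sequence repeats every $2^7 - 1 = 127$ bits, and a checker built on the same recurrence can flag a bit error whenever a received bit disagrees with the XOR of the bits received six and seven positions earlier.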

The receiver portion of the serial link is highlighted in Fig. 3.9. Shown on the right is a continuous time linear equalizer (CTLE) that provides high frequency boost and mitigates ISI introduced by the short PCB trace. A half rate bang-bang phase detector (!!PD) generates 3-level phase error information by performing early/late detection on 4 samples of the equalized data. A digital loop



Figure 3.9: Receiver implementation in the proposed transceiver

filter processes the phase error and drives the DCO towards phase lock. Prior to phase-locking, the frequency locking loop denoted as RxFLL, is responsible for initial frequency acquisition in the DCO. The RxFLL operates between 0.6 to 1.2V rails. Similar to the transmitter, the DCO output clocks are level shifted to full swing before feeding to the samplers. The recovered half-rate data is verified using an on-chip PRBS verifier.



Figure 3.10: Transmit PLL implementation

#### 3.5 Implementation Details

#### 3.5.1 Transmit PLL

The block diagram of the transmit PLL is shown in Fig. 3.10. It is a multiply-by-4 PLL which operates between the 0 and 0.6V rails and generates a 0.25 to 2 GHz half rate clock. A conventional 3-state PFD directly drives the DCO and implements the proportional control portion of the PI filter. Even though the PFD outputs are digital pulses, they contain linear phase error information and do not introduce any phase quantization error in the proportional path. Consequently, this approach eliminates the detrimental effects of finite TDC resolution in conventional digital PLLs. In the integral path, a flip-flop acts as a bang-bang phase detector and performs early/late detection on the UP and DOWN outputs of the PFD. The flip-flop drives a digital accumulator and implements the integral control. The large quantization error of the flip-flop is suppressed by the low bandwidth of the integral path. This low bandwidth also guarantees a heavily over-damped response. The digitally controlled oscillator is the most critical block in the PLL. Its design at very low voltage poses many challenges. The next section discusses the design details of a very low voltage, wide operating range DCO.

#### 3.5.2 Low Voltage Digitally Controlled Oscillator (DCO)



Figure 3.11: Low voltage DCO implementation

A two stage ring oscillator using pseudo-differential delay cells is employed in this design. The schematic of the DCO is shown in Fig. 3.11. Operating with a 0.6V supply, this oscillator achieves a tuning range of 0.1 to 2.7 GHz in simulations. A delta-sigma DAC is used to convert the digital integral control word to an analog voltage. Transistors M3 and M4 convert the DAC output voltage to a current and control the oscillation frequency. In steady state, the integral control brings the oscillator to the desired frequency  $F_{nom}$  as shown in the graph on the right. The proportional control is implemented by controlling the DCO frequency directly with the PFD outputs. Corresponding to the 3 output states of the PFD, i.e. UP, DN and no change, the DCO frequency takes on 3 values,  $F_{nom}+F_{UP}$,  $F_{nom}-F_{DN}$  and  $F_{nom}$  respectively. This behavior is illustrated in the plot shown on the right. Because of the heavily over-damped response of the digital PLL, its bandwidth depends solely on the proportional path gain. Under this condition, the PLL bandwidth can be shown to be

$$\text{PLL Bandwidth [Hz]} = \frac{0.5(F_{UP} + F_{DN})}{2\pi N}$$

The PLL bandwidth is set to the desired value by adjusting the magnitude of the proportional frequency steps  $F_{UP}$  and  $F_{DN}$ .
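As a numerical illustration of the bandwidth expression, the sketch below evaluates it for the multiply-by-4 case. The proportional step values are hypothetical and were chosen only to give a round example; they are not the values used in the prototype.

```python
import math

def pll_bandwidth_hz(f_up_hz, f_dn_hz, n_div):
    """Proportional-path-dominated bandwidth: 0.5 * (F_UP + F_DN) / (2 * pi * N)."""
    if n_div <= 0:
        raise ValueError("n_div must be positive")
    return 0.5 * (f_up_hz + f_dn_hz) / (2.0 * math.pi * n_div)

# Assumed symmetric 25 MHz proportional steps with N = 4 give roughly 1 MHz bandwidth.
bw = pll_bandwidth_hz(25e6, 25e6, 4)
```

Since the bandwidth scales linearly with the average step size, doubling $F_{UP}$ and $F_{DN}$ doubles the bandwidth for a fixed multiplication factor.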

#### 3.5.3 Integral Path DAC

Fig. 3.12 shows the implementation of the integral path. The 14-bit DAC is realized using a digital delta-sigma modulator and a 4-bit R-2R DAC. The digital modulator is implemented using a second order error feedback structure. It truncates the 14-bit integral control word to 4 bits. These 4 bits are converted to an analog voltage using an R-2R DAC. Leveraging the constant output impedance of the R-2R structure, the shaped quantization noise from the digital modulator is filtered using a passive second order low pass filter.
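The truncation performed by the modulator can be sketched behaviorally. The model below is a generic second order error-feedback structure with noise transfer function $(1 - z^{-1})^2$, written with unsigned 14-bit inputs, rounding, and clamping of the 4-bit output code; these details are assumptions made for illustration and may differ from the implemented logic.

```python
def delta_sigma_14_to_4(samples):
    """Second-order error-feedback modulator: 14-bit words in, 4-bit codes out."""
    lsb = 1 << 10              # one 4-bit output step expressed in 14-bit units
    e1 = e2 = 0                # truncation errors of the previous two samples
    codes = []
    for x in samples:
        v = x - 2 * e1 + e2                               # subtract shaped past errors
        code = min(15, max(0, (v + lsb // 2) // lsb))     # round, then clamp to 4 bits
        e2, e1 = e1, code * lsb - v                       # error added by this truncation
        codes.append(code)
    return codes

# A constant mid-range control word: the average of the 4-bit stream tracks the input.
word = 5000
codes = delta_sigma_14_to_4([word] * 4096)
average = sum(codes) * 1024 / len(codes)
```

Each output equals the input plus a second-difference of the truncation error, so the error averages out at low frequencies and is pushed to high frequencies, where the passive low pass filter following the R-2R DAC can remove it.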

#### 3.5.4 Voltage-mode Transmitter

Fig. 3.13 shows the block diagram of the transmitter used in the proposed prototype. Half rate random data generated by the PRBS generator is converted



Figure 3.12: Integral path implementation



Figure 3.13: Transmitter implementation

to full rate by a 2:1 multiplexer. The full rate data is buffered by a pre-driver and then transmitted using a low power voltage mode driver. The pre-driver supply voltage that ensures a  $50\Omega$  output impedance of the voltage mode driver is generated using an impedance locked loop.

An impedance locked loop (shown in Fig. 3.14) operates in the phase domain and sets the predriver supply voltage to achieve a 50 $\Omega$  output impedance. The impedance locked loop, implemented as a simple phase locked loop, tunes the gate voltage of transistor M1' and matches the phase shift in the R<sub>T</sub>-C and M1'-C paths. The square wave input is generated by switching the input of the impedance locked loop between V<sub>SW</sub> and ground. A StrongArm flip-flop acts as a bang-bang phase detector and an RC filter is used to suppress ripple in the control voltage. As a result of matching the voltage mode driver transistor M1 to M1' and limiting the output swing of the predriver to V<sub>PDRV</sub>, a 50 $\Omega$  termination impedance is achieved. Because there is no static current, the proposed tuning consumes lower power compared to conventional voltage-based schemes.

#### 3.5.5 Receiver Front-end

Fig. 3.15 shows the receiver front-end circuitry. The channel output is AC coupled and the desired input common mode voltage is set using on-chip termination resistors. A source-degenerated continuous time linear equalizer (CTLE) provides high frequency boost and compensates for moderate channel loss. Four StrongArm flip-flops are used to sample the equalized signal, and the resulting data and edge samples are processed by the phase detector to determine the phase error in the form of early or late (E/L) signals.
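The way data and edge samples translate into E/L decisions can be sketched as follows. This is a minimal model of the conventional bang-bang (Alexander-style) decision rule, assumed here for illustration; the function name and the tuple encoding of the decisions are hypothetical.

```python
def bang_bang_pd(d_prev, edge, d_next):
    """Return (early, late) for one pair of consecutive data samples.

    d_prev, d_next: consecutive data samples (0 or 1)
    edge: sample taken by the edge clock between the two data samples
    """
    if d_prev == d_next:
        return (0, 0)        # no data transition, so no phase information
    if edge == d_prev:
        return (1, 0)        # edge clock sampled before the transition: early
    return (0, 1)            # edge sample already shows the new bit: late


# Hypothetical usage on a rising data transition
decision_early = bang_bang_pd(0, 0, 1)
decision_late = bang_bang_pd(0, 1, 1)
```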



Figure 3.14: Impedance locked loop that ensures the output termination is  $50\Omega$ 



Figure 3.15: Receiver front-end implementation

#### 3.5.6 CDR: FLL and PLL Loops



Figure 3.16: CDR FLL Implementation

The CDR has two important loops: the frequency locking loop, denoted as FLL, and the phase locking loop, denoted as PLL. The main purpose of the FLL is to bring the DCO close to the desired frequency before phase locking is performed using the E/L decisions from the phase detector. The receive FLL operating between 0.6 and 1.2 V rails is shown in Fig. 3.16. The architecture is the same as the TxPLL except that the PFD is replaced with a cycle slip detector, denoted as CSD in the diagram. The high degree of similarity between the TxPLL and RxFLL helps to achieve good current-recycling efficiency.



Figure 3.17: CDR PLL Implementation

The PLL portion of the CDR is shown in Fig. 3.17. After the FLL brings the DCO close to the incoming data rate frequency, phase locking is achieved by using the E/L decisions from the bang-bang phase detector. Residual frequency error between the FLL locked frequency and the incoming data rate is accounted for by the CDR integral path control word  $D_{\Delta F}$ .

## 3.6 Measurement Results

The prototype link is fabricated in a 1.2V 90nm CMOS process. Its die micrograph is shown in Fig. 3.18. The active area is 0.66mm<sup>2</sup> and the die was packaged in a 64-pin TQFP package.



Figure 3.18: Die Micrograph

The plot in Fig. 3.19 shows the measured power dissipation in the stacked portion of clock generation blocks versus the output frequency. Current drawn by the TxPLL operating out of 0-0.6V rails and RxFLL operating out of 0.6-1.2V rails was measured across the entire operating range of 0.25 to 2GHz. At 2GHz, the total power in both the blocks combined is only around 1.8mW. The efficiency of current recycling which translates to the dc-dc conversion efficiency is shown in Fig. 3.20. Good matching between the currents consumed in the TxPLL and the RxFLL leads to better than 90% efficiency across the entire operating range.

Figure 3.19: Measured power dissipation in the *stacked* clock generation circuits

Figure 3.20: Measured efficiency of dc-dc conversion due to current recycling

Figure 3.21: Measured transmit PLL jitter at 1.6GHz

The transmit PLL clock jitter at 1.6GHz is shown in Fig. 3.21. The long term accumulated jitter measured over one hundred thousand hits is 4.65  $ps_{rms}$  and 40  $ps_{pk-pk}$ . The measured phase noise plot at 1.6 GHz is shown in Fig. 3.22. At 1 MHz offset, the digital PLL achieves a phase noise of -106 dBc/Hz and the rms jitter integrated from 1 kHz to 1 GHz is 4.4ps<sub>rms</sub>.

The transmit PLL jitter generation across the entire link operating range is shown in Fig. 3.23. The normalized long-term rms jitter is less than 1% for output frequencies down to 1GHz. The best case transmit PLL jitter is measured at 2 GHz. At this frequency, the transmit PLL jitter is 3.5 ps<sub>rms</sub> and 27.6 ps<sub>pk-pk</sub>. At lower frequencies, jitter is dominated by the DCO frequency quantization error. This is mainly due to the noise leakage from the $\Delta\Sigma$-DAC and could be reduced either by increasing its sampling frequency or by using a post filter with a lower cutoff frequency.



Figure 3.22: Measured transmit PLL phase noise at 1.6GHz



Figure 3.23: Measured PLL jitter generation across entire operating range



Figure 3.24: Measured transmitter eye diagrams (3.2Gbps) at different swing levels

The transmitter eye diagram measured at 3.2Gbps is shown in Fig. 3.24. When the voltage mode output driver is set to transmit a 100 mVpp,diff swing, the measured swing at the receiver input is 69mV and the horizontal eye opening is 267ps. Excluding the transmit PLL jitter contribution, the eye opening indicates an additional 5 $ps_{pk-pk}$ jitter induced by the transmitter driver circuitry. When the output swing is set to 200 mVpp, the measured vertical eye opening at the receiver is 143mV and there is negligible reduction in the horizontal eye opening. This negligible reduction with a large swing indicates that reasonably well matched termination is achieved at different output swing levels.
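The jitter attributed to the driver can be cross-checked with a rough calculation. It assumes that peak-to-peak jitter contributions subtract linearly, which is a simplification, and it uses the 40 $ps_{pk-pk}$ transmit PLL jitter measured at 1.6GHz; the symbols $T_{UI}$, $J_{total}$ and $J_{driver}$ are introduced here only for this check.

$$
T_{UI} = \frac{1}{3.2\,\mathrm{Gbps}} = 312.5\,\mathrm{ps}, \qquad
J_{total} = T_{UI} - 267\,\mathrm{ps} = 45.5\,\mathrm{ps}_{pk-pk}, \qquad
J_{driver} \approx J_{total} - 40\,\mathrm{ps} = 5.5\,\mathrm{ps}_{pk-pk}
$$

Under these assumptions the result is in line with the roughly 5 $ps_{pk-pk}$ quoted for the driver circuitry.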

The measured BER curve obtained by sweeping the recovered clock phase across the received eye is shown in Fig. 3.25. The CDR operates with no bit errors between sampling clock phases of 0.3 and 0.7 UI, resulting in an effective error-free eye opening of 0.4UI.

The recovered half rate 1.6GHz clock and one of the 1.6Gbps half-rate data streams are shown in Fig. 3.26. The measured recovered clock jitter shown on the right indicates a jitter of 11.7  $ps_{rms}$  and 69  $ps_{pk-pk}$ .



Figure 3.25: Measured BER at the receiver input



Figure 3.26: Measured recovered clock jitter and half-rate data stream



Figure 3.27: Measured power dissipation of the serial transceiver at different datarates

The measured power dissipation of the transceiver at 4 different data rates is shown in Fig. 3.27. The total power when operating at 0.5Gbps is 3.3 mW and when operating at 4Gbps it is 6.9mW. At low data rates, the power dissipation is dominated by the voltage mode driver in the transmitter and the CTLE in the receiver. Current recycling in the clock generation circuits helped reduce the power dissipation at all frequencies.

|                    | JSSC'07 [28] | JSSC'08 [14]       | This Work [50] |
|--------------------|--------------|--------------------|----------------|
| Technology         | 90nm         | 65nm               | 90nm           |
| Supply voltage     | 1V (Fixed)   | 0.68-1.05V         | 1.2V (Fixed)   |
| Data Rate          | 6.25Gbps     | 5-15Gbps           | 0.5-4Gbps      |
| Implementation     | Half-rate    | Half-rate          | Half-rate      |
| Clocking           | Mesochronous | Source-synchronous | Plesiochronous |
| Link FOM [mW/Gbps] | 2.2          | 2.7                | 1.9            |

Table 3.1: Performance Comparison

Table 3.1 compares this work to two low-power serial transceiver designs in the literature. The first design by Palmer [28], operating at a fixed data rate, achieves an excellent FOM of 2.2mW/Gbps. By opting for a mesochronous mode of operation, the authors reduced the receiver power dissipation by periodically turning off edge samplers and amortizing the transmit PLL power among multiple transceivers. The second design by Balamurugan [14] demonstrated a wide operating range transceiver but resorted to the use of a variable supply voltage. Also, the source-synchronous nature of this transceiver mandates an additional high speed clock lane. The presented design operates with a fixed 1.2V supply voltage and is capable of operating in the plesiochronous mode. The achieved power efficiency is 1.9mW/Gbps.

In conclusion, current recycling serves as an attractive approach to implementing highly efficient DC-DC conversion in the context of serial links. We presented low voltage digital PLL and CDR architectures that are particularly amenable to stacking. Improved power efficiency comes at the cost of reduced flexibility, since the building blocks must operate at different voltage levels. Finally, the measured results obtained from the prototype low power serial link validate the proposed ideas.

# CHAPTER 4. REFERENCE-LESS CLOCK AND DATA RECOVERY

# 4.1 Introduction

Clock and data recovery (CDR) is a critical task in all serial communication systems. CDR circuits are employed in a wide range of applications including optical transceivers, chip-to-chip interconnects and backplane communications. The main objective of the CDR is to recover the data from the serial input bit stream in an error-free, power and cost-efficient manner. CDRs with wide frequency acquisition range offer flexibility in optical communication networks, help reduce link power through activity-based rate adaptation, and minimize cost with a single-chip multi-standard solution. Extracting the bit rate from the incoming random data stream is the main challenge in implementing such reference-less CDRs.

In this chapter, we present a reference-less half-rate CDR that uses a subharmonic extraction method to achieve unlimited frequency acquisition range. This technique is capable of locking the CDR to within 40ppm of any sub-rate of the data (making it applicable for any sub-rate CDR architecture), while being immune to undesirable harmonic locking. This CDR also integrates a calibration loop to improve robustness to input duty-cycle error.

The chapter is organized as follows. Section 4.2 provides a brief overview of conventional reference-based and reference-less CDR architectures. The proposed CDR architecture is outlined in Section 4.3 and the two important contributions of the design – a) reference-less frequency acquisition and b) clock phase calibration in the presence of input data duty-cycle error are presented. Details of the reference-less frequency acquisition technique are described in Section 4.4. Section 4.5 describes the origins of input data duty-cycle error and a clock phase calibration technique for optimal sampling of the incoming random data stream. Circuit design details of the important building blocks of the CDR are presented in Section 4.6. Section 4.7 summarizes the measurement results from the test chip.

# 4.2 Conventional Clock and Data Recovery Circuit Architectures



Figure 4.1: Block diagram of a conventional CDR using a reference clock.

A conventional reference-based clock and data recovery circuit is shown in Fig. 4.1. It consists of a frequency-locked loop denoted as FLL and a phase-locked loop denoted as PLL. A reference clock generated by a crystal oscillator is used in the FLL to drive the VCO frequency towards the desired data rate. After the initial frequency acquisition, the PLL achieves phase lock and the VCO clock is driven to the center of the incoming data eye. Two separate loop filters, $LP_{FLL}$ and $LP_{PLL}$, independently set the loop dynamics of the FLL and PLL, respectively. While a reference-based CDR simplifies the design, the need for a crystal oscillator incurs additional cost. Furthermore, the CDR operating range is limited to only a few discrete frequencies dictated by the divider ratio and available crystal frequencies. A reference-less CDR obviates the need for an additional clock source and is capable of operating continuously over a wide range of frequencies.



Figure 4.2: Block diagram of a reference-less CDR.

The block diagram of a reference-less CDR is shown in Fig. 4.2. In this architecture, the FLL drives the VCO towards frequency lock by directly extracting the frequency error from the incoming random data $D_{IN}$. Consequently, this architecture can operate continuously over a wide range of data rates. The biggest challenge in implementing such a reference-less CDR is the design of a frequency detector that is capable of extracting any error between the VCO frequency and the input data rate. In practice, the CDR's frequency acquisition range is typically limited by the frequency detector. For instance, the detection range of a commonly used rotational frequency detector is limited to about 50% of the data rate [51, 16]. Additionally, none of the conventional architectures are amenable to sub-rate (e.g. half-rate) operation [16, 52, 53, 54, 18, 55, 56, 57]. In this chapter, we present a frequency detector that has unlimited detection range and is also suitable for sub-rate CDR operation.

Analog loop filters used in both reference-based and reference-less CDRs also pose several implementation difficulties. Very large capacitors, on the order of nanofarads, are needed to meet the low jitter transfer bandwidth and small peaking requirements mandated by many communication standards. Such capacitors, when implemented on chip, consume a large area and are PVT sensitive, and when implemented using high density MOSCAPs are prone to leakage. Consequently, external capacitors are often used to overcome these issues [18]. In this work, we seek to use a fully integrated digital loop filter (DLF) to overcome the drawbacks of analog loop filters.

#### 4.3 Proposed Reference-less Half-rate CDR Architecture

The proposed digital CDR architecture is shown in Fig. 4.3 [58]. It is composed of a frequency locking loop (FLL), a phase-locking loop (PLL), and a data duty-cycle correction loop (DCCL). The FLL consisting of a frequency detector (FD), an accumulator (ACC), and a  $\Delta\Sigma$  digital-to-analog converter (DAC) drives the digitally controlled oscillator (DCO) towards frequency lock. When the frequency error is within the PLL's pull-in range, it acquires phase lock. Because both the FLL and the PLL are driven by the incoming data, they can operate simultaneously, hence obviating the need for a lock detector. However, to prevent any interaction between the two loops, the FLL bandwidth is made much smaller than that of the PLL.



Figure 4.3: Proposed CDR architecture.

The digital PLL consists of a half-rate bang-bang phase detector (!!PD) and a digital loop filter (DLF) that drives the DCO. The DLF is designed by transforming the analog loop filter into the digital domain as shown in Fig. 4.4 [59]. In conventional analog implementations, the loop filter is a simple series-connected resistor-capacitor network with the voltages across the resistor and the capacitor implementing the proportional and integral control, respectively. In the digital loop filter, the proportional and integral control paths are implemented separately and the summing of the two control signals is performed in the analog domain. This topology eliminates the need for one adder and reduces the dithering jitter by minimizing latency in the proportional path [60]. The bang-bang phase detector compares input data edges and the DCO clock phases and generates two pairs of early/late (E/L) decisions. The (E/L) signals drive the DCO through a 5-level current-mode DAC (+2I,+I,0,-I,-2I) and implement the proportional control. Because the phase-locking portion of the proposed CDR is identical to a conventional bang-bang CDR, its jitter tolerance (JTOL) profile would be similar to that of a conventional CDR.

Figure 4.4: Digital loop filter implementation.

The integral path consists of an accumulator followed by a high-resolution DAC. The circuit complexity of a high-resolution DAC is relaxed by using a low-resolution DAC driven by a digital  $\Delta\Sigma$  modulator [61, 62]. The highly digital loop filter is PVT insensitive and is fully synthesized in our design. The E/L data are decimated by a factor of 16 before feeding to a low frequency accumulator.

The duty-cycle error in the transmit clock of a half-rate transmitter manifests itself as unequal widths of the transmitted even/odd data eyes. Sampling this incoming data with equally spaced clock phases (0.5UI in a half-rate receiver) reduces receiver timing margin. To alleviate this, the digital DCCL estimates the input data duty-cycle error and calibrates the DCO clock phases for optimal sampling of both the even and odd data bits. At start up, the early/late decisions of a conventional half-rate bang-bang phase detector are analyzed to deduce information about the *sign* of incoming random data duty-cycle error. The DCCL utilizes a very low bandwidth Type-I loop to adjust the delay on each of the 4 clock phases I/Q/Ib/Qb and improves timing margin.

#### 4.4 Proposed Frequency Detector

A conventional rotational frequency detector (RFD) and its variants reported in [16, 52, 53, 18] rely on *sampling* the oscillator clock phase with the incoming random data to determine the frequency error. This sampling phenomenon causes nulls in the RFD transfer characteristic as shown in Fig. 4.5, and limits the acquisition range to about $\pm 50\%$ of the data rate. Further, RFD based frequency acquisition loops require a full-rate clock with I/Q phases, which makes this frequency detector power hungry and unsuitable for sub-rate CDR topologies. In this work, we propose a frequency detector which extracts a low frequency sub-harmonic clock from the incoming data. Using the extracted tone as a reference in a digital FLL locks the oscillator frequency to the data rate. The proposed FLL can be used to implement any sub-rate CDR architecture.

Figure 4.5: RFD transfer characteristic.

Before presenting the proposed frequency detection scheme, it is instructive to evaluate the transition probabilities of alternating and random data patterns. As shown in Fig. 4.6 for alternating data, the probability of both $1\rightarrow 0$ and $0\rightarrow 1$ transitions is equal to 0.5. Being a clock pattern, a $1\rightarrow 1$ or a $0\rightarrow 0$ transition never occurs and the associated probability is therefore 0. On the other hand, in a random data pattern, all four possible transitions are equally likely, with a probability of 0.25. An important observation is that, compared to the alternating data case, the probability of $1\rightarrow 0$ and $0\rightarrow 1$ transitions is exactly halved. We use this fact along with a digital accumulator to implement a frequency detector with unlimited acquisition range.
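The transition probabilities quoted above can be checked numerically with a short script. It is only a sketch: the helper name, the sequence length, and the fixed seed are arbitrary choices, and Python's pseudo-random generator stands in for the incoming random data.

```python
import random


def transition_probabilities(bits):
    """Fraction of adjacent bit pairs falling in each of the four transition classes."""
    counts = {"0->0": 0, "0->1": 0, "1->0": 0, "1->1": 0}
    for a, b in zip(bits, bits[1:]):
        counts[f"{a}->{b}"] += 1
    total = len(bits) - 1
    return {k: v / total for k, v in counts.items()}


rng = random.Random(1)                                  # fixed seed for repeatability
alternating = [i % 2 for i in range(100_001)]           # clock-like 0101... pattern
random_bits = [rng.randint(0, 1) for _ in range(100_001)]

p_alt = transition_probabilities(alternating)
p_rnd = transition_probabilities(random_bits)
```

The alternating pattern yields 0.5 for each of the two edge types and 0 for the others, while the random pattern yields values close to 0.25 for all four classes.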

To understand how the process of accumulation helps in frequency extraction, consider the case of a 10-bit counter clocked by 2 different inputs: alternating data and random data. Since the counter counts only on the positive edge, the time taken to count from 1 to 1024 with alternating data is 2048 unit intervals (UI). In other words, when clocked with alternating data, counter roll-over happens every 2048UI, as shown in Fig. 4.7. Now consider the case in which the same counter is clocked with random data. Because the transition density in random data compared to alternating data is exactly halved as discussed earlier, the accumulation rate is also halved. As a result, the counter roll-over happens every 4096UI, as shown in Fig. 4.8. Deviation in the transition density from the ideal value of 0.25 in the random data leads to a proportional error in the elapsed time to reach a count of 1024. As explained later, this error will introduce a fixed offset at the output of the proposed frequency detector. It can be shown that the ripple in the frequency of the extracted sub-harmonic exhibits an inverse dependence on the divider value. Though choosing a large divider aids in reducing the ripple in frequency, it results in a low sub-harmonic extracted reference frequency which leads to a larger acquisition time.

Figure 4.6: Transition density in alternating and random data patterns.

Figure 4.7: 10-bit counter output clocked with alternating data.



Figure 4.8: 10-bit counter output clocked with random data.

A closer look at the counter output in Fig. 4.9 reveals that the most significant bit (MSB) toggles every 2048UI. In view of this, the 9 least significant bits of the counter output are discarded and the most significant bit, referred to as the *extracted sub-harmonic* henceforth, is used as the reference clock in the FLL for frequency locking. The counter is implemented as a cascade of 10 divide-by-2 stages as shown in Fig. 4.10. This realization is not only simple but also consumes little power. The simplicity of this structure makes it useful for sub-harmonic extraction even at higher data rates.
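The behavior of the divider cascade can be reproduced with a small behavioral model. This is a sketch rather than the circuit itself: each stage is idealized as a toggle element clocked by the rising edge of the previous stage, and the function names, sequence lengths, and seed are assumptions made for illustration.

```python
import random


def divide_by_2_chain(bits, stages=10):
    """Ripple cascade of divide-by-2 stages clocked by rising data edges.

    Returns the output of the last stage, one value per unit interval."""
    state = [0] * stages
    out = []
    prev = bits[0]
    for b in bits:
        rising = prev == 0 and b == 1
        prev = b
        k = 0
        while rising and k < stages:
            state[k] ^= 1                 # stage k toggles on its input rising edge
            rising = state[k] == 1        # a 0->1 toggle clocks the next stage
            k += 1
        out.append(state[-1])
    return out


def average_period(wave):
    """Average spacing, in UI, between rising edges of a binary waveform."""
    edges = [i for i in range(1, len(wave)) if wave[i - 1] == 0 and wave[i] == 1]
    return (edges[-1] - edges[0]) / (len(edges) - 1)


rng = random.Random(7)                                   # fixed seed for repeatability
random_bits = [rng.randint(0, 1) for _ in range(400_000)]
alternating_bits = [i % 2 for i in range(10_000)]

period_random = average_period(divide_by_2_chain(random_bits))
period_alternating = average_period(divide_by_2_chain(alternating_bits))
```

With alternating data the last stage has a period of 2048UI, and with random data the average period approaches 4096UI, with cycle-to-cycle variation caused by the randomness of the transitions.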



Figure 4.9: MSB of the 10-bit counter output.



Figure 4.10: Implementation of the high-speed 10-bit counter.

Despite the fact that random data has spectral nulls at all integer multiples of $F_B$, passing it through a chain of dividers extracts a sub-harmonic of $F_B$. This counterintuitive result can be better understood by reexamining the frequency detector behavior in the frequency domain. After passing the random data through one divide-by-2 stage, the resulting output PSD is given by the expression $S_{x_{div2}}(\omega) \approx \frac{4F_B}{\omega^2} \left(\frac{1}{4} - \frac{1}{8}\cos(\frac{2\omega}{F_B}) - \frac{1}{8}\cos(\frac{3\omega}{F_B}) - \frac{1}{16}\cos(\frac{4\omega}{F_B})\right)$. The plot of this expression is shown in Fig. 4.11 (the methodology for deriving the spectral density of the NRZ random data and of the output of the first divide-by-2 stage is given in the Appendix). The plot reveals an increase in power at a frequency of $\frac{F_B}{8}$. Adding another divide-by-2 stage, the output of the effective divide-by-4 stage now peaks at $\frac{F_B}{16}$. We make an important observation that each divide-by-2 stage causes a 6dB increase in the output PSD peak.



Figure 4.11: PSD at the output of the first three divide-by-2 stages.

Following the same line of argument, a divide-by-1024 stage shapes the input random data PSD to a tone at $\frac{F_B}{4096}$, as shown in Fig. 4.12. Intuitively, a 2<sup>10</sup> divider acts as an asynchronous modulo-$2^{10}$ counter whose output completes one cycle each time the number of low-to-high data transitions reaches $2^{10}$. Therefore, for binary random data with equal low-to-high and high-to-low transitions, the average frequency of the divider output is equal to $\frac{0.5F_B}{2^{10+1}}$. Having discussed the proposed sub-harmonic extraction scheme in detail, the implementation details of the frequency locking loop (FLL) are presented next.
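The average extracted frequency can be written out explicitly. In the expression below, $\rho_{0\rightarrow 1}$ denotes the probability of a low-to-high transition per unit interval and $N$ the number of divide-by-2 stages; both symbols are introduced here for illustration only.

$$
f_{sub} = \frac{\rho_{0\rightarrow 1}\,F_B}{2^{N}}, \qquad
\rho_{0\rightarrow 1} = \frac{1}{4},\; N = 10 \;\Rightarrow\; f_{sub} = \frac{F_B}{4 \cdot 1024} = \frac{F_B}{4096} = \frac{0.5F_B}{2^{10+1}}
$$

The same expression shows why a deviation of the transition density from its ideal value produces a proportional error in the extracted frequency.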



Figure 4.12: PSD after ten divide-by-2 stages.

The block diagram of the FLL is shown in Fig. 4.13. A conventional counting type frequency detector is used in this implementation. Frequency error is determined by finding the difference between the number of VCO periods in adjacent reference periods. The VCO clock is divided by 16 before feeding it into the 14-bit counter to relax counter speed requirements. Being a half-rate CDR architecture, the FLL must frequency lock the DCO to $\frac{F_B}{2}$. Because the extracted reference clock is at $\frac{F_B}{4096}$, under locked condition there will be $\frac{4096}{32} = 128$ divided DCO clock periods in each period of the reference clock. Deviation of the counter output from 128 is the measure of frequency error. A cascade of two registers (REG) is used to perform the $1-z^{-1}$ operation and the resulting frequency error is fed to the digital loop filter. The digital loop filter is composed of a digital accumulator whose output drives the DCO.



Figure 4.13: Block diagram of the FLL and the schematic of the frequency detector.

## 4.5 Improving tolerance to input duty-cycle error

Input duty-cycle error in a half-rate CDR reduces the timing margins and degrades its performance. Figure 4.14 illustrates the origins of the data duty-cycle error. When the transmitter is clocked with a 50% duty-cycle clock, the resulting transmit output eye is symmetric and both the odd and even eyes have the same width. But in the presence of a duty-cycle error in the transmit clock, the transmitter output inherits the clock's duty-cycle error. This manifests itself as an asymmetric transmitter eye as shown in the bottom half of Fig. 4.14.



Figure 4.14: Origin of duty-cycle error : Half-rate transmitter with and without clock duty-cycle error.

When the asymmetric eye is fed to the CDR, the steady state sampling points are suboptimal as depicted in Fig. 4.15. Note that the four phases, I/Q/Ib/Qb, are equally spaced as they are generated by a ring oscillator with nominally identical delay stages. In this work, we seek to calibrate each of the individual phases to maximize timing and voltage margins as shown in the bottom half of Fig. 4.15. The amount of phase shift required in the wake of input data duty-cycle error is best understood with an example. Consider the case where the input data eye has $\alpha$ UI duty-cycle error as shown in Fig. 4.16, i.e., the odd eye is $\alpha$ UI larger than the even eye. Under this condition, the clock phases are separated by 0.5 UI while the desired phase spacing must be as shown in column 3 of the table in Fig. 4.16. The new phase spacing guarantees that both Q and Qb phases are positioned at the center of the odd and even eyes at $0.5(1+\alpha)$ UI and $1.5+0.5\alpha$ UI, respectively.



Figure 4.15: Data/Edge sampling clocks before and after calibration in the presence of data duty-cycle error.

Hence, to ensure optimal sampling, Q/Qb and Ib need to be shifted by $0.5\alpha$ UI and $\alpha$ UI, respectively.
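The required shifts follow directly from the target sampling instants. The sketch below tabulates them for a given $\alpha$; the function name is hypothetical, phase I is taken as the time reference at 0 UI, and the Ib target of $1+\alpha$ UI is inferred from the $\alpha$ UI shift stated above.

```python
def calibrated_phases(alpha):
    """Target sampling instants and required shifts (in UI) for I/Q/Ib/Qb."""
    nominal = {"I": 0.0, "Q": 0.5, "Ib": 1.0, "Qb": 1.5}   # equally spaced phases
    target = {
        "I": 0.0,                     # locked to the data edge, not shifted
        "Q": 0.5 * (1 + alpha),       # center of the odd eye
        "Ib": 1.0 + alpha,            # edge between the odd and even eyes
        "Qb": 1.5 + 0.5 * alpha,      # center of the even eye
    }
    shift = {name: target[name] - nominal[name] for name in nominal}
    return target, shift


# Hypothetical usage: 0.1 UI duty-cycle error
target, shift = calibrated_phases(0.1)
```

For this example the Q and Qb shifts are each half of the Ib shift, which is the ratio the calibration circuit has to maintain.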

Figure 4.17 shows the details of the calibration scheme that drives the quadrature phases to optimal sampling positions. In the presence of an input duty-cycle error, there are 4 possible lock points in a half-rate CDR. The two edge sampling clocks I/Ib can lock to either the smaller or the larger eye, creating 4 different locking scenarios. In cases 1 and 2, phase I is locked to the edge while in cases 3 and 4, phase Ib is locked to the edge. While it is possible to adjust the phase spacing of each of the 4 phases independently, it is cumbersome and requires a complicated algorithm to perform the calibration. The calibration logic can be simplified by forcing the CDR to lock to only 2 of the 4 possibilities, as discussed next.

Figure 4.16: Required phase shift on sampling clocks I,Q,Ib,Qb with input data duty-cycle error = $\alpha$ UI.

Figure 4.17: Possible locking scenarios in a half-rate CDR.

Consider the case in which the early/late decisions corresponding to the even data input are ignored, as shown in Fig. 4.18. Because the phase error information corresponding to only the odd data bit is used, phase I always locks to the data edge, thus eliminating locking scenarios 3 and 4. We also note that in case 1, clock phases (Q/Ib/Qb) are always early while in case 2 the clocks (Q/Ib/Qb) are always late. This information of being *always* early or *always* late can be obtained by observing the unused even E/L pair. In other words, $(E/L)_{EVEN}$ decisions indicate the sign of the duty-cycle error.



Figure 4.18: Re-configured CDR to reduce the number of lock points.

The block diagram of the data duty-cycle estimator is shown in Fig. 4.19. At start up, the CDR is locked only with the $(E/L)_{ODD}$ decisions of the half-rate bang-bang phase detector. The $(E/L)_{EVEN}$ signals are decimated by 16 and accumulated using a low speed accumulator.



Figure 4.19: Data duty-cycle estimator

The direction of accumulation of the $(E-L)_{EVEN}$ signal is indicative of the sign of the duty-cycle error. Simulation results for locking scenarios 1 and 2 are shown in Fig. 4.20. In the left half of the figure, the $(E-L)_{EVEN}$ signal accumulates with a positive slope, indicative of an always early condition, and in the right half of the figure, the signal accumulates with a negative slope, which is indicative of an always late condition. The slope detector block, implemented as an accumulator followed by a first order differentiation, $1-z^{-1}$, determines the *sign* of the slope of accumulation. The sign of the duty-cycle error, $\widehat{D}_E$, is fed to an accumulator whose output controls the digital-to-delay converters (DDC). Simulation results showing the convergence of the clock-phase calibration algorithm are shown in Fig. 4.21. Circuit implementation details of the DDC are presented in Section 4.6.
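The sign-driven calibration described above can be illustrated with a highly simplified loop model. It is a sketch under several assumptions: the decimation and slope detection are collapsed into a single noiseless sign decision per iteration, the delay cells are ideal and linear with a hypothetical step `t_lsb` per code, the Ib cell is driven with twice the control code of the Q/Qb cells, and the control code is allowed to take either sign.

```python
def calibrate_duty_cycle(alpha, t_lsb=0.002, iterations=200):
    """Return the control code D_C after the sign-driven loop has settled.

    alpha: input data duty-cycle error in UI
    t_lsb: delay added per control code in the Q/Qb cells, in UI (hypothetical)
    """
    d_c = 0
    even_edge = 1.0 + alpha                  # position of the even data edge
    for _ in range(iterations):
        ib_instant = 1.0 + 2 * d_c * t_lsb   # Ib cell receives 2 * D_C
        # Always early gives +1, always late gives -1 (sign of the duty-cycle error)
        sign = 1 if ib_instant < even_edge else -1
        d_c += sign                          # Type-I loop: accumulate the sign
    return d_c


# Hypothetical usage: 0.1 UI duty-cycle error
d_c_final = calibrate_duty_cycle(0.1)
ib_shift = 2 * d_c_final * 0.002             # resulting Ib shift in UI
q_shift = d_c_final * 0.002                  # resulting Q/Qb shift in UI
```

In this idealized model the code dithers by one step around the value at which the Ib shift equals $\alpha$ UI and the Q/Qb shift equals $0.5\alpha$ UI.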



Figure 4.20: Simulation results demonstrating that the slope of accumulation of $(E-L)_{EVEN}$ is indicative of the sign of the input data duty-cycle error when $(E,L)_{EVEN}$ corresponds to (a) an *always early* case and (b) an *always late* case.



Figure 4.21: Convergence of the clock-phase calibration algorithm.

#### 4.6 Circuit Design

The half-rate bang-bang phase detector is implemented using a conventional Alexander phase detector [63]. Improved sense-amplifier flip-flops are used as data and edge samplers [64]. To ease speed requirements in the digital integral path, the early/late samples are decimated by a factor of 16. The decimator is realized as a cascade of 4 decimate-by-2 stages. Each decimate-by-2 stage operates on consecutive early and late samples, performs a *signed* arithmetic subtraction, and truncates the result to a 3-valued signal [-1,0,+1] [60, 65]. All the other digital circuit blocks are fully synthesized using standard cells. The design details of the analog building blocks, namely the digitally controlled oscillator and the linear digital-to-delay conversion cells used in the duty cycle correction loop, are presented next.
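One possible reading of the decimator is sketched below. The exact arithmetic of each stage is not spelled out here, so the model is an interpretation: early and late samples are first combined into a signed value $E-L$, and each decimate-by-2 stage then adds two consecutive signed values and truncates the sum to its sign. All function names are hypothetical.

```python
def sign(value):
    """Truncate an integer to the 3-valued set {-1, 0, +1}."""
    return (value > 0) - (value < 0)


def decimate_by_2(stream):
    """Combine consecutive signed decisions pairwise and keep only the sign."""
    return [sign(stream[i] + stream[i + 1]) for i in range(0, len(stream) - 1, 2)]


def decimate_by_16(early, late):
    """Cascade of four decimate-by-2 stages acting on signed E-L decisions."""
    signed = [e - l for e, l in zip(early, late)]   # signed subtraction per sample
    for _ in range(4):
        signed = decimate_by_2(signed)
    return signed


# Hypothetical usage: 16 consecutive decisions produce one decimated output
all_early = decimate_by_16([1] * 16, [0] * 16)
all_late = decimate_by_16([0] * 16, [1] * 16)
balanced = decimate_by_16([1, 0] * 8, [0, 1] * 8)
```

A run of early decisions maps to +1, a run of late decisions to -1, and a balanced mix to 0, so the low-speed accumulator only sees the net direction of the phase error.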

#### 4.6.1 Digitally Controlled Oscillator (DCO)

The schematic of the DCO is shown in Fig. 4.22. It is composed of a 4-phase current-controlled ring oscillator and has three separate control ports. The FLL, proportional, and integral control words, denoted as $D_{FLL}$, $D_{PROP}$, and $D_{INT}$, respectively, tune the oscillator frequency independently. The two pairs of early/late signals generated by the half-rate bang-bang phase detector constitute $D_{PROP}$ and a 5-level DAC converts it into current. $D_{FLL}$ and $D_{INT}$ are the outputs of the respective accumulators in the FLL and the digital integral paths (see Fig. 4.3). Two 14-bit DACs convert $D_{FLL}$ and $D_{INT}$ into analog voltages which control the oscillator tuning currents. The schematic of the DAC is shown in Fig. 4.23.



Figure 4.22: Digitally controlled oscillator (DCO).



Figure 4.23:  $\Delta\Sigma$  DAC in both FLL and PLL paths.

A second order digital delta-sigma modulator truncates the 14-bit input to 4 bits and drives a thermometer-coded 15-level current-mode DAC (IDAC). The output current is converted into voltage by the load resistor and a second-order low pass post filter suppresses the shaped high frequency quantization error.



Figure 4.24: FLL controlled tuning curve of the DCO.

The simulated FLL-path tuning curve of the DCO is shown in Fig. 4.24. The FLL path is capable of tuning the DCO to frequencies ranging from 90 MHz to 1.7 GHz. After initial frequency acquisition, the PLL integral path accounts for drift in the DCO frequency. Around 1.25 GHz, the PLL integral path is capable of tracking frequency variations of about $\pm 35$ MHz as shown in Fig. 4.25. The DCO is highly susceptible to supply noise, and therefore its supply must be regulated to prevent deterministic jitter degradation of the CDR. In the prototype, a clean supply voltage provided by an external source is used.



Figure 4.25: Digital integral path (in PLL) controlled tuning curve of the DCO.

#### 4.6.2 Linear Digital-to-Delay Conversion

As explained in Section 4.5 and illustrated in Fig. 4.3, the duty-cycle error estimator controls a bank of digital-to-delay conversion cells (DDC) to calibrate the edge and data sampling clock phases. A simple way to implement the DDC is by using a conventional current-starved inverter as shown in Fig. 4.26. Since the delay is inversely proportional to the control current, a DDC implemented by digitally scaling the current exhibits 1/x non-linearity. Two important issues arise from this non-linear control behavior. First, the large DDC gain at smaller inputs makes achieving good resolution challenging. Second, implementing the $0.5\alpha$UI and $\alpha$UI delays needed in the Q/Qb and Ib paths, respectively, by scaling the input digital control word becomes impossible. In view of these drawbacks, we propose a DDC architecture that seeks to eliminate this non-linearity.



Figure 4.26: Conventional digital-to-delay converter (DDC).



Figure 4.27: Linearized digital-to-delay converter (DDC).

Figure 4.27 shows the proposed linear digital-to-delay converter. By prewarping the control current by a 1/x function, the non-linearity is eliminated. To this end, a digitally controlled resistor is used to make the charging current vary inversely with the input digital control word,  $I_C \propto 1/D_C$ , thus linearizing the DDC transfer curve. The delay range can be adjusted by using an appropriate reference voltage,  $V_{\text{REF}}$ , and is chosen to cover up to  $\pm 20\%$  input data duty-cycle error. The simulated delay characteristics of the DDC shown in Fig. 4.28 illustrate the linear behavior and range variation with  $V_{\text{REF}}$ . Compared to the conventional DDC, the proposed DDC exhibits superior linearity without compromising the tunable delay range.
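The effect of prewarping can be sketched with a first-order delay model, $t_d \approx C V / I$. All component values, the fixed offset delay, and the function names below are hypothetical and chosen only to make the trend visible; they are not taken from the prototype.

```python
import math


def conventional_delay(code, c_load=50e-15, v_swing=0.6, i_lsb=2e-6):
    """Current-starved inverter: I = code * i_lsb, so delay varies as 1/code."""
    if code <= 0:
        raise ValueError("code must be positive")
    return c_load * v_swing / (code * i_lsb)


def linearized_delay(code, c_load=50e-15, v_swing=0.6, v_ref=0.6,
                     r_lsb=100.0, t_offset=20e-12):
    """Prewarped DDC: R = code * r_lsb gives I_C = v_ref / R (I_C ~ 1/code).

    The added delay C*V/I_C = C*V*code*r_lsb/v_ref is linear in code.
    It is computed directly so that code = 0 (no added delay) is valid.
    """
    added = c_load * v_swing * code * r_lsb / v_ref
    return t_offset + added


if __name__ == "__main__":
    for code in (1, 2, 4, 8):
        print(code,
              f"conventional: {conventional_delay(code) * 1e12:8.1f} ps",
              f"linearized: {linearized_delay(code) * 1e12:6.1f} ps")
```

In this model the linearized cell has a constant delay step per code, so doubling the control word doubles the added delay, which is the property that allows gain scaling by a simple bit shift. The conventional cell has no such property.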



Figure 4.28: Simulated DDC characteristic.

The schematics of all four delay cells used to phase shift the clocks I/Q/Ib/Qb are shown in Fig. 4.29. Since the CDR locks phase I to the data edge, no calibration is needed for this phase, and thus  $D_C$  is set to zero. As discussed earlier, another requirement of the clock phase calibration circuit is that the gain on clock phases Q/Qb be exactly half that of Ib. Owing to the linear transfer characteristics of the proposed DDC, the desired gain scaling is achieved by bit-shifting the control word to the individual cells. Consequently, the control codes to the Q/Qb and Ib phases are equal to  $D_C$  and  $2D_C$, respectively.



Figure 4.29: Schematic of all the 4 DDCs used to phase shift I/Q/Ib/Qb.



Figure 4.30: Die micrograph.

## 4.7 Measured Results

The prototype digital CDR was implemented in a 1.2V  $0.13\mu$m CMOS process. High-speed differential current-mode logic buffers are used to drive the on-chip recovered half-rate data and clock for measurement purposes. The die micrograph is shown in Fig. 4.30 and the active area is  $0.39 \text{ mm}^2$. The die was packaged in a TQFP48 package and tested using a four-layer printed circuit board. The input PRBS sequences of varying lengths were generated using an arbitrary waveform generator (Tektronix AWG7000). An Agilent E4440 spectrum analyzer and a Tektronix CSA8200 communication signal analyzer are used to analyze the spectral and jitter properties of the recovered clock.



Figure 4.31: Measured PSDs of the first three divide-by-2 stages with 2Gbps random data.

2Gbps random data is fed to the CDR, and the measured power spectral density (PSD) plots at the outputs of the first three divide-by-2 stages are shown in Fig. 4.31. Passing the input data, whose PSD has spectral nulls at all integer multiples of 2GHz, through a divide-by-2 stage causes the PSD to peak at  $\frac{F_B}{8}$, i.e., 250 MHz, and the PSDs of the next two divide-by-2 stages peak at 125 and 62.5 MHz, respectively. This spectral peaking measurement is consistent with the theoretical predictions in the Appendix.



Figure 4.32: Measured FLL offset with different input PRBS pattern.

Figure 4.33: FLL/PLL clock spectra with PRBS10 and PRBS15 input patterns.

The dependence of the proposed frequency detector accuracy on input data transitions is characterized by measuring the frequency error with different input PRBS sequences, and the results are presented in Fig. 4.32. The deviation of the transition density in PRBS patterns from the ideal value of 0.25 causes a frequency offset in the FLL. For a  $2^{N}-1$  long PRBS sequence, the frequency offset is given by

$$\text{Theoretical offset [ppm]} \approx \frac{10^6}{2^N} \tag{4.1}$$

This expression matches closely with the measured frequency offset as shown in Fig. 4.32. Scrambling the input data with a polynomial,  $p(x)=1+x^{18}+x^{23}$, brings the transition density closer to 0.25 and improves the accuracy to better than 100ppm, irrespective of the input sequence periodicity. The measured FLL and PLL clock spectra for PRBS10 and PRBS15 patterns are shown in Fig. 4.33. As discussed previously, the imbalance in PRBS10 and PRBS15 patterns leads to an FLL offset of 1000ppm and 100ppm, respectively.
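As a quick numerical illustration of Eq. (4.1), the following sketch evaluates the theoretical offset for a few PRBS orders. The function name `theoretical_fll_offset_ppm` is introduced here only for illustration, and the values it prints are those of the approximation, not measured data.

```python
def theoretical_fll_offset_ppm(prbs_order):
    """Eq. (4.1): approximate FLL offset in ppm for a (2^N - 1)-long PRBS."""
    return 1e6 / 2 ** prbs_order


if __name__ == "__main__":
    for order in (7, 10, 15):
        offset = theoretical_fll_offset_ppm(order)
        print(f"PRBS{order}: {offset:.1f} ppm")
```

The offset halves for every unit increase in the PRBS order, reflecting the shrinking imbalance of longer sequences.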



Figure 4.34: Edge sampling clock phases I/Ib before and after calibration with 20% duty-cycle error.

The measurement in Fig. 4.34 shows the edge sampling clocks before and after calibration. Because of the 20% input data duty-cycle error, the widths of the odd and even eyes are 600ps and 400ps, respectively. Under this condition, the edge sampling clock, Ib, is nominally at 500ps before calibration. After calibration, the Ib phase is shifted by about 100ps, thus bringing it very close to the edge transition of the smaller eye as desired. The distortion in the clock waveforms is caused by the multiplexer used to bring out multiple high speed signals. The recovered clock jitter for a 2Gbps PRBS10 pattern with 300mVpp differential swing is shown in Fig. 4.35. The long term accumulated jitter of the recovered clock is 5.4ps,rms and 44ps,pk-pk. The measured BER is less than  $10^{-12}$.
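The numbers quoted for the calibration measurement follow from simple timing arithmetic, sketched below. The sketch assumes that the wide (odd) eye comes first after the edge to which phase I is locked and that the data sampling clock should sit at the centre of that eye; the function names are hypothetical.

```python
def eye_widths_ps(data_rate_gbps, duty_cycle_error):
    """Return (UI, wide eye, narrow eye) in ps for a fractional duty-cycle error."""
    ui_ps = 1e3 / data_rate_gbps
    wide = ui_ps * (1 + duty_cycle_error)
    narrow = ui_ps * (1 - duty_cycle_error)
    return ui_ps, wide, narrow


def calibration_shifts_ps(data_rate_gbps, duty_cycle_error):
    """Ideal shifts of the edge clock (Ib) and a data clock (Q) under the
    assumptions stated in the lead-in."""
    ui_ps, wide, _ = eye_widths_ps(data_rate_gbps, duty_cycle_error)
    ib_shift = wide - ui_ps            # move Ib from 1 UI to the real edge
    q_shift = wide / 2 - ui_ps / 2     # move Q from UI/2 to the eye centre
    return ib_shift, q_shift


if __name__ == "__main__":
    print(eye_widths_ps(2.0, 0.2))
    print(calibration_shifts_ps(2.0, 0.2))
```

Under these assumptions the data clock needs half the shift of the edge clock, which is consistent with the factor-of-two relation between the DDC control codes.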



Figure 4.35: Recovered clock jitter with a PRBS10 input pattern.

Table 4.1 shows the performance summary of the CDR. The CDR implemented in a  $0.13\mu$m CMOS process operates with supply voltage ranging from 0.8 to 1.2V. At 0.8V, the achieved data rate range is 0.2 to 1Gbps and at 1.2V it operates from 0.5 to 2.5Gbps. When operating at 2Gbps, the power in the FLL and PLL loops is 1.2 and 4.9mW, respectively. This CDR achieves a figure-of-merit of 3.05 mW/Gbps. Table 4.2 compares the performance of the digital CDR with other state-of-the-art designs in the literature. This work demonstrates a fully integrated, low power digital reference-less CDR with unlimited frequency acquisition range.

| Technology                            | $0.13\,\mu\mathrm{m}$ CMOS |
|---------------------------------------|----------------------------|
| Supply voltage                        | 1.2V                       |
| Data rate @ $V_{DD}$ =1.2V            | 0.5Gbps-2.5Gbps            |
| Data rate @ $V_{DD}$ =0.8V            | 0.2Gbps-1.0Gbps            |
| At 2Gbps                              |                            |
| BER                                   | $< 10^{-12}$               |
| Recovered clock jitter (PRBS10 input) | 5.4ps,rms / 44ps,pk-pk     |
| Power                                 | 6.1mW                      |
| FOM                                   | 3.05mW/Gbps                |
| Die area                              | $0.39\,\mathrm{mm}^2$      |

Table 4.1: Digital CDR Performance Summary

|                             | ISSCC05 [18]          | ISSCC09 [66]          | ISSCC06 [55]        | JSSC06 [56]          | ISSCC07 [67]         | This Work         |
|-----------------------------|-----------------------|-----------------------|---------------------|----------------------|----------------------|-------------------|
| Technology                  | $0.13\,\mu\mathrm{m}$ | 65nm                  | $0.25\,\mu\mathrm{m}$ | $0.25\,\mu\mathrm{m}$ | $0.13\,\mu\mathrm{m}$ | $0.13\,\mu\mathrm{m}$ |
| Supply voltage [V]          | 3.3                   | 1.2                   | 2.5                 | 1.8                  | 1.2                  | 0.8/1.2           |
| Architecture                | Full-rate             | Full-rate             | Full-rate           | Full-rate            | Full-rate            | Half-rate         |
| Filter                      | Analog                | Analog                | Digital             | Analog               | Digital              | Digital           |
| Acquisition                 | Reference-less        | Reference-less        | Reference-less      | Reference-less       | Reference            | Reference-less    |
| Jitter [ps,rms / ps,pk-pk]  | N/A                   | 9.7/53.3              | 1.2/N/A             | 6.4/48.9             | 7.2/47.2             | 5.4/44            |
| Power                       | 775.5mW @ 2.5 Gbps    | 20.6mW @ 0.65 Gbps    | 425mW @ 2.5 Gbps    | 95mW @ 3.125 Gbps    | 13.2mW @ 2.5 Gbps    | 6.1mW @ 2Gbps     |
| Power FOM [mW/Gbps]         | 310.2                 | 31.7                  | 170                 | 30.4                 | 4.72                 | 3.05              |

Table 4.2: Performance Comparison

## Appendix : Derivation of PSD at output of divide-by-2



Figure 4.36: Random NRZ data stream "1100010" generated from impulse stream p(t).

A random NRZ data signal, x(t), can be synthesized by passing a train of impulses, p(n), through an integrator as illustrated in Fig. 4.36. Each impulse takes one of three possible values, +1, -1, and 0, corresponding to positive, negative, and no transitions in the random data. Using the two-sided Laplace transform, the power spectral densities (PSD) of x(t), denoted as  $S_x(\omega)$, and of p(n), denoted as  $S_p(\omega)$, can be shown to be related as follows:

$$S_x(\omega) = \frac{4S_p(\omega)}{\omega^2} \tag{4.2}$$

Using the Wiener-Khinchin theorem [68],  $S_p(\omega)$  can be calculated as,

$$S_p(\omega) = \frac{1}{T_B} \sum_{n=-\infty}^{\infty} R_p[n] e^{-j\omega n T_B}, \text{ where } R_p[n] = \mathrm{E}\left[p(k)p(k-n)\right] \tag{4.3}$$

Because the autocorrelation function of a real valued signal p(n) is an even function  $(R_p[n] = R_p[-n])$ , the above expression can be rewritten as,

$$S_p(\omega) = \frac{1}{T_B} \left( R_p[0] + 2\sum_{n=1}^{\infty} R_p[n] \cos(n\omega T_B) \right) \tag{4.4}$$

Combining Eq. 4.2 and Eq. 4.4 we get the PSD of the NRZ data to be,

$$S_x(\omega) = \frac{4S_p(\omega)}{\omega^2} = \frac{4F_B}{\omega^2} \left( R_p[0] + 2\sum_{n=1}^{\infty} R_p[n] \cos(n\omega T_B) \right) \tag{4.5}$$

Using Eq. (4.5),  $S_x(\omega)$  can be determined by evaluating the autocorrelation function  $R_p[n]$ . Recognizing that p(n) can take only one of three values, -1, 0, +1 with probabilities of  $\frac{1}{4}$ ,  $\frac{1}{2}$ , and  $\frac{1}{4}$ , respectively, the autocorrelation function can be calculated as follows:

$$\begin{aligned} R_p[0] &= \mathrm{E}\left[p(k)p(k)\right] \\ &= (-1 \times -1)P(p = -1) + (0 \times 0)P(p = 0) + (+1 \times +1)P(p = +1) \\ &= \frac{1}{4} + 0 + \frac{1}{4} = \frac{1}{2} \end{aligned}$$
$$\begin{aligned} R_p[1] &= \mathrm{E}\left[p(k)p(k + 1)\right] \\ &= (-1 \times -1)P(p_{k,k+1} = -1, -1) + (-1 \times +1)P(p_{k,k+1} = -1, +1) \\ &+ (+1 \times -1)P(p_{k,k+1} = +1, -1) + (+1 \times +1)P(p_{k,k+1} = +1, +1) \end{aligned}$$

Note that although there are 9 different combinations of consecutive impulse values that affect  $R_p[1]$, the 5 combinations in which either impulse is 0 ([0,+1],[0,0],[0,-1],[+1,0],[-1,0]) contribute nothing. Also,  $P(p_{k,k+1} = -1, -1) = P(p_{k,k+1} = +1, +1) = 0$  for random NRZ data, since two transitions of the same type cannot happen in consecutive bit periods. Thus,

$$R_p[1] = (-1 \times +1)P(p_{k,k+1} = -1, +1) + (+1 \times -1)P(p_{k,k+1} = +1, -1)$$
$$= -(\frac{1}{2} \times \frac{1}{4}) - (\frac{1}{2} \times \frac{1}{4}) = -\frac{1}{4}$$

For  $n \geq 2$, the amplitudes of p(k) and p(k+n) become independent, yielding  $R_p[n] = 0$. Substituting  $R_p[0] = \frac{1}{2}$  and  $R_p[1] = -\frac{1}{4}$  into Eq. (4.5) yields

$$S_x(\omega) = \frac{4F_B}{\omega^2} \left(\frac{1}{2} + \left(2 \times -\frac{1}{4}\right)\cos(\omega T_B)\right) = \frac{4F_B}{\omega^2} \left(\frac{1}{2} - \frac{1}{2}\cos(\omega T_B)\right) = T_B \operatorname{sinc}^2\left(\frac{\omega T_B}{2}\right) \tag{4.6}$$
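The autocorrelation values used in this derivation can be cross-checked by exhaustive enumeration. The sketch below assumes independent, equiprobable NRZ bits with levels $\pm 1$, so that the transition impulse is $p(k) = (x(k) - x(k-1))/2$; the function name `nrz_transition_autocorr` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def nrz_transition_autocorr(lag):
    """Exact R_p[lag] = E[p(k) p(k+lag)] for random NRZ data.

    All bit patterns long enough to contain both impulses are enumerated
    with equal probability, so the result is an exact fraction.
    """
    n_bits = lag + 2
    total = 0
    count = 0
    for bits in product((-1, 1), repeat=n_bits):
        # Transition impulses between consecutive bits: -1, 0 or +1.
        p = [(cur - prev) // 2 for prev, cur in zip(bits, bits[1:])]
        total += p[0] * p[lag]
        count += 1
    return Fraction(total, count)


if __name__ == "__main__":
    for lag in range(4):
        print(f"R_p[{lag}] = {nrz_transition_autocorr(lag)}")
```

The enumeration reproduces $R_p[0] = \frac{1}{2}$, $R_p[1] = -\frac{1}{4}$, and $R_p[n] = 0$ for $n \geq 2$.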

The signal p(n) when passed through a divide-by-2 circuit results in halving the number of  $1\rightarrow 0$  and  $0\rightarrow 1$  transitions. As a result, -1, 0, +1 occur with probabilities of  $\frac{1}{8}$ ,  $\frac{3}{4}$ , and  $\frac{1}{8}$ , respectively. Using the modified probabilities, the autocorrelation and the PSD of the divide-by-2 stage output,  $x_{div2}(t)$ , can be found to be,

$$R_{p}[0] = \frac{1}{4}, \ R_{p}[1] = 0, \ R_{p}[2] = -\frac{1}{16}, \ R_{p}[3] = -\frac{1}{16}, \ R_{p}[4] = -\frac{1}{32} \cdots$$

$$S_{x_{div2}}(\omega) \approx \frac{4F_{B}}{\omega^{2}} \left(\frac{1}{4} - \frac{1}{8}\cos(2\omega T_{B}) - \frac{1}{8}\cos(3\omega T_{B}) - \frac{1}{16}\cos(4\omega T_{B})\right) \tag{4.7}$$

Plotting expressions (4.6) and (4.7) for a 2Gbps random NRZ input stream reveals the peaking at 250MHz and 125MHz at the output of each divide-by-2 stage.
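The autocorrelation values of the divided signal can also be checked by enumeration. The sketch below assumes the divide-by-2 stage is a flip-flop that toggles on every rising edge of the random NRZ input; this triggering choice and the function name `div2_transition_autocorr` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def div2_transition_autocorr(lag):
    """Exact R_p[lag] for the transition impulses at a divide-by-2 output.

    The divider is modeled as a toggle flip-flop clocked by the rising
    edges (0 -> 1) of equiprobable, independent input bits.
    """
    n_bits = lag + 2
    total = 0
    count = 0
    for bits in product((0, 1), repeat=n_bits):
        state = 0
        p = []
        for prev, cur in zip(bits, bits[1:]):
            rising = (prev == 0 and cur == 1)
            new_state = state ^ 1 if rising else state
            p.append(new_state - state)   # +1, -1, or 0 output transition
            state = new_state
        total += p[0] * p[lag]
        count += 1
    return Fraction(total, count)


if __name__ == "__main__":
    for lag in range(5):
        print(f"R_p[{lag}] = {div2_transition_autocorr(lag)}")
```

Under this model the enumeration gives $\frac{1}{4}$, $0$, $-\frac{1}{16}$, $-\frac{1}{16}$, and $-\frac{1}{32}$ for lags 0 to 4, matching the values used in Eq. (4.7).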



Figure 4.37: PSD of the first two divider stages for a 2 Gbps random input data.

# CHAPTER 5. MULTIMODE SOURCE-SYNCHRONOUS RECEIVER

## 5.1 Introduction

High-speed wireline receiver circuits capable of detecting multiple signaling formats offer flexibility in reconfiguring the receiver, based on the channel loss profile. This allows optimization of the system performance with a single-chip multi-mode solution. The choice of signaling format (NRZ, duobinary and PAM4) is mainly dictated by the channel loss, peak transmitter swing limitation and receiver complexity [69]. Even though analog-to-digital converter (ADC) based receivers are capable of addressing the requirements of a multi-mode receiver at a fixed data rate, they are extremely power hungry [70, 71].

In this chapter, we present a source-synchronous receiver architecture that uses a reconfigurable slicer bank (of only six slicers) to address the ensemble requirements of the NRZ, RZ, duobinary, and PAM4 signaling formats. Using decisions from the slicer bank, the receiver is capable of symbol-rate phase recovery, thus obviating the need for the additional edge samplers required in conventional oversampled architectures. The prototype multi-mode receiver employs a low power, high bandwidth phase rotating PLL (PR-PLL) and achieves 2.8mW/Gbps power efficiency at 3.2Gbps data rate across all the signaling formats.

## 5.2 Proposed Architecture



Figure 5.1: Architecture of multi-mode receiver.

Fig. 5.1 shows the multi-mode receiver architecture. It consists of a bank of 6 slicers, a data and phase decoder followed by the de-skew logic that controls the PR-PLL. The PLL provides the sampling clocks,  $CK_a$  and  $CK_b$ , to the slicer bank. The data decoder recovers the data from the raw slicer decisions and the symbol-rate phase decoder determines the sampling clock position within the data eye and generates early/late decisions that drive PR-PLL output to the center of the eye.

## 5.3 Implementation Details

#### 5.3.1 Reconfigurable slicer bank : Threshold and clocking

Fig. 5.2 shows the position of slicer thresholds, S1 to S4, in each signaling mode for resolving the data samples. With NRZ, duobinary and RZ signaling formats, the receiver operates in half-rate mode and with PAM4 inputs, it operates at full-rate. Data decoding required for duobinary and PAM4 formats is implemented using CMOS logic.

#### 5.3.2 Symbol-rate phase recovery

Slicer decisions at the waveform peaks (denoted as ES0 and ES1), available from slicers S5 and S6 in the slicer bank, are employed for symbol-rate phase detection. The critical transition used to detect phase information irrespective of the signaling format is 011 [70]. As shown in Fig. 5.3, the sampling clock phase is early when ES0=0 and ES1=1. Similarly, the sampling clock phase is late when ES0=1 and ES1=0. Using a qualifier signal defined as two adjacent data samples D0 and D1 being equal to 1 (indicating an 11 occurrence), the phase error (-1,0,+1) is given by the equations shown in Fig. 5.3. In NRZ, duobinary and PAM4 modes of operation, the same phase decoding logic is used and only the data samples provided to the phase decoder are varied using a multiplexer. The lack of a 011 transition renders this phase decoding logic inappropriate in the RZ mode of operation and hence the PR-PLL de-skew control word is set externally when receiving RZ data.
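The decoding rule described above can be written as a small truth-table function. This is a sketch based only on the textual description: the sign convention (+1 for early, -1 for late) and the function name `phase_error` are assumptions, since the exact equations appear only in the figure.

```python
def phase_error(d0, d1, es0, es1):
    """Symbol-rate phase decision from two data and two peak samples.

    Returns +1 when the sampling clock is early, -1 when it is late and
    0 when no phase information is available (assumed sign convention).
    """
    qualifier = (d0 == 1 and d1 == 1)   # an "11" occurrence in the data
    if not qualifier:
        return 0
    if es0 == 0 and es1 == 1:
        return +1                       # early
    if es0 == 1 and es1 == 0:
        return -1                       # late
    return 0                            # peak samples agree: no decision


if __name__ == "__main__":
    for samples in ((1, 1, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1), (0, 1, 0, 1)):
        print(samples, phase_error(*samples))
```

Because the decision needs only the samples already produced by the slicer bank, no additional edge samplers are involved in this scheme.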



Figure 5.2: Signal eye diagrams and slicer threshold and clocking arrangement in each mode.



Figure 5.3: Symbol rate phase recovery.

#### 5.3.3 Low power phase rotating PLL

Fig. 5.4 shows the block diagram of the XOR-based PR-PLL. The PLL achieves phase shifting by interpolating between two adjacent VCO phases based on the input digital control word, DC. Compared to a conventional I/Q phase interpolator [27], this topology does not need a slew rate controller to achieve good linearity in the phase steps. In contrast to a conventional 3-state PFD, XOR phase detectors can operate at very high frequencies, which allows a very wide PLL bandwidth, limited only by the forwarded clock frequency. Therefore, the PLL bandwidth can be optimized to filter high frequency jitter components in the forwarded clock, and a low bandwidth de-skew loop can be used to filter the data jitter and minimize dithering jitter due to loop latency. Because the wide PLL bandwidth suppresses the VCO phase noise, the oscillator power can be greatly reduced. As a result, the power dissipation in a PR-PLL is dominated by the power consumed in the XOR phase detectors and the voltage-to-current (V-I) converter needed to drive the passive loop filter.

Figure 5.4: Low power phase rotating PLL.

In [72], the current-mode logic (CML) used to implement the XOR gates and the high-frequency V-to-I converter greatly increased the PR-PLL power and degraded the phase noise. Leveraging the fact that phase interpolation accuracy is independent of slew-rate, CMOS XOR gates are combined with V-to-I converters to reduce power, as shown in Fig. 5.4. For further power savings, the number of XOR phase detectors is also reduced by using a quadrant-select multiplexer at the output of the VCO, which also reduces the power in the clock distribution network. The quadrant select multiplexer, controlled by the two most significant bits (MSBs) of the phase shift control word, selects two adjacent clock phases,  $\phi_1$  and  $\phi_2$, corresponding to the quadrant in which phase interpolation occurs. The four least significant bits (LSBs) vary currents  $I_1$  and  $I_2$  in each of the two XOR phase detectors, such that the total current flowing out of the phase detectors is always constant (Fig. 5.4). This segmented approach of using a quadrant multiplexer and only two phase detectors in implementing a PR-PLL is easily scalable when dealing with a larger number of VCO phases to achieve better phase resolution in the PR-PLL.
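The segmented control described above can be sketched as a decoder for the 6-bit phase word. The sketch assumes four VCO phases spaced 90 degrees apart, unit-weighted currents, and ideal linear interpolation between the two selected phases; these idealizations and the name `decode_phase_word` are not taken from the implementation.

```python
def decode_phase_word(word, n_lsb=4, n_quadrants=4):
    """Split a phase control word into quadrant select and current weights.

    Returns (quadrant, i1, i2, phase_deg), where i1 and i2 are the unit
    current weights of the two XOR phase detectors and phase_deg is the
    ideal interpolated output phase.
    """
    n_steps = 1 << n_lsb
    if not 0 <= word < n_quadrants * n_steps:
        raise ValueError("control word out of range")
    quadrant = word >> n_lsb            # MSBs pick the adjacent phase pair
    frac = word & (n_steps - 1)         # LSBs set the interpolation weight
    i1 = n_steps - frac                 # weight on phase phi_1
    i2 = frac                           # weight on phase phi_2
    span_deg = 360.0 / n_quadrants      # spacing between adjacent phases
    phase_deg = (quadrant + frac / n_steps) * span_deg
    return quadrant, i1, i2, phase_deg


if __name__ == "__main__":
    for word in (0, 1, 15, 16, 63):
        print(word, decode_phase_word(word))
```

In this idealized model the sum of the two weights is constant for every code, and the 64 codes map to uniformly spaced phases across one clock period.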

## 5.4 Measured results

A prototype multi-mode receiver implemented in a  $0.13\mu$ m CMOS technology was packaged in TQFP48 and operates from 2Gbps to 3.2Gbps with a supply voltage of 1.2V. The active die area is 0.17mm<sup>2</sup>. The die photo is shown in Fig. 5.5.



Figure 5.5: Die Micrograph.



Figure 5.6: PLL jitter and phase noise at 1.6GHz.

Figs. 5.6 and 5.7 illustrate the stand-alone characterization results of the XOR PR-PLL. The measured long term accumulated jitter of the PR-PLL is  $1.1 \text{ps}_{\text{rms}}$  and  $8.96 \text{ps}_{\text{pk-pk}}$  over 100k hits. The random jitter obtained by integrating the phase noise spectrum from 10kHz to 1GHz is  $714 \text{fs}_{\text{rms}}$  (indicating significant trigger jitter contribution in the long term accumulated jitter measurement). An overlay of the 64 phase steps of the PR-PLL is also shown in Fig. 5.7; the measured DNL is less than  $\pm 0.6 \text{ LSB}$  and the INL is less than  $\pm 2.2 \text{ LSB}$.



Figure 5.7: 64 discrete phase steps in the phase rotating PLL.

The receiver is characterized in all four signaling formats (NRZ, duobinary, PAM4 and RZ) at 3.2Gbps using a PRBS 2<sup>7</sup>-1 pattern with a fixed peak-to-peak transmit swing of 240mVppd. The power dissipation is 8.9mW (excluding output buffers) independent of the input data format, yielding a power figure of merit of 2.8mW/Gbps. The bit-error-rate (BER) of the receiver was measured by passing the transmit data through 3 different channels: a 2" on-board FR4 trace (short), a 10" FR4 channel (medium, -6.5dB loss at 1.6GHz) and a 23" FR4 channel (long, -10dB loss at 1.6GHz). The eye diagrams at the receive end of the channels are shown in Fig. 5.8. Despite operating at half the baud-rate with multilevel signaling modes (duobinary and PAM4), loss of voltage margin at the receive slicer degrades the BER of these signaling modes in the medium and long channel cases.

Figure 5.8: Measured receive eye diagrams at 3.2Gbps while transmitting on the medium and long FR4 traces.

While transmitting in RZ mode on the medium and long channels, pulse spreading causes increased eye opening at the receive slicer input, indicating that the RZ format is less susceptible to intersymbol interference (ISI). Fig. 5.9 shows the measured BER bathtub curves for the receiver with all four signaling formats using the short, medium, and long channels. The BER bathtub curves show that the receiver is capable of operating with BER  $< 10^{-12}$  in the NRZ and RZ formats with all three channels, whereas the performance of the multilevel signaling schemes degrades rapidly with lossy channels. The performance of the receiver is tabulated in Table 5.1.



Figure 5.9: Measured BER bathtub curves at 3.2Gbps with all four signaling formats on the short, medium, and long channels.

| Technology                   | $0.13\,\mu\mathrm{m}$ CMOS        |
|------------------------------|-----------------------------------|
| Supply voltage               | 1.2V                              |
| Data rate @ $V_{DD}$ =1.2V   | 2-4Gbps                           |
| Clocking                     | Source synchronous, XOR PLL based |
| Deskew loop                  | Internal                          |
| Supported signalling formats | NRZ/RZ/Duobinary/PAM4             |
| FOM at 3.2Gbps               | $2.8 \mathrm{mW/Gbps}$            |
| PLL FOM                      | $1.95 \mathrm{mW/GHz}$            |
| Die area                     | $0.17 \mathrm{mm}^2$              |

Table 5.1: Multimode receiver performance summary

## 6.1 Conclusions

This work explored design techniques that leverage digital implementation of conventional analog circuits to address the requirements of high speed serial transceivers capable of operating over a wide range of data rates with a stringent power budget (mW/Gbps) while being fully integrated.

In Chapter 3, *current recycling* was used as a means to reduce power dissipation in the clock generation blocks of a serial transceiver. Current recycling was achieved by stacking the highly digital transmitter phase-locked loop and receiver frequency-locked loop across a 1.2V supply.

In Chapter 4, a referenceless frequency acquisition technique for clock and data recovery circuits was introduced. Employing this technique in CDRs obviates the need for an external crystal oscillator thereby reducing external component cost. The frequency detector is capable of extracting a reference tone from the incoming random data. The extracted reference tone is used for frequency locking the oscillator in the CDR. The novel frequency detector has very low power dissipation and unlimited acquisition range.

In Chapter 5, a source synchronous receiver capable of resolving multiple signalling formats was presented. This receiver employs an optimal, reconfigurable slicer bank to resolve the NRZ, RZ, duobinary and PAM4 signalling formats. Phase locking in the receiver was achieved using a very low power XOR based phase rotating PLL.

## 6.2 Future work

In the serial transceiver chip, a combination of current recycling and dynamic voltage scaling based on the operating data rate will help scale power dissipation aggressively at very low data rates. Further, extending the serial transceiver design with equalization on both the transmitter and the receiver would improve signalling capability over high-loss channels.

In the referenceless CDR prototype, an extended data rate range towards the lower end can be achieved by using a combination of programmable dividers after the oscillator, thereby reducing the burden of supporting the entire frequency range.

In the multi-mode receiver prototype, external voltage thresholds have been used for the reconfigurable slicer bank. Implementing an on-chip threshold adjustment circuit coupled with an offset cancellation mechanism for the slicer bank would make the receiver more robust and improve testability.

#### BIBLIOGRAPHY

- [1] D. Park, W. Lee, S. Jeon, and S.H. Cho, "A 2.5-GHz 860µW charge-recycling fractional-N frequency synthesizer in 130nm CMOS," in *IEEE Symposium on VLSI Circuits*, 2008, pp. 88–89.
- [2] M. Horowitz, C.K.K. Yang, and S. Sidiropoulos, "High-Speed Electrical Signaling: Overview and Limitations," *IEEE Micro*, vol. 18, no. 1, pp. 12– 24, 1998.
- [3] B. Razavi, *Design of Integrated Circuits For Optical Communications*, McGraw-Hill Science/Engineering/Math, 2003.
- [4] M.J.E. Lee, WJ Dally, R. Farjad-Rad, H.T. Ng, R. Senthinathan, J. Edmondson, and J. Poulton, "CMOS high-speed I/O's - present and future," in *International Conference on Computer Design*, 2003, pp. 454–461.
- [5] P.K. Hanumolu, G. Wei, and U. Moon, "Equalizers for high-speed serial links," *International Journal of High Speed Electronics and Systems*, vol. 15, no. 2, pp. 429–458, 2005.
- [6] P. Heydari and R. Mohanavelu, "Design of ultra high-speed low-voltage CMOS CML buffers and latches," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 12, no. 10, pp. 1081–1093, 2004.
- [7] H. Hatamkhani, K.L.J. Wong, R. Drost, and C.K.K. Yang, "A 10-mW 3.6 Gbps I/O transmitter," in *IEEE Symposium on VLSI Circuits*, 2003, pp. 97–98.
- [8] V. Stojanovic and VG Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 4, pp. 536–548, 1999.
- [9] KS Oh, F. Lambrecht, S. Chang, Q. Lin, J. Ren, C. Yuan, J. Zerbe, and V. Stojanovic, "Accurate System Voltage and Timing Margin Simulation in High-Speed I/O System Designs," *IEEE Transactions on Advanced Packaging*, vol. 31, no. 4, pp. 722–730, 2008.
- [10] A. Fiedler, R. Mactaggart, J. Welch, and S. Krishnan, "A 1.0625 Gbps Transceiver with 2X-Oversampling and Transmit Signal Pre-Emphasis," in *IEEE International Solid-State Circuits Conference*, 1997, pp. 238–239.
- [11] C.K.K. Yang and M.A. Horowitz, "A 0.8 μm CMOS 2.5 Gb/s Oversampling Receiver and Transmitter for Serial Links," *IEEE Journal Of Solid-State Circuits*, vol. 31, no. 12, pp. 2015, 1996.

- [12] TH Hu and PR Gray, "A monolithic 480 Mb/s parallel AGC/decision/clock-recovery circuit in 1.2-μm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 28, no. 12, pp. 1314–1320, 1993.
- [13] TH Lee, KS Donnelly, JTC Ho, J. Zerbe, MG Johnson, and T Ishikawa, "A 2.5 V CMOS delay-locked loop for 18 Mbit, 500 megabyte/s DRAM," *IEEE Journal of Solid-State Circuits*, vol. 29, no. 12, pp. 1491–1496, 1994.
- [14] G. Balamurugan, J. Kennedy, G. Banerjee, JE Jaussi, M. Mansuri, F. O'Mahony, B. Casper, and R. Mooney, "A scalable 5–15 Gbps, 14–75 mW low-power I/O transceiver in 65 nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 4, pp. 1010–1019, 2008.
- [15] B. Casper and F. O'Mahony, "Clocking Analysis, Implementation and Measurement Techniques for High-Speed Data Links - A Tutorial," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 56, no. 1, pp. 17–39, 2009.
- [16] L. DeVito, J. Newton, R. Croughwell, J. Bulzacchelli, and F. Benkley, "A 52MHz and 155MHz clock-recovery PLL," in *IEEE International Solid-State Circuits Conference*, 1991, pp. 142–306.
- [17] B. Razavi and J. Sung, "A 2.5-Gb/sec 15-mW BiCMOS Clock: Recovery Circuit," in *IEEE Symposium on VLSI Circuits*, 1995, p. 83.
- [18] D. Dalton, K. Chai, E. Evans, M. Ferriss, D. Hitchcox, P. Murray, S. Selvanayagam, P. Shepherd, and L. DeVito, "A 12.5-Mb/s to 2.7-Gb/s continuous-rate CDR with automatic frequency acquisition and data-rate readback," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 12, pp. 2713– 2725, 2005.
- [19] R. Farjad-Rad, C.K.K. Yang, M Horowitz, and TH Lee, "A 0.4-µm CMOS 10-Gb/s 4-PAM pre-emphasis serial link transmitter," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 5, pp. 580–585, 1999.
- [20] JL Zerbe, CW Werner, V. Stojanovic, F. Chen, J. Wei, G. Tsang, D. Kim, WF Stonecypher, A. Ho, and TP Thrush, "Equalization and clock recovery for a 2.5-10-Gb/s 2-PAM/4-PAM backplane transceiver cell," *IEEE Journal of Solid-State Circuits*, vol. 38, no. 12, pp. 2121–2130, 2003.
- [21] R. Mooney, C. Dike, and S. Borkar, "A 900 Mb/s bidirectional signaling scheme," *IEEE Journal of Solid-State Circuits*, vol. 30, no. 12, pp. 1538– 1543, 1995.

- [22] J. Sonntag and R. Leonowich, "A monolithic CMOS 10 MHz DPLL for burst-mode data retiming," in *IEEE International Solid-State Circuits Conference*, 1990, pp. 194–195.
- [23] M. Horowitz, A. Chan, J. Cobrunson, J. Gasbarro, T. Lee, W. Leung, W. Richardson, T. Thrush, and Y. Fujii, "PLL design for a 500 MB/s interface," in *IEEE International Solid-State Circuits Conference*, 1993, pp. 160–161.
- [24] S. Sidiropoulos and MA Horowitz, "A semidigital dual delay-locked loop," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 11, pp. 1683–1692, 1997.
- [25] P. Larsson, "A 2-1600 MHz 1.2-2.5 V CMOS clock-recovery PLL with feedback phase-selection and averaging phase-interpolation for jitter reduction," in *IEEE International Solid-State Circuits Conference*, 1999, pp. 356–357.
- [26] P. Hanumolu, G.Y. Wei, and U.K. Moon, "A wide tracking range 0.2–4 Gbps clock and data recovery circuit," in *IEEE Symposium on VLSI Circuits*, 2006, pp. 88–89.
- [27] JF Bulzacchelli, M. Meghelli, SV Rylov, W. Rhee, AV Rylyakov, HA Ainspan, BD Parker, MP Beakes, A. Chung, TJ Beukema, et al., "A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 12, pp. 2885–2900, 2006.
- [28] J. Poulton, R. Palmer, AM Fuller, T. Greer, J. Eyles, WJ Dally, and M Horowitz, "A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 42, no. 12, pp. 2745–2757, 2007.
- [29] D. Harris, R. Ho, G.Y. Wei, and M. Horowitz, "The fanout-of-4 inverter delay metric," http://www3.hmc.edu/~harris/research/FO4.pdf.
- [30] J. Lee and B. Razavi, "A 40-Gb/s clock and data recovery circuit in 0.18 μm CMOS technology," *IEEE Journal of Solid-State Circuits*, vol. 38, no. 12, pp. 2181–2190, 2003.
- [31] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleitner, M. Kossel, T. Morf, J. Weiss, and M Schmatz, "A 72mW 0.03 mm<sup>2</sup> Inductorless 40 Gb/s CDR in 65nm SOI CMOS," in *IEEE International Solid-State Circuits Conference*, 2007, pp. 226–598.
- [32] N. Nedovic, N. Tzartzanis, H. Tamura, F. Rotella, M. Wiklund, Y. Mizutani, Y. Okaniwa, T. Kuroda, J. Ogawa, and W. Walker, "A 40-to-44Gb/s 3× Oversampling CMOS CDR/1: 16 DEMUX," in *IEEE International Solid-State Circuits Conference*, 2007, pp. 224–598.

- [33] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger, and H. Jackel, "A 25-Gb/s CDR in 90-nm CMOS for High-Density Interconnects," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 12, pp. 2921–2929, 2006.
- [34] L.C. Cho, C. Lee, and S.I. Liu, "A 33.6-to-33.8 Gb/s Burst-Mode CDR in 90nm CMOS," in *IEEE International Solid-State Circuits Conference*, 2007, pp. 48–586.
- [35] C. Menolfi, T. Toifl, R. Reutemann, M. Ruegg, P. Buchmann, M. Kossel, T. Morf, and M. Schmatz, "A 25Gb/s PAM4 transmitter in 90nm CMOS SOI," in *IEEE International Solid-State Circuits Conference*, 2005, pp. 72– 73.
- [36] Y. Amamiya, S. Kaeriyama, H. Noguchi, Z. Yamazaki, T. Yamase, K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H. Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka, A. Tanabe, S. Fujita, and N. Kawahara, "A 40 Gb/s multi-data-rate CMOS transceiver chipset with SFI-5 interface for optical transmission systems," in *IEEE International Solid-State Circuits Conference*, 2009, pp. 358–359.
- [37] RH Dennard, FH Gaensslen, H.N. Yu, VL Rideout, E. Bassous, and AR LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," *Proceedings of the IEEE*, vol. 87, no. 4, pp. 668–678, 1974.
- [38] T.C. Chen, "Where CMOS is going: trendy hype vs. real technology," in *IEEE International Solid-State Circuits Conference*, 2006, pp. 1–18.
- [39] K.L.J. Wong, M. Mansuri, H. Hatamkhani, and C.K.K.Y. Yang, "A 27-mW 3.6-Gb/s I/O transceiver," in *IEEE Symposium on VLSI Circuits*, 2003, pp. 99–102.
- [40] S. Palermo, A. Emami-Neyestanak, and M. Horowitz, "A 90 nm CMOS 16 Gb/s transceiver for optical interconnects," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 5, p. 1235, 2008.
- [41] R. Palmer, J. Poulton, B. Leibowitz, Y. Frans, S. Li, A. Fuller, J. Eyles, J. Wilson, M. Aleksic, T. Greer, M. Bucher, and N. Nguyen, "A 4.3 GB/s Mobile Memory Interface With Power-Efficient Bandwidth Scaling," in *IEEE Symposium on VLSI Circuits*, 2009, pp. 136–137.
- [42] H. Wang, C. Lee, A. Lee, and J. Lee, "A 21-Gb/s 87-mW Transceiver with FFE/DFE/Linear Equalizer in 65-nm CMOS Technology," in *IEEE Symposium on VLSI Circuits*, 2009, pp. 50–51.

- [43] J. Seo, D. Sylvester, and D. Blaauw, "Crosstalk-Aware PWM-Based On-Chip Global Signaling in 65nm CMOS," in *IEEE Symposium on VLSI Circuits*, 2009, pp. 88–89.
- [44] S. Joshi, J. Liao, Y. Fan, S. Hyvonen, M. Nagarajan, J. Rizk, H. Lee, and I. Young, "A 12-Gb/s Transceiver in 32-nm Bulk CMOS," in *IEEE Symposium on VLSI Circuits*, 2009, pp. 52–53.
- [45] T. Takahashi, T. Muto, Y. Shirai, F. Shirotori, Y. Takada, A. Yamagiwa, A. Nishida, A. Hotta, and T. Kiyuna, "A 110-GB/s Simultaneous Bidirectional Transceiver Logic Synchronized with a System Clock," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 11, pp. 1526–1533, 1999.
- [46] H. Johnson, "Multi-Level Signaling," DesignCon, 2000.
- [47] D. Kam, "Multi-level Signaling in High-density, High-speed Electrical Links," DesignCon, 2008.
- [48] J. Kim and M. Horowitz, "Adaptive supply serial links with sub-1-V operation and per-pin clock recovery," *IEEE Journal of Solid-State Circuits*, vol. 37, no. 11, pp. 1403–1413, 2002.
- [49] S. Rajapandian, Z. Xu, and K.L. Shepard, "Implicit DC-DC downconversion through charge-recycling," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 4, pp. 846–852, 2005.
- [50] R. Inti, A. Elshazly, B. Young, W. Yin, M. Kossel, T. Toifl, and P. Hanumolu, "A highly digital 0.5-to-4Gb/s 1.9mW/Gb/s Serial-link Transceiver using current-recycling in 90nm CMOS," in *IEEE International Solid-State Circuits Conference*, 2011, pp. 152–153.
- [51] D. Messerschmitt, "Frequency detectors for PLL acquisition in timing and carrier recovery," *IEEE Transactions on Communications*, vol. 27, no. 9, pp. 1288–1295, 1979.
- [52] H. Ransijn and P. O'Connor, "A PLL-based 2.5-Gb/s GaAs clock and data regenerator IC," *IEEE Journal of Solid-State Circuits*, vol. 26, no. 10, pp. 1345–1353, 1991.
- [53] A. Pottbacker, J. Langmann, and H. Schreiber, "A Si bipolar phase and frequency detector IC for clock extraction up to 8 Gb/s," *IEEE Journal of Solid-State Circuits*, vol. 27, no. 12, pp. 1747–1751, 1992.
- [54] B. Razavi, "A 2.5-Gb/s 15-mW clock recovery circuit," *IEEE Journal of Solid-State Circuits*, vol. 31, no. 4, pp. 472–480, 1996.

- [55] M. Perrott, Y. Huang, R. Baird, B. Garlepp, L. Zhang, and J. Hein, "A 2.5 Gb/s multi-rate 0.25μm CMOS CDR utilizing a hybrid analog/digital loop filter," in *IEEE International Solid-State Circuits Conference*, 2006, pp. 328–329.
- [56] R. Yang, K. Chao, S. Hwu, C. Liang, and S. Liu, "A 155.52 Mbps to 3.125 Gbps continuous-rate clock and data recovery circuit," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 6, pp. 1380–1390, 2006.
- [57] J. Lee and K. Wu, "A 20Gb/s full-rate linear CDR circuit with automatic frequency acquisition," in *IEEE International Solid-State Circuits Conference*, 2009, pp. 336–337.
- [58] R. Inti, W. Yin, A. Elshazly, and P. Hanumolu, "A 0.5-to-2.5 Gb/s referenceless half-rate digital CDR with unlimited frequency acquisition range and improved input duty-cycle error tolerance," in *IEEE International Solid-State Circuits Conference*, 2011, pp. 228–229.
- [59] V. Kratyuk, P. Hanumolu, U. Moon, and K. Mayaram, "A design procedure for all-digital phase-locked loops based on a charge-pump phase-locked-loop analogy," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 54, no. 3, pp. 247–251, 2007.
- [60] P. Hanumolu, M. Kim, G. Wei, and U. Moon, "A 1.6 Gbps digital clock and data recovery circuit," in *IEEE Custom Integrated Circuits Conference*, 2006, pp. 603–606.
- [61] R. Schreier and G.C. Temes, *Understanding Delta-Sigma Data Converters*, IEEE Press, New Jersey, 2005.
- [62] W. Yin, R. Inti, A. Elshazly, B. Young, and P. Hanumolu, "A 0.7-to-3.5 GHz 0.6-to-2.8 mW Highly Digital Phase-Locked Loop With Bandwidth Tracking," *IEEE Journal of Solid-State Circuits*, vol. 99, no. 8, pp. 1–11, 2011.
- [63] J. Alexander, "Clock recovery from random binary signals," *Electronics Letters*, vol. 41, no. 10, pp. 541–542, 1975.
- [64] B. Nikolic, V. Oklobdzija, V. Stojanovic, W. Jia, J. Chiu, and M. Leung, "Improved sense-amplifier-based flip-flop: design and measurements," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 6, pp. 876–884, 2000.
- [65] J. Sonntag and J. Stonick, "A digital clock and data recovery architecture for multi-gigabit/s binary links," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 12, pp. 1867–1875, 2006.

- [66] S. Lee, Y. Kim, H. Ha, Y. Seo, H. Park, and J. Sim, "A 650Mb/s-to-8Gb/s referenceless CDR circuit with automatic acquisition of data rate," in *IEEE International Solid-State Circuits Conference*, 2009, pp. 184–185.
- [67] D.H. Oh, D.S. Kim, S. Kim, D.K. Jeong, and W. Kim, "A 2.8 Gb/s all-digital CDR with a 10b monotonic DCO," in *IEEE International Solid-State Circuits Conference*, 2007, pp. 222–223.
- [68] B.P. Lathi, *Modern Digital and Analog Communication Systems*, Oxford University Press, 1995.
- [69] G. Balamurugan, F. O'Mahony, M. Mansuri, J.E. Jaussi, J.T. Kennedy, and B. Casper, "A 5-to-25Gb/s 1.6-to-3.8 mW/(Gb/s) reconfigurable transceiver in 45nm CMOS," in *IEEE International Solid-State Circuits Conference*, 2010, p. 373.
- [70] M. Harwood, N. Warke, R. Simpson, T. Leslie, A. Amerasekera, S. Batty, D. Colman, E. Carr, V. Gopinathan, S. Hubbins, et al., "A 12.5Gb/s SerDes in 65nm CMOS using a baud-rate ADC with digital receiver equalization and clock recovery," in *IEEE International Solid-State Circuits Conference*, 2007, pp. 436–437.
- [71] J. Cao, B. Zhang, U. Singh, D. Cui, A. Vasani, A. Garg, W. Zhang, N. Kocaman, D. Pi, B. Raghavan, et al., "A 500mW digitally calibrated AFE in 65nm CMOS for 10Gb/s Serial links over backplane and multimode fiber," in *IEEE International Solid-State Circuits Conference*, 2009, pp. 370–371.
- [72] T. Toifl, C. Menolfi, P. Buchmann, M. Kossel, T. Morf, R. Reutemann, M. Ruegg, M.L. Schmatz, and J. Weiss, "A 0.94-ps-rms-jitter 0.016-mm<sup>2</sup> 2.5GHz multiphase generator PLL with 360° digitally programmable phase shift for 10-Gb/s serial links," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 12, pp. 270–282, 2005.