#### AN ABSTRACT OF THE DISSERTATION OF

<u>Rui Bai</u> for the degree of <u>Doctor of Philosophy</u> in <u>Electrical and Computer</u> Engineering presented on <u>May 28, 2014</u>.

Title: Design Techniques for Low-Power Electrical and Optical Serial Link Receivers.

Abstract approved:

Patrick Y. Chiang

As computation power continues to grow, the demand for data transfer bandwidth is also rising. This is reflected in the increasing data-rate of high-speed links. However, the increase in data-rate is sustainable only if the I/O energy efficiency improves as well. This dissertation explores several techniques to enable high-speed links with low power consumption.

First, a serial link receiver with scalable supply voltage for different data-rates for optimum energy efficiency is presented. Low-voltage operation is proven to be an effective way to reduce power consumption, but it has not been widely adopted in high-speed link design due to associated design challenges. The proposed receiver uses an injection-locked ring oscillator (ILRO) for low-power clock recovery and deskewing with wide jitter-tracking bandwidth.

Optical links have become increasingly attractive due to their potential to deliver high aggregate bandwidth over longer distances than electrical links. The next design applies the architecture presented previously to an optical receiver in a wavelength-division multiplexed (WDM) link. Per-channel adaptation is built into the front-end transimpedance amplifier (TIA), which usually accounts for the highest power consumption, to enable energy optimization in the presence of prevalent variation. Built-in monitoring and control circuits facilitate automatic adaptation of the link.

Lastly, a low-power decision-feedback equalizer (DFE) using a charge-based latch is presented. Designing an equalizer for low-voltage links can be particularly challenging because it usually has the highest bandwidth among all components. The proposed DFE with a charge-based latch retains the low power consumption of a dynamic latch while achieving speed comparable to that of power-hungry current-mode logic (CML) circuits.

©Copyright by Rui Bai

May 28, 2014

All Rights Reserved

## Design Techniques for Low-Power Electrical and Optical Serial Link Receivers

by Rui Bai

### A DISSERTATION

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Presented May 28, 2014

Commencement June 2014

Doctor of Philosophy dissertation of Rui Bai presented on May 28, 2014

APPROVED:

Major Professor, representing Electrical and Computer Engineering

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

Rui Bai, Author

#### ACKNOWLEDGEMENTS

The path leading to a PhD is long and winding. Without the help and influence of numerous people, I would have ended up in a very different place.

I want to first thank my advisor, Professor Patrick Chiang. He encouraged me to jump right into my research instead of being intimidated by the challenge. It took some time for me to get used to his enthusiastic and out-of-the-box way of thinking at first, but after a while I found it to be always pleasant and inspiring. I'm grateful for having him as a mentor and a friend.

I thank Prof. Moon, Prof. Natarajan, Prof. Nguyen and Prof. Leonard Coop for finding time in their busy schedules to serve on my final committee. Their questions and suggestions greatly helped me in shaping this dissertation and my final defense.

I thank Prof. Palermo at Texas A & M for his guidance on our collaborative project. I am grateful for the help and mentoring of Tom Gray, John Poulton and all the other guys at the NVIDIA Circuit Research Group during my internship. I thank Marco Fiorentino and Chin-Hui Chen at HP Labs and John Calvin at Tektronix for the help and support that made the measurement of my chips possible.

I'm very lucky to get to know many brilliant students. I had a great time working, learning and having fun with them. I'd like to thank Kangmin Hu, Tao Jiang, Jingguang Wang, Changhui Hu, Jacob Postman, Jiao Cheng, Joe Crop, Robert Pawlowski, Lingli Xia, Chao Ma, Ben Goska, Ryan Albright, Tom Ruggeri, Hao Li, Vahid Behravan, Neil Glover, Nan Qi, Liqiong Yang, Shuai Chen from the VLSI research group; Xin Meng, Tao Tong, Jiaming Lin, Jingzhou Cao, Tao Wang, Wei Li, Hurst Kuo, Yi Zhang, Tao He from Prof. Temes' group; Yue Hu, Hari Venkatram, Yang Xu from Prof. Moon's group; Rajesh Inti, Amr Elshazly, Guanghua Shu, Mrumnay Talegaonkar, Tejasvi Anand from Prof. Hanumolu's group; Chao Shi from Prof. Fiez's group; Yao Liu and Jian Kang from Prof. Natarajan's group; Wei Liu, Ruiqing Ye, Jin Wang, Weiting Chen from Prof. Liu's group; Younghoon Song, Cheng Li from Prof. Palermo's group at Texas A & M; Yue Lu from Prof. Alon's group at UC Berkeley. The list goes on.

My parents never stop being the most amazing source of guidance, inspiration, support and love. To them I dedicate this dissertation.

# TABLE OF CONTENTS

| Section                                                     |
|-------------------------------------------------------------|
| Chapter 1. Introduction                                     |
| 1.1 Motivation                                              |
| 1.2 Thesis Organization                                     |
| Chapter 2. Low-Voltage Electrical Transceiver               |
| 2.1 Introduction                                            |
| 2.2 Receiver Architecture Considerations                    |
| 2.3 Receiver with Injection-Locked Oscillator               |
| 2.4 Experimental Results                                    |
| 2.5 Summary                                                 |
| Chapter 3. Adaptive Optical Receiver                        |
| 3.1 Introduction                                            |
| 3.2 Optical Transceiver Architecture                        |
| 3.3 Optical Forwarded-Clock Adaptive Receiver               |
| 3.4 Experimental Results                                    |
| 3.5 Summary                                                 |
| Chapter 4. Low-Voltage Decision-Feedback Equalizer          |
| 4.1 Introduction                                            |
| 4.2 Continuous-Time Linear Equalizer (CTLE)                 |
| 4.3 Decision-Feedback Equalizer (DFE)                       |
| 4.4 Charge-Based Dynamic Latch                              |
| 4.5 Prototype DFE Implementation                            |
| 4.5.1 DFE Architecture and Timing                           |
| 4.5.2 Charge-Based S/H                                      |
| 4.5.3 Integrating summer with common-mode restoration (CMR) |
| 4.6 Experimental Results                                    |
| 4.7 Summary                                                 |
| Chapter 5. Conclusion                                       |
| 5.1 Summary                                                 |
| 5.2 Recommendation for Future Work                          |
| Bibliography                                                |

# LIST OF FIGURES

| Figure |
|--------|
| Figure 1.1: Per-pin data rate vs. year for various I/O standards |
| Figure 2.1: Forwarded-clock I/O architecture |
| Figure 2.2: Forwarded-clock 1:N receiver architecture |
| Figure 2.3: Key receiver circuitry simulated performance versus supply voltage: (a) ring oscillator phase variation, (b) quantizer delay |
| Figure 2.4: Receiver power consumption versus de-multiplexing factor |
| Figure 2.5: Schematic of ILRO |
| Figure 2.6: Simulated impact of clock injection approach on phase spacing uniformity |
| Figure 2.7: Die micrograph of complete transceiver |
| Figure 2.8: Measured receiver deskew range |
| Figure 2.9: Measured transceiver energy efficiency |
| Figure 2.10: (a) BER bathtub curve (b) BER vs TX VDD |
| Figure 3.1: Silicon ring resonator-based wavelength-division-multiplexing (WDM) link |
| Figure 3.2: (a) Top and cross section views of carrier-injection silicon ring resonator modulator, (b) optical spectrum at through port |
| Figure 3.3: Measured quality factor and resonance wavelength of nine 2.5µm radius silicon ring modulators fabricated on an 8" 130nm CMOS SOI wafer |
| Figure 3.4: Photonic transceiver circuits prototype block diagram |
| Figure 3.5: Adaptive sensitivity-power data receiver |
| Figure 3.6: Inverter-based TIA front-end: (a) schematic, (b) simulated TIA common-mode output response to a 5mV power supply step |
| Figure 3.7: Optical receiver sensitivity-power adaption algorithm |
| Figure 3.8: Optical clock receiver |
| Figure 3.9: Test chip wirebonded to photodiodes for receiver testing |
| Figure 3.10: 8Gb/s receiver measurements: (a) BER versus input optical power with a 1V power supply. (b) Sensitivity (BER = $10^{-9}$) and power versus TIA supply |
| Figure 3.11: Integrated photodetector emulator circuit |
| Figure 3.12: 10Gb/s receiver measurements with photodetector emulator circuit: (a) BER versus input current with a 1V power supply. (b) TIA power scaling versus input current |
| Figure 3.13: Optically forwarded-clock receiver measurements: (a) 2GHz recovered clock waveform, (b) jitter versus input optical power |
| Figure 4.1: Schematic of a conventional CTLE |
| Figure 4.2: Simulation of CTLE performance scaling with supply voltage |
| Figure 4.3: Principle of DFE |
| Figure 4.4: Schematic and operation of the two-stage dynamic latch proposed by Schinkel |
| Figure 4.5: Required latch output swing into the summer |
| Figure 4.6: Schematic and operation of the proposed charge-based latch |
| Figure 4.7: The one-stack nature of proposed charge-based latch |
| Figure 4.8: Architecture of the proposed DFE |
| Figure 4.9: DFE timing |
| Figure 4.10: Simulated relationship between on-resistance of a PMOS pass-gate S/H and supply voltage |
| Figure 4.11: Schematic of proposed charge-based S/H circuit |
| Figure 4.12: Simulated ISF of proposed charge-based latch and S/H |
| Figure 4.13: Simulated gain of proposed charge-based latch and S/H |
| Figure 4.14: Schematic of the integrating summer |
| Figure 4.15: Headroom issue of integrating summer at lower supply voltage |
| Figure 4.16: Common mode restoration of integrating summer |
| Figure 4.17: Issue with using PMOS current source for CMR |
| Figure 4.18: Proposed circuit for summer CMR |
| Figure 4.19: Simulation of summer CMR |
| Figure 4.20: Simulated waveforms (a) 16Gb/s input (b) 4Gb/s S/H output (c) 4Gb/s summer output (d) 4Gb/s latch output |
| Figure 4.21: Die micrograph of the DFE test chip |
| Figure 4.22: Layout of the DFE core |
| Figure 4.23: DFE test chip measurement setup |
| Figure 4.24: 16Gb/s input eye diagram after a 13dB loss channel |
| Figure 4.25: Measured BER bathtub curve with the 13dB loss channel |
| Figure 4.26: 16Gb/s input eye diagram after a 18dB loss channel |
| Figure 4.27: Measured BER bathtub curve with the 18dB loss channel |
| Figure 4.28: Power consumption breakdown |
| Figure 4.29: Internal eye diagrams measured at summer output |

# LIST OF TABLES

| Table                                               |
|-----------------------------------------------------|
| Table 3.1 RX Performance Summary And Comparisons    |
| Table 4.1 Comparison with Previous Work             |
| Table 5.1 Comparison with Previously Published DFEs |


### **CHAPTER 1. INTRODUCTION**

### **1.1 Motivation**



Figure 1.1: Per-pin data rate vs. year for various I/O standards

The ever-increasing computational power available in integrated circuits drives up the demand for bandwidth in high-speed links. This trend is likely to accelerate with the shift to cloud computing, because data has to be transferred to servers to be processed and stored before being sent back to users. The evolution of I/O standards over the past few years is summarized in Figure 1.1 [1]. It can be observed that the data rate roughly doubles every four years. Note that the data shown is the per-pin data rate. Although the aggregate data rate can be increased by adding pins, it cannot scale indefinitely because of physical limits, and often cannot scale at all due to practical constraints such as size and cost.

At a given energy efficiency, the power of I/O circuits is proportional to their bandwidth. If I/O energy efficiency stagnates while bandwidth increases, I/O power will soon account for the majority of a chip's power consumption. The growth of I/O bandwidth is therefore sustainable only if energy efficiency continues to improve.
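The relationship between I/O power, energy efficiency and data rate can be written compactly. In the sketch below, the symbols $P_{\mathrm{I/O}}$, $E_b$ (energy per bit) and $R_b$ (data rate) are introduced here for illustration only; the numerical example uses the DFE operating point reported in Chapter 4 (16Gb/s at 0.25pJ/bit).

$$
P_{\mathrm{I/O}} = E_b \, R_b, \qquad \text{e.g.}\quad 0.25\,\mathrm{pJ/bit} \times 16\,\mathrm{Gb/s} = 4\,\mathrm{mW}
$$

Doubling $R_b$ at constant $E_b$ doubles $P_{\mathrm{I/O}}$, which is why energy per bit must fall as data rates rise.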

The focus of this work is to explore various techniques for low-power serial link receiver design. These include low-voltage operation, adaptive supply voltage scaling, and charge-based dynamic circuits.

### **1.2 Thesis Organization**

This dissertation presents three designs that employ various techniques to achieve some of the best energy efficiencies for serial link receivers and building blocks. Some of the work is designed to be integrated into a transceiver for a complete link. For these parts, the emphasis is on the receiver design, with an introduction to the full link architecture and a discussion of top-level design choices.

Chapter 2 presents a serial link receiver with scalable supply voltage for different data-rates for optimum energy efficiency. Since the link uses a forwarded-clocking architecture, an injection-locked ring oscillator (ILRO) is introduced for clock recovery and deskewing with wide jitter-tracking bandwidth and low power consumption overhead.

Chapter 3 applies the architecture presented in Chapter 2 to an optical receiver in a wavelength-division multiplexed (WDM) link. Per-channel adaptation is built into the front-end transimpedance amplifier (TIA), which usually accounts for the highest power consumption, to enable energy optimization in the presence of prevalent variation. Built-in monitoring and control circuits facilitate automatic adaptation of the link.

Chapter 4 focuses on the design of a low-voltage decision-feedback equalizer (DFE). The designs in previous chapters achieve state-of-the-art energy efficiency, but a more powerful equalizer is needed to extend the data rate above 10Gb/s. A charge-based latch is proposed that has the low power consumption of dynamic circuits while rivaling the speed of more power-hungry current-mode logic (CML) circuits. It is also leveraged as a high-bandwidth sample-and-hold (S/H) circuit. Lastly, a common-mode restoration (CMR) circuit is proposed to address the reduced headroom of low-voltage operation. The DFE is implemented in 65nm CMOS and is measured to operate at 16Gb/s at 0.7V with an energy efficiency of 0.25pJ/bit.

### **CHAPTER 2. LOW-VOLTAGE ELECTRICAL TRANSCEIVER**

### **2.1 Introduction**

Total I/O bandwidth demand is growing in high-performance systems due to the emergence of many-core microprocessors and in mobile devices in order to support the next generation of multi-media features. High-speed serial I/O energy efficiency must improve in order to enable continued scaling of these parallel computing platforms in applications ranging from data centers to smart mobile devices.

Significant I/O energy efficiency improvements necessitate both advances in electrical channel technologies and circuit techniques in order to reduce complexity and power consumption. Examples of advanced inter-chip physical interfaces include high-density interconnect and Flex cable bridges, which allow operation at data rates near 10Gb/s while only requiring modest equalization [2]. An I/O architecture that reduces clocking circuit complexity, while also allowing for wide-bandwidth jitter tracking, is a forwarded-clock system where a clock signal is transmitted in parallel with multiple data channels (Figure 2.1) [3], [4]. Furthermore, low-power transceivers often incorporate voltage-mode transmit drivers, as these output stages have the potential to consume one-quarter of the power compared to current-mode drivers [5].



Figure 2.1: Forwarded-clock I/O architecture

Further improvements in energy efficiency are possible through reduction of the supply voltage VDD into the near-threshold regime (VDD~0.65V). Previously, this has enabled excellent energy/computation for digital systems [6] due to the strong dependence of power on VDD. Leveraging supply scaling to improve energy efficiency motivates I/O architectures that employ a high level of output/input multiplexing, as this allows the parallel transmit and receive segments to operate at lower voltages [7]. However, challenges exist in the design of an efficient output-multiplexed voltage-mode driver due to the relatively large driver transistor sizes required for output impedance control, as well as the reduced supply headroom for the output stage regulator. Furthermore, widespread adoption of low-VDD transceivers has been limited due to questions regarding robust operation and severe sensitivity to process variations. In particular, the generation of precise multi-phase clocks and the ability to compensate for circuit mismatch are issues at both the transmitter and the receiver.

This chapter describes a near-threshold forwarded-clock I/O architecture developed in 65nm CMOS that is capable of 4.8-8Gb/s operation while achieving an energy efficiency of 0.47pJ/bit-0.66pJ/bit. First, we discuss key circuit trade-offs associated with supply scaling and the choice of multiplexing factor for the receiver. Next, we present a 1:8 input de-multiplexing receiver which employs eight parallel input samplers clocked from an 8-phase injection-locked oscillator that provides more than 1UI deskew range. The transceiver experimental results are then presented, followed by a conclusion.

### **2.2 Receiver Architecture Considerations**

At the receiver, the optimal input de-multiplexing ratio, in terms of power efficiency, is a function of the minimum voltage required to produce precise multi-phase clocks while maintaining adequate circuit speed. An input continuous-time linear equalizer (CTLE), consisting of an RC-degenerated differential amplifier, is used to compensate for the channel loss. Figure 2.2 shows a high-level diagram of the receiver architecture, in which the CTLE drives N quantizers clocked by multi-phase clocks from an ILRO locked to the forwarded clock.



Figure 2.2: Forwarded-clock 1:N receiver architecture

The ILRO also provides the ability to adjust for the skew between data and the sampling clock by adjusting its own free-running frequency, as demonstrated in [3].

CTLE equalization is chosen over transmit feed-forward equalization (FFE) in this transceiver architecture, as link modeling studies have found that a design with a CTLE consumes less power than a design without any equalization or a design with 2-tap TX equalization and no CTLE. This is because the CTLE allows for a peak gain above 0dB near the Nyquist frequency, which improves the sensitivity of the RX and allows the transmit output swing to be scaled down significantly. TX FFE, on the other hand, reduces the effective transmitted signal swing, placing more stringent requirements on the RX, and also increases the TX circuit complexity. This is especially true for voltage-mode drivers, where significant output-stage segmentation and pre-drive logic are often necessary to achieve a given equalization range and resolution, both in designs which control the output impedance [8] and in those that do not [9].

All of the receiver circuits share the same scalable power supply. A higher de-multiplexing ratio relaxes the quantization delay requirement for each quantizer, allowing quantization speed to be traded off for lower supply voltage. For the chosen quantizer structure, which is similar to [10], a near-quadratic power reduction is observed with supply voltage scaling.
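The trade-off between quantizer delay and de-multiplexing ratio can be illustrated with a short calculation. This is a minimal sketch under stated assumptions, not the sizing procedure used in this work: the function name, the `eval_fraction` parameter (the fraction of each quantizer's clock period available for evaluation) and the restriction to power-of-two ratios are all hypothetical, as are the delay values in the usage note.

```python
def min_demux_ratio(data_rate_bps, quantizer_delay_s, eval_fraction=0.5):
    """Smallest power-of-two ratio N whose evaluation window covers the delay.

    In a 1:N receiver each quantizer is clocked at data_rate/N, so its clock
    period is N unit intervals (UI). Assumption: only eval_fraction of that
    period is available for the quantizer to resolve (the rest is reset).
    """
    ui = 1.0 / data_rate_bps                       # one unit interval in seconds
    n_exact = quantizer_delay_s / (eval_fraction * ui)
    n = 1
    while n < n_exact:                             # round up to a power of two
        n *= 2
    return n
```

For example, at 8Gb/s (UI = 125ps) a hypothetical 200ps quantizer delay would call for a ratio of 4, while a slower 400ps quantizer at a lower supply would call for a ratio of 8.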



Figure 2.3: Key receiver circuitry simulated performance versus supply voltage: (a) ring oscillator phase variation, (b) quantizer delay

It is important to note that while a highly parallel architecture sees improved power efficiency by operating at lower voltage, several limitations prevent carrying out this methodology indefinitely. The first limitation is that lower overdrive and headroom reduce the performance of analog components in the critical high-speed path. In the case of the CTLE, a larger current is needed to maintain its bandwidth at a lower supply voltage, contradicting the effort to reduce power consumption. In turn, the larger current and lower headroom also limit the size of the load resistor, making it difficult to achieve the required gain. The second limitation is that the use of more quantizers in parallel increases the loading of the CTLE, thus decreasing its bandwidth. This loading includes the input capacitance of the quantizers themselves, as well as the wiring parasitics, which become more significant as longer wires are needed for higher parallelism. The third limitation is that the variation of certain blocks is more sensitive to supply voltage than that of others. For example, Figure 2.3 (a) shows the simulated phase mismatch from 100 Monte-Carlo runs of an 8-phase ring oscillator across different supply voltages. Here the phase mismatch is normalized to the UI value corresponding to the frequency achievable at a given supply voltage. It can be observed that $\sigma$ grows faster as the supply approaches the near-threshold region. In a receiver, large phase mismatch makes it difficult to align every clock edge for all the parallel quantizers to the proper position in the data eye simultaneously. As a result, the combined BER becomes worse as phase mismatch increases. While individual skew adjustment could be added to each clock phase, this comes at the expense of additional mismatch detection and correction circuitry.
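The normalization of phase mismatch to the UI can be illustrated with a toy Monte-Carlo model. This is only a sketch, not the transistor-level simulation behind Figure 2.3 (a): it assumes each phase spacing is set by one delay element with independent Gaussian mismatch, and the parameter `delay_sigma` (relative delay mismatch) is a hypothetical stand-in for the device variation that grows at low supply voltage.

```python
import random
import statistics

def phase_error_sigma_ui(n_phases=8, delay_sigma=0.02, runs=100, seed=0):
    """Standard deviation of clock-edge position error, in UI.

    Assumption: phase k is produced by accumulating k stage delays, each
    nominally 1.0 with relative Gaussian mismatch delay_sigma.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(runs):
        delays = [1.0 + rng.gauss(0.0, delay_sigma) for _ in range(n_phases)]
        period = sum(delays)
        ui = period / n_phases            # ideal spacing between adjacent phases
        edge = 0.0
        for k in range(1, n_phases):      # phase 0 is the reference edge
            edge += delays[k - 1]
            errors.append((edge - k * ui) / ui)
    return statistics.pstdev(errors)
```

Calling the function with a larger `delay_sigma` returns a larger normalized error, mirroring the trend of $\sigma$ increasing as mismatch worsens near threshold.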



Figure 2.4: Receiver power consumption versus de-multiplexing factor

To evaluate the effectiveness of different de-multiplexing ratio and supply voltage combinations in the presence of these limitations, three receivers with different de-multiplexing ratios and supply voltages are simulated. The de-multiplexing ratios are chosen according to the different quantizer delays shown in Figure 2.3(b) to meet the same 8Gb/s throughput target, with constant CTLE output bandwidth maintained for all three designs. Figure 2.4 summarizes the power consumption obtained from schematic simulations. Although the power consumption of quantizers and oscillator generally scales down with increased de-multiplexing factor and reduced supply voltage, the CTLE consumes the most power at 0.5V for the reasons discussed above.

This increase in CTLE power consumption cancels nearly all the power savings from scaling VDD from 0.6V to 0.5V. Moreover, comparator offset increases significantly at extremely low voltages [11], necessitating an excessive offset cancellation range. Considering the limited total power savings, the corresponding CTLE bandwidth degradation, and the increased susceptibility to variation, reducing the supply voltage below 0.6V exhibits diminishing returns.

### **2.3 Receiver with Injection-Locked Oscillator**

The receiver consists of an input CTLE that drives eight parallel data quantizers [11], each clocked by one of eight phases generated by an ILRO locked to an eighth-rate forwarded clock from the transmitter chip. While a multi-stage CTLE could potentially provide higher gain and peaking, it has lower bandwidth due to the additional pole in the signal path. The single-stage CTLE provides 8dB peaking, which is adequate for the 8.4dB attenuation at 4GHz. Alternatively, a multi-stage CTLE could operate at a lower supply voltage and still provide the same amount of gain and peaking. However, further reduction in supply voltage may not be practical since it also affects the robustness of other blocks. Moreover, the energy saving from a lower supply voltage is offset by the addition of another stage that consumes DC current.
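For reference, the peaking of an RC-degenerated differential amplifier can be related to its component values with the standard textbook small-signal model. The expression below is a sketch under assumptions rather than an analysis taken from this design: $g_m$ (input-pair transconductance), $R_D$ (load resistor), $R_S$ and $C_S$ (degeneration resistor and capacitor connected between the two sources) and $C_L$ (output load capacitance) are symbols introduced here for illustration.

$$
H(s) = \frac{g_m R_D}{1 + g_m R_S/2}\cdot\frac{1 + s R_S C_S}{\left(1 + \dfrac{s R_S C_S}{1 + g_m R_S/2}\right)\left(1 + s R_D C_L\right)}
$$

In this model the ratio of the degeneration pole to the zero, $1 + g_m R_S/2$, sets the ideal peaking when the output pole is far away. An 8dB peaking corresponds to $10^{8/20} \approx 2.51$, that is, $g_m R_S \approx 3.0$.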

Injection locking has been demonstrated as an energy-efficient scheme for both clock generation and de-skewing due to its reduced complexity relative to other approaches such as PLL- or DLL-based timing recovery [3], [12]. In addition, when ILRO-based de-skew is combined with aggressive supply voltage scaling, excellent receiver energy efficiency of <0.2pJ/b at 8Gb/s has been demonstrated in a previous work [11].



Figure 2.5: Schematic of ILRO

Figure 2.5 shows the ILRO used in this design, which consists of a 4-stage differential current-starved ring oscillator. The oscillation frequency is controlled by a tail current source that is split into two parts: one is controlled by an external frequency-locked loop to nominally oscillate at the forwarded eighth-rate frequency, and the other is controlled by a 6-bit binary code for de-skew. In order to enable ILRO operation over a wide frequency range, the relative strength between the frequency-tuning current source and the de-skewing current sources is adjustable, effectively decoupling the frequency tuning range from the de-skew step resolution. The frequency locking process, which is performed at start-up or during periodic link re-training, ensures that the ring oscillator free-running frequency is at the desired forwarded eighth-rate clock frequency. This also ensures that the ring oscillator operates near the center of the locking range before injection, and has enough tuning range to provide either positive or negative skew.
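The mapping from de-skew code to clock phase can be sketched with the classic Adler relation for injection-locked oscillators, in which the steady-state phase offset satisfies $\sin\theta = \Delta f / f_L$. This is a substituted textbook model derived for weak injection, not the analysis or measured transfer curve of this ILRO. The names `f_lsb_hz` (free-running frequency change per code step), `f_lock_hz` (one-sided lock range), the linear code-to-frequency mapping and the sign convention are all assumptions.

```python
import math

def deskew_phase_deg(code, f_lsb_hz, f_lock_hz, bits=6):
    """Steady-state phase shift (degrees) for a binary de-skew code.

    Assumptions: mid-code places the free-running frequency exactly at the
    injected frequency, each code step moves it by f_lsb_hz, and the phase
    follows Adler's steady-state relation sin(theta) = delta_f / f_lock.
    """
    if not 0 <= code < 2 ** bits:
        raise ValueError("code out of range")
    mid = 2 ** (bits - 1)
    delta_f = (code - mid) * f_lsb_hz     # free-running offset from injection
    ratio = delta_f / f_lock_hz
    if abs(ratio) > 1.0:
        raise ValueError("offset exceeds lock range; oscillator would unlock")
    return math.degrees(math.asin(ratio))
```

Centering the code at mid-scale reflects the requirement that the oscillator sit near the middle of the locking range so that both positive and negative skew are available.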



Figure 2.6: Simulated impact of clock injection approach on phase spacing uniformity

The forwarded differential clock is first buffered and converted to full scale before being distributed to the ILRO. It is then injected into two complementary oscillator stages through a coupling capacitor. As shown in the simulation results in Figure 2.6, this fixed-strength AC-coupled injection approach results in more uniform phase spacing than DC-coupled injection schemes that use V/I converters, such as the technique incorporated in [3], while exhibiting a similar locking range.

## **2.4 Experimental Results**

The receiver was fabricated as part of a complete transceiver test chip in a 65nm CMOS GP process. As shown in the die micrograph of Figure 2.7, the total active area for the transmitter is  $214 \times 104 \text{ }\mu\text{m}^2$ , while the receiver occupies  $139 \times 230 \mu\text{m}^2$ .



Figure 2.7: Die micrograph of complete transceiver

Figure 2.8 shows the measured de-skew range of the receiver ILRO versus data rate. When normalized to the clock period, the achievable de-skew range is more than 120° across the entire operating range. Since in the 1:8 de-multiplexing receiver 1UI is 45°, this translates into a de-skew range that exceeds 2UI.
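The conversion from degrees of the forwarded clock period to unit intervals can be checked with a short calculation. The sketch below is illustrative only; the function name `deskew_range_in_ui` is introduced here for the example and is not part of the design.

```python
def deskew_range_in_ui(range_deg: float, demux_ratio: int) -> float:
    """Convert a de-skew range in degrees of the clock period to UI.

    With a 1:demux_ratio receiver clocked at 1/demux_ratio of the data
    rate, one clock period (360 degrees) spans demux_ratio unit intervals.
    """
    degrees_per_ui = 360.0 / demux_ratio
    return range_deg / degrees_per_ui


if __name__ == "__main__":
    # 120 degrees of an eighth-rate clock, where 1 UI is 45 degrees.
    print(f"{deskew_range_in_ui(120.0, 8):.2f} UI")
```

For the measured 120° range this gives roughly 2.67 UI, consistent with the statement that the range exceeds 2UI.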

Figure 2.9 shows transceiver energy efficiency measurement results at various data rates and supply voltages. The transmitter and receiver share the same supply voltage of 0.6V at 4.8Gb/s and 0.65V at 6.4Gb/s. In order to achieve 8Gb/s operation, the transmitter requires a slightly higher 0.8V supply, relative to the 0.75V receiver supply. While the transceiver operates at the lowest voltage at 4.8Gb/s, optimal energy efficiency is achieved at 6.4Gb/s due to the amortization of the static power consumed in the final output line driver.



Figure 2.8: Measured receiver deskew range



Figure 2.9: Measured transceiver energy efficiency

The total transceiver energy-efficiency is 0.47pJ/b, with 0.3pJ/b and 0.17pJ/b efficiency achieved in the transmitter and receiver, respectively. Table 2.1 compares this design with recent energy-efficient serial links. On the receiver side, supply scaling and the use of an ILRO have resulted in a significant power efficiency improvement over similar designs with linear equalizers for moderate-loss channels.



Figure 2.10: (a) BER bathtub curve (b) BER vs TX VDD

## 2.5 Summary

A low-voltage wireline receiver for a sub-1mW serial link is presented. For the forwarded-clock receiver, the use of injection-locked oscillator de-skew and a high 1:8 de-multiplexing ratio receiver architecture allows operation at near-threshold supply voltages. Overall, this I/O architecture provides scalable voltage and data rate operation at energy-efficiency levels demanded by future systems.

|                         | [2]                    | [5]            | This Work[13]          |
|-------------------------|------------------------|----------------|------------------------|
| Technology              | 45nm CMOS              | 90nm CMOS      | 65nm CMOS              |
| Supply Voltage          | 0.8V/1.5V              | 1.2V           | 0.6-0.8V               |
| Data Rate               | 10Gb/s                 | 0.5-4Gb/s      | 4.8-8Gb/s              |
| Clocking                | Source-<br>Synchronous | Plesiochronous | Source-<br>Synchronous |
| RX Equalization         | None                   | CTLE           | CTLE                   |
| RX Energy<br>Efficiency | 0.75pJ/b               | 1.3pJ/b        | 0.17pJ/b               |

TABLE 2.1 RX PERFORMANCE SUMMARY AND COMPARISONS

## **CHAPTER 3. ADAPTIVE OPTICAL RECEIVER**

### **3.1 Introduction**

Optical channels provide the potential to overcome key interconnect bottlenecks and greatly improve data transfer efficiency due to their flat channel loss over a wide frequency range and also relatively small crosstalk and electromagnetic noise [14]. Another important feature of optical interconnects is the ability to combine multiple data channels on a single waveguide via wavelength-division-multiplexing (WDM) and greatly improve bandwidth density. In order to take advantage of these attractive properties, silicon photonic platforms are being developed to enable tightly integrated optical interconnects and future photonic interconnect network architectures [15]-[31]. One promising photonic device is the silicon ring resonator [15]-[19], which can be configured either as an optical modulator or WDM drop filter. Silicon ring resonator modulators/filters offer advantages of small size, relative to Mach-Zehnder modulators [20][21], and increased filter functionality, relative to electro-absorption modulators [22].



Figure 3.1: Silicon ring resonator-based wavelength-division-multiplexing (WDM) link

Silicon photonic links based on ring resonator devices provide a unique opportunity to deliver distance-independent connectivity whose pin-bandwidth scales with the degree of wavelength-division multiplexing. As shown in Figure 3.1, multiple wavelengths ( $\lambda_{1-4}$ ) generated by an off-chip continuous-wave (CW) laser are coupled into a silicon waveguide via an optical coupler. This off-chip laser can either be a distributed feedback (DFB) laser bank [32], which consists of an array of DFB laser diodes, or a comb laser [33], which is able to generate multiple wavelengths simultaneously. Implementing a DFB laser bank for dense WDM (DWDM) photonic interconnects (e.g. 64 wavelengths) is quite challenging due to area and power budget constraints. This motivates a single broad-spectrum comb laser source, such as InAs/GaAs quantum dot comb lasers which can generate a large number of wavelengths in the 1100nm to 1320nm spectral range with typical channel spacing of 50-100GHz and optical power of 0.2-1mW per channel [33]. While operating near the common 1310nm wavelength (O-band) does have slightly higher optical loss versus a 1550nm (C-band) system, this has negligible impact in short-reach interconnect applications. After coupling the CW laser light, transmit-side ring modulators insert data onto a specific wavelength through electro-optical modulation. These modulated optical signals propagate through the waveguide and arrive at the receiver side where ring filters drop the modulated optical signals of a specific wavelength at a receiver channel with photodetectors (PD) that convert the signals back to the electrical domain.



Figure 3.2: (a) Top and cross section views of carrier-injection silicon ring resonator modulator, (b) optical spectrum at through port

A basic silicon ring resonator consists of a straight waveguide coupled with a circular waveguide, as shown in Figure 3.2. Input light at the resonance wavelength mostly circulates in the circular waveguide, with only a small amount of optical power observed at the through port, resulting in the ring's spectrum at the through port displaying a notch-shaped characteristic. This resonance wavelength of the ring device is periodic, repeating over a free spectral range (FSR), and can be shifted by changing the effective refractive index of the waveguide through the free-carrier plasma dispersion effect [18]. Two common implementations of silicon ring resonator modulators include p-i-n junction-based carrier-injection devices [16][17], operating primarily in forward-bias, and carrier-depletion devices [19], operating primarily in reverse-bias. Although a depletion ring generally achieves higher modulation speeds relative to a carrier-injection ring due to the ability to rapidly change the depletion width, its modulation depth is limited due to the relatively low doping concentration in the waveguide to avoid excessive loss. In contrast, carrier-injection ring modulators can provide large refractive index changes and high modulation depths, but are limited by long minority carrier lifetimes.



Figure 3.3: Measured quality factor and resonance wavelength of nine 2.5µm radius silicon ring modulators fabricated on an 8" 130nm CMOS SOI wafer.

While ring-resonator-based photonic interconnects have the potential to offer both improved power efficiency and bandwidth density, reliability and robustness are major barriers to widespread adoption of ring-based silicon photonics [23]. A key challenge is the variation in resonance wavelength with temperature changes and fabrication tolerances. For example, Figure 3.3 shows that while a high quality factor is maintained for nine 2.5µm radius ring resonators spread across an 8" 130nm CMOS-compatible silicon-on-insulator (SOI) wafer, the 5.48nm resonance wavelength variation implies the need for a potentially wide resonance tuning range for robust operation. In order to relax this, system-level WDM channel-shuffling techniques are proposed that reduce the tuning to the order of FSR/N, where N is the WDM channel number [23][28]. A commonly proposed resonance wavelength tuning technique is to adjust the device's temperature with a resistor implanted close to the photonic device to heat the waveguide, thus changing the refractive index [29][30]. One potential issue with this approach is that the tuning speed, which is limited by the device thermal time constant (~ms), may necessitate long calibration times. Also, tuning power overhead can degrade overall link power efficiency [25][30].

Achieving reliable and efficient operation in silicon photonic interconnect systems with large variations in link budget components, such as photonic device properties and interface parasitics, is another important consideration. The link budget determines the receiver sensitivity, with various front-end circuits proposed for optical interconnects, such as regulated-cascode transimpedance amplifiers (TIA) [34][35],
feedback TIAs [29][31][36], and integrating topologies [37][38]. In the presence of variations, excessive sensitivity margins are often maintained for each channel to satisfy bit-error rate (BER) under worst case conditions. Having individual scalability for each channel reduces necessary margins, and therefore power consumption. One efficient approach to optimize receiver power efficiency versus data rate is to utilize supply-scaling with CMOS inverter-based feedback TIAs [36]. However, in order to leverage this approach for large channel-count systems, efficient control loops with per-receiver voltage regulators are required that allow for self-adaptation to the desired data rate and link budget conditions.

While efficient clocking architectures for receiver-side data retiming and deserialization are often neglected in optical interconnect designs [31][36], they are necessary to form a complete link. One approach is to utilize a continuously-running clock-and-data recovery (CDR) system [37] which allows the potential for plesiochronous operation between the transmitter and receiver. However, this generally consumes more power and area relative to mesochronous architectures which only require periodic training to optimize the receiver sampling position [29]. For mesochronous architectures, key considerations include achieving efficient receiver-side clock generation and sufficient jitter tracking of the incoming data to achieve the desired BER. Applying a forwarded-clock architecture, commonly used in electrical I/O systems [13][39], in a photonic WDM system offers the potential for improved high frequency jitter tolerance with minimal jitter amplification due to the clock and data signals experiencing the same delay over the common low-dispersive optical channel.

This chapter is organized as follows. The architecture of the transceiver circuits prototype is outlined in Section 3.2. An optical forwarded-clock adaptive sensitivity-power receiver that accommodates variations in input capacitance, modulator/photodetector performance, and link budget is proposed in Section 3.3. Experimental results of the electrical transceiver circuits prototype, fabricated in a 65nm CMOS technology, are then presented.



Figure 3.4: Photonic transceiver circuits prototype block diagram

#### **3.2 Optical Transceiver Architecture**

Figure 3.4 shows a block diagram of the CMOS photonic transceiver circuits prototype, with six transmitter and five receiver modules integrated in a 2mm<sup>2</sup> 65nm CMOS die. At the transmitter side, a half-rate CML clock is distributed to the six transmitter modules, where 8-bit parallel data is multiplexed to the full output data rate before being buffered by the modulator drivers. Two versions of the drivers are implemented. A differential driver, with approximately 0V average bias level, provides a 4Vpp output swing to allow for high-speed operation, while a single-ended driver provides a 2Vpp output swing on the modulator cathode and utilizes a bias-tuning DAC on the anode for an adjustable DC-bias level. These drivers are wirebonded to carrier-injection silicon ring resonator modulators, where continuous-wave light near 1300nm from a tunable laser is vertically coupled into the photonic device's input port. The modulated light is then coupled from the modulator's through port into a single-mode fiber for routing to the bias-based tuning photodetector used to stabilize the resonant wavelength and to the optical receiver modules for high-speed data recovery. At the receiver side, data is recovered by adaptive inverter-based TIA front-ends that trade off power for varying link budgets by employing on-die eye monitors and scaling the TIA supplies for the required sensitivity. The receive-side sampling clocks are produced from an optically-forwarded quarter-rate clock, which is amplified by a fixed-supply TIA before being passed to an injection-locked oscillator that produces four quadrature clocks routed to the four receiver data channels.



**3.3 Optical Forwarded-Clock Adaptive Receiver** 

Figure 3.5: Adaptive sensitivity-power data receiver

As shown in Figure 3.5, the data-channel receivers consist of an inverter-based TIA front-end followed by a bank of four quadrature-clocked comparators whose offsets are digitally calibrated to optimize receiver sensitivity. The quadrature sampling clocks, generated from an optical forwarded-clock receiver, are passed through a local digitally-controlled delay line for timing margin optimization and phase-spacing calibration. An additional parallel comparator with a 6-bit programmable threshold is introduced that serves as an eye monitor, setting the minimum voltage margin needed to correctly slice the input signal for a required bit-error rate. By comparing its output with the normal data comparator on the same clock phase, eye-closure can be detected before a bit-error actually occurs. This information is used to control a 6-bit R-2R voltage DAC that sets the LDO-generated TIA supply voltage to the minimum level required to achieve the sensitivity and bandwidth for a given bit-error rate.



Figure 3.6: Inverter-based TIA front-end: (a) schematic, (b) simulated TIA common-mode output response to a 5mV power supply step

Figure 3.6 shows the TIA front-end [26], which consists of three inverter stages with resistive feedback in the first and third stages. These inverter stages are biased around the trip-point for maximum gain with an offset control loop that subtracts the average photocurrent from the input node. The front-end's power supply level has a significant impact on gain, bandwidth, and noise performance, allowing for an efficient mechanism to trade off receiver sensitivity with power consumption. However, excessive supply fluctuations can result in front-end output common-mode variation if a simple single-ended low-pass filter is used in the offset control loop, which can impact overall receiver sensitivity. In order to reduce this common-mode variation, the feedback RC filter capacitor is split into equal halves that decouple to ground and to the adaptive supply. The RC filter bandwidth is set to 150 kHz, which is estimated to support a 2<sup>16</sup>-1 PRBS pattern at 10Gb/s. If a system is required to support longer run-length data patterns, techniques similar to the low-frequency equalizers [34] and baseline-wander correction circuits [41] used in electrical links offer potential solutions.
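The text does not state how the 150 kHz figure was estimated. One plausible sanity check, assumed here rather than taken from the design, is to compare the filter corner against the repetition frequency of the PRBS pattern: a periodic pattern of length 2<sup>n</sup>-1 bits has spectral lines only at multiples of its repetition rate, so this rate is the lowest non-DC frequency in the data. The function name below is hypothetical.

```python
def prbs_repetition_frequency_hz(bit_rate_hz: float, order: int) -> float:
    """Lowest non-DC spectral line of a repeating PRBS of the given order.

    A PRBS-n pattern repeats every 2**n - 1 bits, so its spectrum consists
    of lines spaced at bit_rate / (2**n - 1).
    """
    pattern_length_bits = 2**order - 1
    return bit_rate_hz / pattern_length_bits


if __name__ == "__main__":
    f_rep = prbs_repetition_frequency_hz(10e9, 16)
    print(f"PRBS16 at 10Gb/s repeats at {f_rep / 1e3:.1f} kHz")
```

Under this assumption, a 2<sup>16</sup>-1 pattern at 10Gb/s repeats at roughly 153 kHz, which is of the same order as the 150 kHz filter corner quoted above.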



Figure 3.7: Optical receiver sensitivity-power adaption algorithm

A differential transconductance stage then amplifies the difference between this filtered node and half the adaptive supply to produce the offset correction current. This reduces the output common-mode disturbance with a 5mV power supply step from 92mV with a simple single-ended low-pass filter to 1.5mV with the adaptive-supply referenced implementation.

The optical receiver sensitivity-power adaptation is split between a software-controlled outer loop, which monitors the bit-error rate and adjusts the voltage margin by setting the eye-monitor comparator threshold through a serial test interface, and an on-chip state machine that scales the front-end power supply level. Figure 3.7 summarizes the eye monitor and supply scaling state machine. The adaptation algorithm captures two consecutive bits D1 and D2, and proceeds only with a '01' pattern, which is the worst-case ISI condition. Next, the data comparator output (D2) is compared with the eye monitor output (D2') on the same clock phase, and an error is recorded if they differ. After a set number of total bits, a decision is made to reduce the power supply if no error is observed, or to increase the power supply if the error rate exceeds a preset threshold. In order to minimize dithering without the overhead of a large averaging counter, the power supply is left unchanged if the error rate is nonzero but below this threshold.
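The supply-scaling decision described above can be sketched as a small behavioral model. This is a minimal sketch under stated assumptions, not the on-chip implementation: the class name, the window length, the error limit, and the convention that a higher DAC code means a higher TIA supply are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SupplyScalingFsm:
    """Illustrative model of the eye-monitor-driven supply scaling loop."""

    dac_code: int = 63       # 6-bit DAC code; higher code = higher supply (assumed)
    window_bits: int = 1024  # bits observed per decision (assumed value)
    error_limit: int = 4     # errors per window that force an increase (assumed)
    bits_seen: int = 0
    errors: int = 0

    def observe(self, d1: int, d2: int, d2_monitor: int) -> None:
        """Process one bit.

        d1 is the previous bit, d2 the data comparator output, and
        d2_monitor the eye-monitor comparator output on the same phase.
        """
        self.bits_seen += 1
        # Only the '01' pattern (worst-case ISI) is checked for eye closure.
        if (d1, d2) == (0, 1) and d2 != d2_monitor:
            self.errors += 1
        if self.bits_seen >= self.window_bits:
            self._decide()

    def _decide(self) -> None:
        if self.errors == 0:
            # Margin to spare: step the supply down.
            self.dac_code = max(0, self.dac_code - 1)
        elif self.errors > self.error_limit:
            # Eye is closing: step the supply up.
            self.dac_code = min(63, self.dac_code + 1)
        # Otherwise hold the supply to avoid dithering.
        self.bits_seen = 0
        self.errors = 0
```

The hold region between zero errors and the error limit plays the role of the dead band mentioned in the text: the supply only moves when the window result is unambiguous.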

Figure 3.8 shows a block diagram of the clock receiver, which utilizes the same inverter-based TIA front-end, but with a constant 1V supply for minimal jitter. The TIA output is amplified to full CMOS levels by a multi-inverter stage main amplifier (MA) that also contains a duty-cycle control loop. Global skew adjustment between the clock and data channels is achieved by a subsequent digitally-controlled delay line, which provides approximately 130ps de-skew range. This clock is then converted from single-ended to differential full-rail signals for injection by AC-coupling into a two-stage differential oscillator that generates the quadrature clocks that are distributed to the four data receiver channels. The ILO has dummy injection buffers to reduce the quadrature mismatch caused by the differential injection, with post-ILO per-phase tunable delay buffers providing additional skew compensation. A relatively wide injection locking range of ~100MHz is achieved, with the free-running frequency set manually in this prototype via tuning of the tail current source. While not implemented in this prototype, a periodically-activated control loop could set the ILO free-running frequency equal to the injection clock [42] to reduce quadrature phase errors and provide increased robustness to PVT variations.



Figure 3.8: Optical clock receiver

# **3.4 Experimental Results**

A test chip consisting of 5 complete transceivers was fabricated in a TSMC 65nm GP process. The wirebonded die with photodiodes for receiver testing is shown in Figure 3.9.



Figure 3.9: Test chip wirebonded to photodiodes for receiver testing

In order to characterize the optical performance of the data receiver, an externally-modulated laser source is vertically coupled to a 150fF Cosemi LPD3012 photodiode which is wirebonded to the receiver input. This photodiode displays 1.0A/W responsivity at 1310nm. The BER measurements in Figure 3.10, taken with an Anritsu MP1800A signal quality analyzer, show that when the nominal 1V front-end power supply is utilized, a sensitivity of -9dBm is achieved at 8Gb/s for a BER=10<sup>-9</sup> with a 2<sup>7</sup>-1 PRBS data pattern. Relaxing the input sensitivity by ~2 dB with increased optical input power enables the adaptive TIA supply to decrease by 4%, resulting in a 14% reduction in TIA power.



Figure 3.10: 8Gb/s receiver measurements: (a) BER versus input optical power with a 1V power supply. (b) Sensitivity (BER = 10<sup>-9</sup>) and power versus TIA supply

As the data rate and BER performance of the current optical characterization are limited by ~1.5mm bondwires and ~200fF total capacitance, an on-chip current source (Figure 3.11) is used to emulate a high-speed waveguide photodetector capable of being tightly integrated with the optical receiver, either in a monolithic manner [28][31] or with microsolder bonding [29]. This test structure allows for receiver benchmarking and motivates future planned prototypes with microbump integration and Ge waveguide photodetectors in the same 130nm SOI photonics process as the ring modulators/filters. Note that an improved version of this photodetector emulator circuit would also include programmable input capacitance values to investigate the impact of different integration approaches.



Figure 3.11: Integrated photodetector emulator circuit

Figure 3.12 shows that the photodetector emulator enables operation at a higher data rate of 10Gb/s with an improved sensitivity of -18dBm at a BER=10<sup>-12</sup>, assuming a unity responsivity. This on-chip test setup also enables a wider range of supply scaling, with the automated control loop reducing the TIA power by ~40% as the input current is scaled from 16 to 60µA with a 50-100mV eye monitor margin. Refining the control state machine and using a more aggressive margin level could potentially achieve even more power savings, as overriding the automated control loop yields ~60% power reduction. The overhead of the eye monitor comparator and adaptation logic is estimated to be 160µW at 8Gb/s. This overhead can be further reduced by either stopping the operation of the eye monitor after adaptation or only activating it periodically.
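The optical sensitivities and emulator input currents quoted in this section are related through the photodetector responsivity. The sketch below, with a hypothetical function name, converts an optical power in dBm to a photocurrent. It assumes the quoted power maps directly to current through the responsivity and ignores extinction ratio and coupling loss.

```python
def optical_dbm_to_current_ua(power_dbm: float,
                              responsivity_a_per_w: float = 1.0) -> float:
    """Convert optical power in dBm to photocurrent in microamps."""
    power_w = 1e-3 * 10.0 ** (power_dbm / 10.0)  # dBm is referenced to 1 mW
    return power_w * responsivity_a_per_w * 1e6


if __name__ == "__main__":
    for p_dbm in (-9.0, -18.0):
        print(f"{p_dbm:6.1f} dBm -> {optical_dbm_to_current_ua(p_dbm):6.1f} uA")
```

With unity responsivity, -18dBm corresponds to about 16µA, which matches the lower end of the 16 to 60µA input current range used with the emulator.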



Figure 3.12: 10Gb/s receiver measurements with photodetector emulator circuit: (a) BER versus input current with a 1V power supply. (b) TIA power scaling versus input current

A similar optical test set-up is used to characterize the optical clock receiver. An optical clock signal is amplified by the clock receiver and quadrature clocks are generated by the ILO, with one of the 2GHz quadrature clocks used for the 8Gb/s data receiver clocking shown in Figure 3.13. The recovered clock jitter performance is a function of the input clock jitter and power, with the clock path introducing an additional 0.25ps<sub>rms</sub> jitter for -12dBm input power and able to generate sub-2ps<sub>rms</sub> total jitter down to -16dBm.

Table 3.2 shows the receiver performance comparison with previous works. While the optical receiver test configuration contributed to a dramatically higher input capacitance, a superior energy efficiency of 275fJ/bit is achieved with the adaptive power-sensitivity receiver. Relative to a 32nm electrical IO design optimized for moderate data rates and channel loss [42], the combined energy efficiency of the proposed 65nm optical transceiver circuits is comparable at near 1pJ/b. This provides strong motivation to leverage this photonic I/O architecture in a WDM system with multiple ~10Gb/s channels on a single waveguide, as state-of-the-art 40Gb/s electrical transceivers consume near 40pJ/b [43].
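The 275fJ/bit figure follows from the receiver power entries in Table 3.2 divided by the data rate. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the 8Gb/s data rate is assumed from the per-bit values listed next to the power numbers in the table.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/b: mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/b."""
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    tia_mw, other_mw, rate_gbps = 1.42, 0.78, 8.0  # values from Table 3.2
    print(f"TIA:   {energy_per_bit_pj(tia_mw, rate_gbps):.3f} pJ/b")
    print(f"Other: {energy_per_bit_pj(other_mw, rate_gbps):.3f} pJ/b")
    print(f"Total: {energy_per_bit_pj(tia_mw + other_mw, rate_gbps):.3f} pJ/b")
```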



Figure 3.13: Optically forwarded-clock receiver measurements: (a) 2GHz recovered clock waveform, (b) jitter versus input optical power

## 3.5 Summary

An adaptive optical receiver is presented along with an injection-locking-based forwarded-clock receiver. The receiver is designed as part of a complete transceiver for a WDM photonic link. The receiver is measured at 8Gb/s with an optical input and at 10Gb/s with an emulated electrical input. Compared with previous works, a competitive energy efficiency of 275fJ/bit is achieved by the proposed receiver.

|                        | This Work          | [29]              | [31]             | [42]                       | [43]               |
|------------------------|--------------------|-------------------|------------------|----------------------------|--------------------|
| Technology             | 65nm               | 40nm              | 130nm SOI        | 32nm                       | 40nm               |
| Input Cap              | >200fF             | 40~60fF           | 20fF             | N/A                        | N/A                |
| PD Responsivity        | 1A/W               | 0.7A/W            | 0.8A/W           |                            |                    |
| <b>RX Sensitivity</b>  |                    | -15dBm @ 10Gb/s   | -6dBm @ 25Gb/s   | 20mV<sub>ppd</sub>         | 20mV<sub>ppd</sub> |
| Optical input data     | -9dBm @ 8Gb/s      |                   |                  |                            |                    |
| Optical input clock    | -18dBm @ 2GHz      |                   |                  |                            |                    |
| Electrical input data  | <17µA @ 10Gb/s     |                   |                  |                            |                    |
| Electrical input clock | <8µA @ 2.5GHz      |                   |                  |                            |                    |
| <b>Power</b>           |                    | 3.95mW (0.40pJ/b) | 48mW (1.92pJ/b)  | 4.40mW @ 8Gb/s (0.55pJ/b)  | 1050mW (23.5pJ/b)  |
| TIA                    | 1.42mW (0.18pJ/b)  |                   |                  |                            |                    |
| Comparators/other      | 0.78mW (0.10pJ/b)  |                   |                  |                            |                    |
| <b>Area</b>            |                    | 0.008mm<sup>2</sup> | 0.48mm<sup>2</sup> | 0.02mm<sup>2</sup>       | 3.9mm<sup>2</sup>  |
| Clock RX               | 0.032mm<sup>2</sup> |                  |                  |                            |                    |
| Data RX                | 0.036mm<sup>2</sup> |                  |                  |                            |                    |

 TABLE 3.2 COMPARISON WITH PREVIOUS WORK

# CHAPTER 4. LOW-VOLTAGE DECISION-FEEDBACK EQUALIZER

### 4.1 Introduction

Equalization is an important part of communication systems. A signal from a transmitter must travel through a channel, or, in some cases, several sections of channels before it reaches a receiver. For wireless communication, the channel could be free space with certain reflection characteristics; for optical communication, it could be optical fiber; for electrical communication, it could be a telephone line, an Ethernet cable, or copper traces on a printed circuit board (PCB). These channels all have different capacities, that is, the maximum rate at which data can be transferred at a specified accuracy. Real-world channels all have limited capacities, and it is desirable to maximize the capacity of a given channel. Most channels in serial communication exhibit low-pass behavior, characterized by the 3-dB bandwidth. After passing through such a channel, the signal can be significantly distorted and attenuated at the receiver side, degrading the bit-error-rate (BER) of the link.

Equalization extends the bandwidth of a low-pass channel by inserting a high-pass filter into the signaling path, resulting in a flatter overall channel response. This high-pass filter could be placed on either the transmitter side (TX EQ) or the receiver side (RX EQ). A transmitter-side equalizer is usually implemented as a finite impulse response (FIR) filter. The drawbacks of TX EQ include: (1) The coefficients of the FIR filter need to be adapted to optimize the quality of the received signal. However, the TX cannot observe the quality of the received signal, so a back channel is usually required to retrieve this information from the RX. (2) The maximum output amplitude is limited by the TX supply voltage. The TX EQ cannot boost the high-frequency component of the output past this limit; rather, it suppresses the low-frequency component. In other words, TX EQ can compensate for the distortion from the channel, but not the attenuation.
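The second drawback can be illustrated with a peak-constrained FIR filter. The sketch below is an assumption-laden example, not the transmitter used in this work: it uses a hypothetical two-tap de-emphasis filter with taps `[1 - alpha, -alpha]`, whose absolute values sum to one so that the peak output never exceeds the supply-limited swing.

```python
import cmath
import math


def two_tap_deemphasis(alpha: float) -> list:
    """Two-tap FIR with unit peak swing: |c0| + |c1| = 1 (illustrative)."""
    return [1.0 - alpha, -alpha]


def fir_gain(taps, f_norm: float) -> float:
    """Magnitude of the FIR response at f_norm, in units of the symbol rate."""
    h = sum(c * cmath.exp(-2j * math.pi * f_norm * k)
            for k, c in enumerate(taps))
    return abs(h)


if __name__ == "__main__":
    taps = two_tap_deemphasis(0.25)
    print(f"DC gain:      {fir_gain(taps, 0.0):.2f}")  # low frequency is suppressed
    print(f"Nyquist gain: {fir_gain(taps, 0.5):.2f}")  # high frequency stays at full swing
```

The Nyquist-frequency gain stays at the full-swing value while the DC gain drops to `1 - 2*alpha`, so the response is flattened by attenuating low frequencies rather than by boosting high ones.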

RX EQ does not have the drawbacks of TX EQ mentioned above, although it comes with its own set of challenges. The rest of this chapter focuses solely on RX EQ.

This chapter is organized as follows. Section 4.2 provides a brief review of the CTLE and discusses its limitations under voltage scaling. Section 4.3 describes the principle of the DFE and its building blocks. A novel dynamic latch is proposed in Section 4.4 for low-voltage, low-power operation. Based on this latch, a complete DFE is implemented and presented in Section 4.5, followed by experimental results and a summary.

# 4.2 Continuous-Time Linear Equalizer (CTLE)



Figure 4.1: Schematic of a conventional CTLE

A conventional single-stage continuous-time linear equalizer is shown in Figure 4.1. Its transfer function can be written as:

$$H(s) = \frac{g_m R_L}{1 + \frac{g_m R_S}{2}} \cdot \frac{1 + \frac{s}{\omega_z}}{\left(1 + \frac{s}{\omega_{p1}}\right) \left(1 + \frac{s}{\omega_{p2}}\right)} \tag{4-1}$$

where

$$\omega_z = \frac{1}{R_S C_S} \tag{4-2}$$

$$\omega_{p1} = \frac{1 + \frac{g_m R_S}{2}}{R_S C_S} \tag{4-3}$$

$$\omega_{p2} = \frac{1}{R_L C_L} \tag{4-4}$$
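To make the roles of the zero and the two poles concrete, the transfer function (4-1) with (4-2) to (4-4) can be evaluated numerically. The sketch below is illustrative only: the function name and all component values are hypothetical and are not taken from the design.

```python
import math


def ctle_gain(f_hz: float, gm: float, r_l: float, r_s: float,
              c_s: float, c_l: float) -> float:
    """|H(j*2*pi*f)| of the single-stage CTLE described by (4-1) to (4-4)."""
    s = 1j * 2.0 * math.pi * f_hz
    peaking = 1.0 + gm * r_s / 2.0   # peaking factor 1 + gm*RS/2
    w_z = 1.0 / (r_s * c_s)          # zero, (4-2)
    w_p1 = peaking / (r_s * c_s)     # first pole, (4-3)
    w_p2 = 1.0 / (r_l * c_l)         # output pole, (4-4)
    h = (gm * r_l / peaking) * (1 + s / w_z) / ((1 + s / w_p1) * (1 + s / w_p2))
    return abs(h)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the response.
    params = dict(gm=5e-3, r_l=400.0, r_s=800.0, c_s=200e-15, c_l=20e-15)
    for f in (0.0, 1e9, 4e9, 8e9, 20e9):
        print(f"{f / 1e9:5.1f} GHz: |H| = {ctle_gain(f, **params):.3f}")
```

With these example values the DC gain is `gm*RL / (1 + gm*RS/2)`, and the gain rises toward, but stays below, the ideal peak `gm*RL` before the output pole rolls it off.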

Assuming that the combined voltage drop across M1 and the current source is $V_x$,  $R_L$  and  $I_d$  are related as follows:

$$I_d R_L = V_{DD} - V_x \tag{4-5}$$

Using a square-law approximation, the peak gain can be written as:

$$A_{peak} = g_m R_L = \sqrt{2\mu C_{ox} \frac{W}{L} (V_{DD} - V_x) R_L} \tag{4-6}$$
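The step from (4-5) to (4-6) can be made explicit. Assuming the usual square-law expression for the transconductance of M1 in saturation (a textbook relation, stated here as an assumption since it is not written out above):

$$g_m = \sqrt{2\mu C_{ox} \frac{W}{L} I_d}$$

Multiplying by $R_L$ and moving it under the square root gives

$$g_m R_L = \sqrt{2\mu C_{ox} \frac{W}{L} I_d R_L^2} = \sqrt{2\mu C_{ox} \frac{W}{L} \left(I_d R_L\right) R_L}$$

and substituting $I_d R_L = V_{DD} - V_x$ from (4-5) yields (4-6).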

This peak gain increases with  $R_L$  for given headroom. The maximum value of  $R_L$  is limited by the bandwidth at the output node:

$$\omega_{p2} = \frac{1}{R_L C_L} = k \cdot \omega_{Nyquist} \tag{4-7}$$

where k is the ratio of the second pole frequency to the Nyquist frequency. Equation (4-6) indicates that, for a given transistor size, reducing  $V_{DD}$  also reduces the CTLE peak gain. Moreover, the loading of the next stage,  $C_L$ , usually increases at lower supply voltage: transistor sizes are increased to compensate for the lower  $g_m$ , and a higher degree of sampler time-interleaving is required, both of which add to the CTLE load capacitance. In order to meet the bandwidth requirement in equation (4-7),  $R_L$  must be reduced accordingly. Since both  $V_{DD}-V_x$  and  $R_L$  decrease with  $V_{DD}$ , so does the peak gain  $A_{peak}$ .
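The combined effect of (4-6) and (4-7) can be sketched numerically. The example below is a rough illustration under stated assumptions: the function names are hypothetical, `beta` stands for the lumped term µC<sub>ox</sub>W/L, $V_x$ is held fixed, and the numbers (including the doubling of $C_L$ at the lower supply) are invented purely to show the trend.

```python
import math


def max_load_resistance(c_l: float, k: float, f_nyquist_hz: float) -> float:
    """Largest RL that keeps the output pole at k times Nyquist, per (4-7)."""
    return 1.0 / (k * 2.0 * math.pi * f_nyquist_hz * c_l)


def peak_gain(beta: float, vdd: float, vx: float, r_l: float) -> float:
    """Square-law peak gain from (4-6), with beta = mu * Cox * W / L."""
    return math.sqrt(2.0 * beta * (vdd - vx) * r_l)


if __name__ == "__main__":
    beta, vx, k, f_nyq = 10e-3, 0.3, 2.0, 4e9  # hypothetical values
    # Assumed operating points: lower VDD comes with a larger sampler load.
    for vdd, c_l in ((1.0, 20e-15), (0.6, 40e-15)):
        r_l = max_load_resistance(c_l, k, f_nyq)
        print(f"VDD={vdd:.1f}V  CL={c_l * 1e15:.0f}fF  RL={r_l:6.1f}ohm  "
              f"Apeak={peak_gain(beta, vdd, vx, r_l):.2f}")
```

Both the headroom term and the allowed load resistance shrink at the lower supply, so the peak gain falls faster than the supply voltage alone would suggest.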

Although the peaking factor  $1+g_mR_S/2$  is not directly affected by  $V_{DD}$  scaling, the reduced peak gain limits the CTLE output swing. Note that the peak gain can potentially be boosted by using larger device sizes. However, pushing this too far negatively affects  $A_{peak}$  as  $r_{OUT}$  approaches  $R_L$ , and  $C_L$  becomes dominated by CTLE self-loading.

To better understand the effect of supply voltage scaling on the CTLE, simulation results at different supply voltages are shown in Figure 4.2. Note that for different supply voltages, different demultiplexing ratios N are assigned to reflect loading from the following stage. The value of N at a particular supply voltage is determined as follows: first, the samplers that follow the CTLE are characterized to determine their maximum operating speed. Then an appropriate N is chosen so that the combined throughput of the samplers meets the data-rate requirement. Throughout the simulation, k and W/L are kept constant. It can be observed that, for the same demultiplexing ratio N, power consumption scales almost linearly, while peak gain decreases at a lower rate.

However, once N increases to compensate for the rise in sampler delay, peak gain drops significantly, and the increase in  $I_d$  overshadows the scaling of  $V_{DD}$ , resulting in higher power consumption. At 0.5V, the CTLE provides the lowest peak gain while consuming the highest power.

As shown above, as a classic analog component, the CTLE suffers from tightly coupled trade-offs among gain, bandwidth, peaking, and power consumption. Using inductor shunt peaking [44] could extend the bandwidth without incurring a power penalty, but on-chip inductors require large area, and thus may not be available for transceiver designs intended for dense interconnects. The analog nature also means that we cannot expect the capability of the CTLE to be significantly improved by each new generation of technology, which is optimized for digital circuits.

To benefit as much as possible from the latest and future technology nodes, it is desirable to look into equalizers that are more digital in nature. Another popular type of equalizer, the Decision-Feedback Equalizer (DFE), is intrinsically digital. It will be discussed in the next section.



Figure 4.2: Simulation of CTLE performance scaling with supply voltage

# 4.3 Decision-Feedback Equalizer (DFE)

In the time domain, the effect of channel loss manifests as inter-symbol interference (ISI). After going through a lossy channel, a pulse with the length of 1 symbol spreads into other symbols, as shown in Figure 4.3. When a sequence of symbols is transmitted, the portion of the pulse response that extends outside 1 symbol adds to or subtracts from adjacent symbols. To better illustrate this effect, the pulse response is sampled, and the samples after the main symbol are usually referred to as post cursors, like a1, a2, a3... in this case. The DFE can remove the ISI from post cursors by subtracting the residual response of previous symbols from the incoming current symbol. In other words, after previous decisions are made, they are fed back to the input to cancel the ISI, hence the name.
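The feedback principle can be illustrated with a short behavioral sketch. It assumes a noise-free channel, ideal tap weights equal to the post cursors, and arbitrary example values for a1, a2, a3; the function names are hypothetical and the model does not represent the circuit implementation.

```python
import random

def channel(bits, main, post):
    """Apply a sampled pulse response (main cursor + post cursors) to +/-1 symbols."""
    out = []
    for n in range(len(bits)):
        y = main * bits[n]
        for k, a in enumerate(post, start=1):
            if n - k >= 0:
                y += a * bits[n - k]
        out.append(y)
    return out

def dfe(samples, post):
    """Subtract post-cursor ISI using previous decisions, then slice."""
    decisions = []
    for n, y in enumerate(samples):
        for k, a in enumerate(post, start=1):
            if n - k >= 0:
                y -= a * decisions[n - k]
        decisions.append(1 if y >= 0 else -1)
    return decisions

random.seed(0)
tx = [random.choice((-1, 1)) for _ in range(1000)]
post_cursors = [0.6, 0.4, 0.2]  # assumed a1, a2, a3 (sum exceeds the main cursor)
rx = channel(tx, 1.0, post_cursors)

# Slicing without equalization versus slicing after decision feedback.
no_eq = [1 if y >= 0 else -1 for y in rx]
errors_no_eq = sum(a != b for a, b in zip(no_eq, tx))
errors_dfe = sum(a != b for a, b in zip(dfe(rx, post_cursors), tx))
print(errors_no_eq, errors_dfe)
```

Because the decisions are fed back as clean ±1 values, the cancellation in this idealized model is exact as long as the previous decisions are correct.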



Figure 4.3: Principle of DFE

Compared to a linear equalizer, a DFE has several advantages. First, it can have multiple feedback taps, each set to cancel the corresponding post cursor. It can also be used to cancel channel reflections, which are usually multiple symbols away from the main pulse. For many applications, such as backplanes, the channel can consist of multiple sections with various loss characteristics and impedance discontinuities. As a result, the channel response cannot be simply characterized with multiple poles, and it is very difficult to compensate for the frequency response with linear equalizers. Second, a DFE does not suffer from noise amplification like a linear equalizer. Because a digital decision is made before it is fed back to the input, input noise is rejected and does not propagate through the feedback loop. This is important because the density of parallel links is going up to meet the growing bandwidth requirement, and cross-talk noise power is concentrated at high frequency.

Another reason for this work to consider the DFE as a low-power equalization solution is that DFEs are largely digital in nature. As current and future technology nodes are all optimized for digital circuits, DFEs stand to benefit more from advanced processes.

DFEs are traditionally considered to be power-hungry. The summer consumes a large part of the power because its bandwidth has to be high enough so that it can settle in less than one unit interval (UI) for the following slicer to have enough time to make a decision. The efficiency of the summer can be vastly improved by using an integrating summer, first proposed in [45]. For its superior energy efficiency, an integrating summer suitable for low-voltage operation is adopted in this design, described later.

Apart from the summer, the other major contributors to DFE power are clocking and slicers/latches. Clocking can account for a significant part of total power if other building blocks are sufficiently efficient. In this design, we opt for CMOS clocking, which consumes only dynamic power, and this power scales quadratically with supply voltage. At the 0.7V target supply, the clock power is reduced by roughly half compared to that under nominal supply. The bandwidth requirement of clock distribution is relaxed by adopting a 1:4 demultiplexing architecture, so that only a quarter-rate clock is distributed.
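As a quick check of the factor of two, assume the standard dynamic-power expression $P = \alpha C V_{DD}^2 f$ with the same activity factor $\alpha$, clock load $C_{clk}$ and clock frequency $f_{clk}$ at both supplies, and assume a nominal supply of 1.0V (the value used for comparison elsewhere in this chapter):

$$\frac{P_{clk}(0.7\,\mathrm{V})}{P_{clk}(1.0\,\mathrm{V})} = \frac{\alpha C_{clk} f_{clk} (0.7)^2}{\alpha C_{clk} f_{clk} (1.0)^2} = 0.49$$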

The optimization of latch power consumption will be discussed in the next section.

#### 4.4 Charge-Based Dynamic Latch

As discussed in the previous section, the speed of a direct feedback DFE is limited by the delay of the critical path. In this section, we look into how to reduce the time to make a decision, which constitutes a large part of the critical path delay.

Shown in Figure 4.4 is a two-stage regenerative latch first proposed by Schinkel in 2007 [10]. It operates as follows. When clock CK is low, the intermediate nodes are reset to VDD, and the output nodes are grounded. When CK goes high, the intermediate nodes are pulled down at different rates depending on the input. This voltage difference is amplified by the second stage with positive feedback until it reaches full swing. The bottom half of Figure 4.4 shows the voltage waveforms of different nodes of the Schinkel latch during operation, highlighting its dynamic nature. This latch has a small sampling aperture time, which is determined only by the falling speed of the first stage. However, the second stage uses regeneration to reach full swing, and therefore has a large delay. This leads us to consider the possibility of reducing the output swing in exchange for shorter delay.



Figure 4.4: Schematic and operation of the two-stage dynamic latch proposed by Schinkel

The latch output goes into 1 of the 3 feedback taps in the summer, which is shown in Figure 4.5. It steers the current in the differential pair one way or the other to subtract the ISI from the current input depending on the decision of the previous bit. A full decision requires the differential pair to be saturated, with all the current flowing on one side only. If the differential pairs are not saturated, the summer loses its non-linearity, since the amount of feedback becomes proportional to the input. A DFE with somewhat linear feedback can still perform equalization. This is sometimes called a soft decision, as proposed in [46]. This technique can be exploited to reduce the critical path delay, but it causes the DFE to behave more like a linear equalizer and to suffer from the same drawbacks, such as noise amplification. Fortunately, saturating a differential pair does not require a full-swing input. To understand what input swing is required, we take a look at the summer into which the latch output is fed, as shown in Figure 4.5. With small input amplitude, the differential pair has linear gain. As the input amplitude increases, the transistor on the weak side turns off, and all of the current flows through the strong side. Depending on the overdrive voltage of the differential pair, it can be saturated with only a few hundred mV. This means that a non-rail-to-rail input swing is acceptable for the summer [47].
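To give a feel for the swing needed, the sketch below evaluates the textbook long-channel (square-law) model of a differential pair, in which the tail current is fully steered once the differential input reaches √2 times the overdrive voltage. The square-law model is an assumption made for illustration (short-channel devices deviate from it), and the function name and the 0.15V overdrive used in the example are illustrative choices.

```python
import math

def steering_fraction(vid: float, vov: float) -> float:
    """Fraction of tail current steered, (I1 - I2) / I_tail, for a square-law pair.

    vid: differential input voltage (V)
    vov: overdrive voltage of each device at balance (V)
    """
    limit = math.sqrt(2.0) * vov          # input at which the weak side turns off
    if abs(vid) >= limit:
        return math.copysign(1.0, vid)    # fully steered (saturated) pair
    x = vid / vov
    return x * math.sqrt(1.0 - x * x / 4.0)

vov = 0.15  # V, assumed overdrive
print(f"full steering at {math.sqrt(2.0) * vov * 1e3:.0f} mV differential")
for vid_mv in (50, 100, 150, 200, 300):
    print(vid_mv, round(steering_fraction(vid_mv / 1e3, vov), 3))
```

Under these assumptions a differential input well below the supply voltage already produces a full decision in the feedback tap.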



Figure 4.5: Required latch output swing into the summer

With this in mind, we propose a charge-based latch as shown in Figure 4.6. We keep the first-stage design, which is already fast, and use it for the second stage too. When CK goes high, the second stage starts discharging at the same time as the first stage. However, if the first stage discharges faster, the second stage is turned off before it is fully discharged, and the differential input is amplified and preserved at the output. Without regeneration, the second stage is now much faster. Compared to other low-output-swing latches like the CML latch, this design is fully charge-based [48] and consumes only dynamic power. This means that the energy consumed per operation also scales quadratically with supply voltage, making it more attractive for low-VDD operation. We note that the use of a similar structure has been reported for an ADC [49].



Figure 4.6: Schematic and operation of the proposed charge-based latch

The speed advantage of this charge-based latch can also be understood by looking at the headroom for each transistor, as shown in Figure 4.7. During the reset phase, when CK is low, the bottom NMOS transistors are turned off, and only the PMOS transistors are on to pull the output to VDD. When CK goes high and the NMOS differential pair turns on, its drain voltage starts at VDD and its source voltage is pulled to near ground. This makes the latch effectively a "one-stack" circuit. Therefore, the active transistors have ample headroom and larger gm. Having only one stack also enables the supply voltage to be scaled down further.



Figure 4.7: The one-stack nature of proposed charge-based latch

### 4.5 Prototype DFE Implementation

This section describes the architecture and timing of the implemented prototype DFE using the proposed charge-based latch, as well as key building blocks such as the S/H and the summer.

### 4.5.1 DFE Architecture and Timing

In order to relax the speed requirement and mitigate the lower circuit performance at low supply voltage, 1:4 demultiplexing is performed directly at the input so that all circuits can be clocked at quarter rate. The lower clock frequency also allows the use of CMOS clocking, which helps reduce power consumption because its power scales quadratically with supply voltage. As shown in Figure 4.8, the incoming data Din is sampled by four time-interleaved S/H circuits. Each sampled input then enters an individual summer where the ISI is removed. There are 3 feedback taps in each summer, which come from the latched outputs of previous decisions. The number of feedback taps is chosen partly to show the advantage of the direct feedback architecture: although 2-tap loop-unrolling DFEs are relatively common, going to 3-tap unrolling would require 32 slicers, which would add significant power and area. To the best knowledge of the author, a 3-tap loop-unrolling DFE has only been reported in [50].
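One way to arrive at the figure of 32 slicers is sketched below. It assumes that the count is taken over all four time-interleaved phases and that each unrolled tap doubles the number of speculative slicers per phase; the symbols $N_{phase}$ and $N_{tap}$ are introduced here only for illustration:

$$N_{slicer} = N_{phase} \times 2^{N_{tap}} = 4 \times 2^{3} = 32$$

Under the same assumption, a 2-tap unrolled quarter-rate DFE would need $4 \times 2^{2} = 16$ slicers.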



Figure 4.8: Architecture of the proposed DFE

The timing of the DFE is illustrated in Figure 4.9. Quadrature clock phases with 50% duty-cycle drive all the S/H, summers, and latches. To understand the timing constraint of the critical path, we start at the input when it is sampled by the S/H. This introduces a delay of  $td_{S/H}$ . The sampled input then goes to the summer, where it is integrated over the period of  $td_{summer}$ . Finally, the summer output is sampled by the following latch, which takes  $td_{C-Q}$  to generate the decision to be fed back to the next summer. To satisfy the critical timing constraint, we would normally have:

$$td_{S/H} + td_{summer} + td_{C-Q} < 1UI$$
(4-8)

However, the outputs of the S/H and of the latch from the previous phase enter the summer at the same time. In other words, if we overlap the delay of the S/H and the latch, we can relax the critical path timing constraint:

$$td_{S/H} + td_{summer} + td_{C-Q} < 1UI + td_{S/H}$$
(4-9)
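The two constraints can be compared numerically with a small helper. The delay values below are hypothetical placeholders chosen only to show a case that fails (4-8) but passes (4-9); they are not simulated or measured values from this design.

```python
def ui_ps(data_rate_gbps: float) -> float:
    """Unit interval in picoseconds for a given data rate in Gb/s."""
    return 1000.0 / data_rate_gbps

def meets_timing(td_sh, td_summer, td_cq, data_rate_gbps, overlap_sh=True):
    """Check constraint (4-8), or the relaxed constraint (4-9) when overlap_sh is True.

    All delays are in picoseconds.
    """
    budget = ui_ps(data_rate_gbps) + (td_sh if overlap_sh else 0.0)
    return td_sh + td_summer + td_cq < budget

# Hypothetical delays in ps.
delays = dict(td_sh=15.0, td_summer=35.0, td_cq=20.0)

print(ui_ps(16.0))                                                   # 1 UI at 16Gb/s
print(meets_timing(**delays, data_rate_gbps=16.0, overlap_sh=False))  # (4-8)
print(meets_timing(**delays, data_rate_gbps=16.0, overlap_sh=True))   # (4-9)
```

Since $td_{S/H}$ appears on both sides of (4-9), the relaxed constraint is equivalent to requiring $td_{summer} + td_{C-Q} < 1UI$.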

The other challenges in the design of this DFE are the S/H and the summer, which will be discussed in the following sections.



Figure 4.9: DFE timing

## 4.5.2 Charge-Based S/H

Another challenge in this design is the S/H. It must have enough bandwidth for the target data rate, and its sampling aperture must be small compared to 1 UI. Unfortunately, both bandwidth and aperture time degrade significantly at lower VDD.

For the commonly used pass-gate S/H circuit, the on-resistance is a strong function of gate overdrive voltage and therefore of supply voltage. Simulation of a PMOS pass-gate (Figure 4.10) shows that, as the supply drops from 1V to 0.7V, its on-resistance nearly doubles, which means its bandwidth is reduced by about half. Its aperture time also depends on the rise or fall time of the clock, which increases at lower VDD as well.
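The link between on-resistance and bandwidth follows from a first-order RC model of the pass-gate, sketched here under the assumption that the sampling capacitance (denoted $C_S$, a symbol introduced only for this illustration) does not change with supply voltage:

$$f_{-3dB} = \frac{1}{2\pi R_{on} C_S}, \qquad \frac{f_{-3dB}(0.7\,\mathrm{V})}{f_{-3dB}(1\,\mathrm{V})} = \frac{R_{on}(1\,\mathrm{V})}{R_{on}(0.7\,\mathrm{V})} \approx \frac{1}{2}$$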



Figure 4.10: Simulated relationship between on-resistance of a PMOS pass-gate S/H and supply voltage

Given the high speed and small aperture time of the previously proposed charge-based low-swing latch, we realized that it can also be made to function as an S/H. Several modifications are made. Cascode transistors are added to the second stage to reduce the common-mode drop, because the summer main tap requires a higher common-mode voltage. The cascode devices are connected to the input and would store residual ISI at the cascode nodes, so a PMOS device is also added to short the cascode nodes during latch reset (Figure 4.11).



Figure 4.11: Schematic of proposed charge-based S/H circuit

Figure 4.12 shows the simulated Impulse Sensitivity Function (ISF), which indicates the sampling aperture time [51][52]. A narrower impulse indicates a shorter aperture time. Both the charge-based S/H and latch achieve an aperture time of around 17ps, which is much smaller than 1 UI at 16Gb/s. The S/H and the latch use the same first stage, except for sizing, which explains the similarity of their aperture times. Their gain characteristics are shown in Figure 4.13. The S/H, indicated by the black curve, has unity gain and shows good linearity. The latch has a larger gain of about 2, and as a result its output saturates at larger input amplitudes. As a side note, the gain of the S/H can be designed to be greater than 1 to provide some amplification, but in this design it is kept at 1 to preserve the linear range of the following summer.



Figure 4.12: Simulated ISF of proposed charge-based latch and S/H

### 4.5.3 Integrating summer with common-mode restoration (CMR)

Design of the summer is the next challenge in this DFE design. With reduced supply, it is more difficult to maintain its dynamic range and linearity to allow for accurate ISI cancellation.



Figure 4.13: Simulated gain of proposed charge-based latch and S/H

An integrating structure [45] is chosen for the summer in this DFE, as shown in Figure 4.14. An integrating summer consumes less power than a resistor-loaded continuous-time summer for the same gain and bandwidth. The summer has a main input tap, linearized by source degeneration. Degeneration reduces the gain of the summer, which is undesirable because the following latch must then spend more time to reach sufficient output swing. Another way to improve linearity without this gain penalty is to bias the transistors in the differential pair with a larger overdrive voltage. This makes the main tap more linear at the cost of higher power consumption. However, due to the reduced headroom of this design, the overdrive voltage is limited, and relying on this approach alone does not result in good enough linearity. As a result, a combination of both approaches is adopted.

There are three feedback taps for ISI cancellation, and another tap to cancel the offset. Note that this takes into account the offset of the whole receiver signal path, including the S/H, the summer, and the latches. This calibration is done at start-up via a scan chain. During measurement, it was observed that this offset cancellation ability significantly improves the BER. As discussed earlier, the biasing of the feedback taps is set for a lower saturation voltage. In this design, the feedback tap transistors have an overdrive voltage of around 150mV.



Figure 4.14: Schematic of the integrating summer

Performance of the summer can suffer from insufficient headroom, as illustrated in Figure 4.15. The currents of all differential pairs are summed at the output nodes, which start at VDD and fall as integration continues. During the integration period, the differential pairs should be kept in saturation so that their output impedance remains high. If the output node voltage drops too low and the differential pairs enter the triode region, the differential gain is reduced due to the lower impedance. This also affects the linearity of the summer, causing incomplete ISI cancellation. The issue is more pronounced at lower VDD, where the integration headroom is directly reduced.
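The headroom limitation can be made concrete with a first-order ramp model of the output common mode. The sketch assumes a constant common-mode discharge current and a fixed minimum output voltage for saturation; the function name and all numeric values are hypothetical and chosen only to illustrate the trend with VDD.

```python
def time_in_saturation_ps(vdd, v_min, i_cm_ua, c_int_ff):
    """Time (ps) for the output common mode to ramp from vdd down to v_min.

    Assumes a constant common-mode discharge current i_cm_ua (uA) on an
    integration capacitance c_int_ff (fF), i.e. V(t) = vdd - i * t / C.
    v_min is the lowest output voltage that keeps the pairs in saturation.
    """
    if vdd <= v_min:
        return 0.0
    # (V * fF) / uA = 1e-15 / 1e-6 s = 1e-9 s = 1000 ps
    return (vdd - v_min) * c_int_ff / i_cm_ua * 1000.0

# Hypothetical values: 0.35 V minimum output, 100 uA per side, 20 fF per node.
for vdd in (1.0, 0.7):
    t = time_in_saturation_ps(vdd, 0.35, 100.0, 20.0)
    print(f"VDD = {vdd} V: pairs stay saturated for about {t:.0f} ps")
```

With these example numbers the usable integration window at 0.7V is roughly half of that at 1.0V, which is the effect the common-mode restoration is meant to counteract.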

One way to address the headroom issue is to restore the summer common-mode level, meaning that both outputs are raised by the same amount while the differential gain remains unchanged. This can be achieved by adding a pair of current sources that inject common-mode current into the summer outputs. As shown in Figure 4.16, after common-mode restoration, the differential pairs can stay in saturation, and linearity is preserved. Because this only changes the common mode of the output, the differential gain of the summer is not degraded. However, there is a problem with this seemingly simple solution.



Figure 4.15: Headroom issue of integrating summer at lower supply voltage

Figure 4.17 shows the problem with using simple PMOS current sources for injection. When integration starts, the summation nodes are at VDD. The PMOS current sources do not have enough headroom and operate in the triode region. As a result, they appear as a low impedance to the summation nodes. The reduced impedance at the integrating nodes degrades the summer differential gain and its linearity, which directly contradicts the goal of common-mode restoration.



Figure 4.16: Common mode restoration of integrating summer

Several approaches have been published to address this issue. In [53], a coupling capacitor is inserted between the injection current source and the summation node. The capacitor isolates the DC biasing point so that the current source output is set at a lower voltage than the summation point and stays in saturation, while still providing the boost through coupling. One drawback of this approach is that the coupling capacitor can be fairly large (50fF in the reference), which increases the total area of the summer. This in turn results in larger parasitic capacitance, which degrades the summer performance. In this work we propose a common-mode restoration circuit with a bootstrapped current source that requires much smaller area overhead.



Figure 4.17: Issue with using PMOS current source for CMR

To address the headroom issue without affecting the linearity of the summer, a charge-pump-based CMR is proposed. The schematic is shown in Figure 4.18. The PMOS transistor M2 is connected to the summer output and acts as a current source. Instead of drawing current from VDD, its current is provided by a capacitor. When the summer is being reset, the bottom plate of the capacitor is grounded, while the top plate is charged to VDD by transistor M1.



Figure 4.18: Proposed circuit for summer CMR

When integration starts, M1 turns off, and the bottom plate of the capacitor is switched to VDD. This bootstraps the top-plate voltage beyond VDD, thus providing sufficient headroom for M2 to stay in saturation, even if the summer output node is near VDD.
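The amount of boost can be estimated with a simple charge-conservation model. The sketch below assumes the top plate is floating once M1 turns off, lumps all parasitics at the top plate into a single capacitance to ground, and ignores the charge that M2 subsequently draws during integration. The function name and the capacitor values are hypothetical.

```python
def bootstrapped_top_plate(vdd, c_boot_ff, c_par_ff):
    """Top-plate voltage right after the bottom plate steps from 0 to vdd.

    The top plate is precharged to vdd and then left floating; the bottom-plate
    step couples through c_boot_ff and is attenuated by the parasitic
    capacitance c_par_ff from the top plate to ground.
    """
    return vdd + vdd * c_boot_ff / (c_boot_ff + c_par_ff)

# Hypothetical values at a 0.7 V supply.
print(bootstrapped_top_plate(0.7, 40.0, 0.0))   # ideal case, no parasitics
print(bootstrapped_top_plate(0.7, 40.0, 10.0))  # with 10 fF of parasitic load
```

Even with a sizable parasitic load, the top plate in this model sits well above VDD, which is what gives M2 its headroom.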

By keeping M2 in saturation and its output impedance high, the proposed bootstrapped current source causes minimal gain and linearity degradation to the summer. By using a high-density varactor for the capacitor, the common-mode restoration circuit adds only 32 µm<sup>2</sup> to each summer. This allows for a compact layout, which is critical in reducing parasitics for high-speed design.

The effectiveness of the proposed common-mode restoration circuit is verified in simulation. The top half of Figure 4.19 shows the summer common-mode voltage with and without restoration. Note that the summer is running at quarter-rate, and is sampled at about 1UI after integration starts. The proposed circuit provides more than 200mV boost of common-mode voltage. The lower graph shows the summer differential output voltage. The two waveforms almost overlap, until the reference summer without common-mode restoration eventually enters triode region and its differential gain falls off.



Figure 4.19: Simulation of summer CMR

To better illustrate the DFE operation, simulated output waveforms of different stages are shown in Figure 4.20. The 16Gb/s incoming data is generated by a 4-tap FIR filter. The three post-cursors are 0.75X, 0.5X and 0.25X of the main cursor, respectively, resulting in a total ISI of 1.5X of main cursor. Total input amplitude is 380mV peak-to-peak. Because the ISI amplitude is larger than the main tap, the input eye is completely closed and cannot be correctly recovered without equalization. The input is then demuxed down to quarter rate of 4Gb/s by the S/H, which preserves the ISI information with good linearity. The amplitude after S/H is also 380mV. The ISI is almost completely removed at the summer, as can be observed from the clean eye. The amplitude of summer output is 250mV. The following latch samples and amplifies the summer output, and provides the decision to the next summer. In this simulation, the output amplitude of the first latch is 600mV, more than enough to saturate the feedback differential pairs in the summer.
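The statement that the input eye is closed can be checked with a worst-case (peak-distortion) calculation using the cursor values quoted above. The sketch assumes ideal cancellation of each tap that is turned on; the function name is hypothetical, and the result is normalized to the main cursor.

```python
def worst_case_eye(main, post_cursors, taps_cancelled):
    """Worst-case vertical eye opening, in units of the main cursor.

    The first `taps_cancelled` post cursors are assumed to be removed perfectly
    by the DFE; the remaining ones subtract from the main cursor in the worst
    case. A value of zero or less means the eye is closed.
    """
    residual = sum(abs(a) for a in post_cursors[taps_cancelled:])
    return main - residual

post = [0.75, 0.5, 0.25]  # post cursors relative to the main cursor (from the text)
for k in range(4):
    print(f"{k} taps cancelled: opening = {worst_case_eye(1.0, post, k)}")
```

With no taps enabled the opening is negative, which matches the completely closed input eye, and it recovers to the full main-cursor amplitude once all three post cursors are cancelled.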



Figure 4.20: Simulated waveforms (a) 16Gb/s input (b) 4Gb/s S/H output (c) 4Gb/s summer output (d) 4Gb/s latch output

# **4.6 Experimental Results**

A test chip was fabricated in a 65nm General Purpose (GP) process. The chip occupies 1mm by 0.8mm (Figure 4.21), while the DFE core area, including the S/H, summers, common-mode restoration, and latches, is only 60µm by 60µm (Figure 4.22). This compact layout helps minimize parasitics and improve bandwidth.



Figure 4.21: Die micrograph of the DFE test chip





Figure 4.22: Layout of the DFE core

The DFE was tested on a probe station in the setup shown in Figure 4.23. A PRBS pattern generated by a Tektronix BSA-260C BERT passes through a test channel board with various channels, and is applied at the DFE input through a high-speed probe. This way the quality of the input signal can be precisely characterized and controlled. The DFE is driven by quadrature clocks generated by an on-chip clock divider. An on-chip buffer is connected to one of the summers to allow for probing of the internal eye.



Figure 4.23: DFE test chip measurement setup

A channel with 13dB loss at the 8GHz Nyquist frequency is used in this measurement. The eye at the DFE input is critically closed, as shown in Figure 4.24. Measured BER bathtub curves with different numbers of feedback taps are shown in Figure 4.25. With this channel, the DFE operates at 0.65V with an eye opening of 53% UI.



Figure 4.24: 16Gb/s input eye diagram after a 13dB loss channel



Figure 4.25: Measured BER bathtub curve with the 13dB loss channel



Figure 4.26: 16Gb/s input eye diagram after a 18dB loss channel



Figure 4.27: Measured BER bathtub curve with the 18dB loss channel

The DFE is also measured with another channel with higher loss. With 18dB loss at Nyquist, the input eye is completely closed, and cannot be recovered with fewer than 2 taps of equalization. With 3 taps turned on, the DFE achieves an eye opening of 46% UI. A higher supply voltage of 0.7V is required, mainly to extend the summer dynamic range in the presence of more ISI.

Operating at 16Gb/s, the DFE consumes 4mW from a 0.7V supply, of which 1.4mW is consumed in the summers and DACs, 1.5mW in clocking, and 1.1mW in the latches and other circuits. The clock power does not include the power of the clock divider and the CML-to-CMOS converter.



Figure 4.28: Power consumption breakdown



Figure 4.29: Internal eye diagrams measured at summer output

The effectiveness of the DFE is also verified through the internal eye monitor. With the 18dB loss channel, the eye at the summer output is completely closed when there is no equalization. The eye gradually opens as more feedback taps are turned on, and the ISI is mostly removed with 3 feedback taps.

Compared to recently published DFEs, this work achieves a similar data rate with the lowest supply voltage, and in a slower process. The energy efficiency is 0.21pJ/bit at 0.65V and 0.25pJ/bit at 0.7V, which is lower than that of previously published work.
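The quoted efficiency figures follow directly from power divided by data rate. The short check below uses the power numbers reported for this work (3.3mW at 0.65V from Table 4.3, 4mW at 0.7V, and the 0.7V breakdown given earlier); the function name is hypothetical.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy efficiency in pJ/bit; mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / data_rate_gbps

# Reported power at 16Gb/s for the two supply settings.
print(round(energy_per_bit_pj(3.3, 16.0), 2))  # 0.65 V operating point
print(round(energy_per_bit_pj(4.0, 16.0), 2))  # 0.7 V operating point

# Power breakdown at 0.7 V (mW) should add up to the 4mW total.
breakdown_mw = {"summers and DACs": 1.4, "clocking": 1.5, "latches and other": 1.1}
print(round(sum(breakdown_mw.values()), 1))
```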

| References                         | [54]                         | [55]                          | [56]                           | This Work (0.65V)             | This Work (0.7V)              |
|------------------------------------|------------------------------|-------------------------------|--------------------------------|-------------------------------|-------------------------------|
| Data Rate (Gb/s)                   | 15                           | 20                            | 16                             | 16                            | 16                            |
| Process                            | 45nm SOI                     | 45nm SOI                      | 40nm GP                        | 65nm GP                       | 65nm GP                       |
| Equalization                       | 2-tap DFE                    | CTLE +<br>1-tap DFE           | Passive LE +<br>1-tap DFE      | 3-tap DFE                     | 3-tap DFE                     |
| Clocking                           | Half Rate                    | Half Rate                     | Half Rate                      | Quarter Rate                  | Quarter Rate                  |
| Supply (V)                         | 1.2                          | 1.2                           | 1.0                            | 0.65                          | 0.7                           |
| Channel Loss (dB)                  | 14.5                         | 26.3                          | 15                             | 13                            | 18                            |
| Timing Margin                      | 34%<br>BER < 10<sup>-8</sup> | 26%<br>BER < 10<sup>-12</sup> | >25%<br>BER < 10<sup>-12</sup> | 53%<br>BER < 10<sup>-12</sup> | 46%<br>BER < 10<sup>-12</sup> |
| Power (mW)<br>(Including Clocking) | 7.5                          | 13.2                          | 9.25                           | 3.3                           | 4                             |
| Energy Efficiency<br>(pJ/b)        | 0.50                         | 0.66                          | 0.59                           | 0.21                          | 0.25                          |

TABLE 4.3 COMPARISON WITH PREVIOUSLY PUBLISHED DFES

## 4.7 Summary

In conclusion, a low-VDD, 16Gb/s, 3-tap DFE is presented. Several techniques are proposed to overcome performance degradation at 0.65V to 0.7V, including the charge-based latch and S/H, and summer common-mode restoration. The DFE achieves the best energy efficiency to date, and can scale better with more advanced processes due to its mostly digital nature.

## **CHAPTER 5. CONCLUSION**

### 5.1 Summary

In this thesis, we have explored techniques for improving the energy efficiency of serial link receivers. Architectures of both electrical and optical links are analyzed for their power consumption. Three designs based on the chosen architectures are implemented with techniques such as low-voltage operation, adaptive supply voltage scaling, and charge-based circuits. Techniques such as headroom compensation are also proposed to address the issues arising from low-voltage operation.

A low-voltage receiver with ILRO-based clock recovery is proposed first. The trade-offs of demultiplexing ratios are analyzed and a 1:8 demultiplexing architecture is chosen to relax the speed limitation at lower supply voltage. The receiver works under 0.6-0.8V at 4.8-8Gb/s, achieving peak energy efficiency of 0.17pJ/bit.

Next, a similar architecture is adopted for an optical receiver intended for a WDM link system. The advantages of a WDM optical link compared to electrical links are discussed. Given the potentially high degree of variation in such a system, an adaptive supply voltage scaling scheme is proposed to maximize energy efficiency on a per-channel basis. The receiver is measured in both electrical and optical setups.

Finally, a DFE is presented that uses a charge-based latch to enable operation beyond 10Gb/s under a 0.7V supply. It uses an integrating summer with common-mode restoration to compensate for reduced headroom. It is verified to work with a 13dB loss channel at 16Gb/s under a 0.65V supply, or with an 18dB loss channel under 0.7V at the same data rate.

#### **5.2 Recommendation for Future Work**

Pushing the energy efficiency of high-speed links is very challenging, and many important issues cannot be thoroughly addressed within the scope of this thesis. One example is clocking. As discussed in previous chapters, clocking accounts for a major portion of the total power budget, and the margin for clock skew and jitter is getting thinner with increasing data rates. Although clock calibration has been implemented to compensate for the increased sensitivity at low voltage, a fully automatic clock calibration scheme is highly desirable and could enable further clock power reduction.

Another potential area of work is to combine the proposed low-power clock recovery and high-speed equalizer to build a more complete high-speed, low-power receiver. We have explored applying the charge-based circuit to a CDR, which shows promise for further integration of a complete set of receiver building blocks.

# **BIBLIOGRAPHY**

[1] ISSCC 2013 Trends [Online]. Available: http://isscc.org/doc/2013/2013\_Trends.pdf

[2] F. O'Mahony, J. E. Jaussi, J. Kennedy et al., "A 47x10Gb/s 1.4mW/Gb/s parallel interface in 45nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2828-2837, Dec. 2010.

[3] K. Hu, T. Jiang, J. Wang, F. O'Mahony, and P. Y. Chiang, "A 0.6 mW/Gb/s, 6.4-7.2 Gb/s Serial Link Receiver Using Local Injection-Locked Ring Oscillators in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 45, no. 4, pp. 899-908, Apr. 2010.

[4] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links – a tutorial," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 1, pp. 17-39, Jan. 2009.

[5] R. Inti et al., "A highly digital 0.5-4Gb/s 1.9mW/Gb/s serial-link transceiver using current-recycling in 90nm CMOS," ISSCC Dig. Tech. Papers, pp. 152-153, Feb. 2011.

[6] A. P. Chandrakasan et al., "Technologies for ultradynamic voltage scaling," Proc. IEEE, vol. 98, no. 2, pp. 191-214, Feb. 2010.

[7] J. Kim and M. Horowitz, "Adaptive supply serial links with sub-1V operation and per-pin clock recovery," IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1403-1413, Nov. 2002.

[8] W. D. Dettloff, J. C. Eble, L. Luo, P. Kumar, F. Heaton, T. Stone, and B. Daly, "A 32 mW 7.4 Gb/s protocol-agile source-series terminated transmitter in 45 nm CMOS SOI," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2010, pp. 370–371.

[9] R. Sredojevic and V. Stojanović, "Digital link pre-emphasis with dynamic driver impedance modulation," Proc. IEEE Custom Integrated Circuits Conf., San Jose, CA, Sep. 2010, pp. 1–4.

[10] D. Schinkel et al., "A Double-Tail Latch-Type Voltage Sense Amplifier with 18ps Setup+Hold Time," in Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International, 2007, pp. 314-605.

[11] K. Hu et al., "0.16-0.25 pJ/bit, 8Gb/s near-threshold serial link receiver with super-harmonic injection locking", IEEE Journal of Solid-State Circuits, vol. 47, no. 8, pp. 1842-1853, Aug. 2012.

[12] M. Hossain and A. Chan Carusone, "7.4 Gb/s 6.8 mW Source Synchronous Receiver in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 46, no. 6, pp. 1337-1348, Jun 2011.

[13] Y.-H. Song, R. Bai, K. Hu, H.-W. Yang, P. Chiang, and S. Palermo, "A 0.47–0.66 pJ/bit, 4.8–8 Gb/s I/O transceiver in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 5, pp. 1276-1289, May 2013.

[14] I. Young, E. Mohammed, J. Liao, A. Kern, S. Palermo, B. Block, M. Reshotko, and P. Chang, "Optical I/O technology for tera-scale computing," IEEE Journal of Solid-State Circuits, vol. 45, no. 1, pp. 235-248, Jan. 2010.

[15] M. Lipson, "Compact Electro-Optic Modulators on a Silicon Chip," IEEE Journal of Selected Topics in Quantum Electronics, vol. 12, no. 6, pp. 1520-1526, Nov/Dec. 2006.

[16] Q. Xu, S. Manipatruni, B. Schmidt, J. Shakya, and M. Lipson, "12.5 Gbit/s carrier-injection-based silicon micro-ring silicon modulators" Opt. Express, vol. 15, no. 2, pp. 430-436, Jan. 2007.

[17] Z. Peng, D. Fattal, M. Fiorentino, and R. Beausoleil, "CMOS-compatible microring modulators for nanophotonic interconnect," in Integrated Photonics Research, Silicon and Nanophotonics (IPRSN), July 2010.

[18] G. T. Reed, G. Mashanovich, F. Y. Gardes, and D. J. Thomson, "Silicon optical modulators," Nature Photonics, vol. 4, pp. 518-526, July 2010.

[19] G. Li, X. Zheng, J. Yao, H. Thacker, I. Shubin, Y. Luo, K. Raj, J. E. Cunningham, and A. V. Krishnamoorthy, "High-efficiency 25Gb/s CMOS ring modulator with integrated thermal tuning," 8th IEEE International Conference on Group IV Photonics (GFP), 2011, pp. 8 – 10.

[20] L. Liao, A. Liu, J. Basak, H. Nguyen, and M. Paniccia, "Silicon photonic modulator and integration for high-speed applications," IEEE International Electron Devices Meeting (IEDM), 2008.

[21] B. G. Lee, A. V. Rylyakov, W. M. J. Green, S. Assefa, C. W. Baks, R. Rimolo-Donadio, D. M. Kuchta, M. H. Khater, T. Barwicz, C. Reinholm, E. Kiewra, S. M. Shank, C. L. Schow, and Y. A. Vlasov, "Four- and eight-port photonic switches monolithically integrated with digital CMOS logic and driver circuits," IEEE-OSA Optical Fiber Communications Conference, Mar. 2013, pp. 1-3.

[22] J. E. Roth, S. Palermo, N. C. Helman, D. P. Bour, D. A. B. Miller, and M. Horowitz, "An optical interconnect transceiver at 1550nm using low-voltage electroabsorption modulators directly integrated to CMOS," IEEE-OSA Journal of Lightwave Technology, vol. 25, no. 12, pp. 3739-3747, Dec. 2007.

[23] A. V. Krishnamoorthy, X. Zheng, G. Li, J. Yao, T. Pinguet, A. Mekis, H. Thacker, I. Shubin, Y. Luo, K. Raj, and J. E. Cunningham, "Exploiting CMOS manufacturing to reduce tuning requirements for resonant optical devices," IEEE Photonics Journal, vol. 3, no. 3, pp. 567-579, June 2011.

[24] B. R. Moss, C. Sun, M. Georgas, J. Shainline, J. S. Orcutt, J. C. Leu, M. Wade, Y. Chen, K. Nammari, X. Wang, H. Li, R. Ram, M. A. Popovic, and V. Stojanovic, "A 1.23pJ/b 2.5Gb/s monolithically integrated optical carrier-injection ring modulator and all-digital driver circuit in commercial 45nm SOI," in IEEE ISSCC Dig. Tech. Papers, Feb. 2013, pp. 126-127.

[25] G. Li, X. Zheng, J. Yao, H. Thacker, I. Shubin, Y. Luo, K. Raj, J. E. Cunningham, and A. V. Krishnamoorthy, "25Gb/s 1V-driving CMOS ring modulator with integrated thermal tuning," Optics Express, vol. 19, no. 21, pp. 20435-20443, Oct. 2011.

[26] C. Li, R. Bai, A. Shafik, E. Z. Tabasy, G. Tang, C. Ma, C. Chen, Z. Peng, M. Fiorentino, P. Chiang, and S. Palermo, "A ring-resonator-based silicon photonics transceiver with bias-based wavelength stabilization and adaptive-power-sensitivity receiver," in IEEE ISSCC Dig. Tech. Papers, Feb. 2013, pp. 124-125.

[27] C.-H. Chen, C. Li, R. Bai, A. Shafik, M. Fiorentino, Z. Peng, P. Chiang, S. Palermo, and R. Beausoleil, "Hybrid integrated DWDM silicon photonic transceiver with self-adaptive CMOS circuits," in IEEE Optical Interconnects Conference, May 2013, pp. 122-123.

[28] M. Georgas, J. Leu, B. Moss, C. Sun, and V. Stojanovic, "Addressing link-level design tradeoffs for integrated photonic interconnects," IEEE Custom Integrated Circuits Conference, Sept. 2011, pp. 1-8.

[29] F. Y. Liu, D. Patil, J. Lexau, P. Amberg, M. Dayringer, J. Gainsley, H. F. Moghadam, X. Zheng, J. E. Cunningham, A. V. Krishnamoorthy, E. Alon, and R. Ho, "10-Gbps, 5.3-mW optical transmitter and receiver circuits in 40-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 47, no. 9, pp. 2049-2067, Sept. 2012.

[30] C. Sun, E. Timurdogan, M. R. Watts, and V. Stojanovic, "Integrated microring tuning in deep-trench bulk CMOS," IEEE Optical Interconnects Conference, May 2013, pp. 54-55.

[31] J. Buckwalter, X. Zheng, G. Li, K. Raj, and A. Krishnamoorthy, "A monolithic 25-Gb/s transceiver with photonic ring modulators and Ge detectors in a 130-nm CMOS SOI process," IEEE J. Solid-State Circuits, vol. 47, no. 6, pp. 1309-1322, June 2012.

[32] A. Liu, L. Liao, D. Rubin, J. Basak, H. Nguyen, Y. Chetrit, R. Cohen, N. Izhaky, and M. Paniccia, "High-speed silicon modulator for future VLSI interconnect," in Integrated Photonics and Nanophotonics Research and Applications, OSA Technical Digest (CD), Optical Society of America, paper IMD3 (2007).

[33] G. Wojcik, D. Yin, A. Kovsh, A. Gubenko, I. Krestnikov, S. Mikhrin, D. Livshits, D. Fattal, M. Fiorentino, and R. Beausoleil, "A single comb laser source for short reach WDM interconnects," SPIE Photonics West, 7230-21 (2009).

[34] S. M. Park and H. Yoo, "1.25-Gb/s regulated cascode CMOS transimpedance amplifier for Gigabit Ethernet applications," IEEE Journal of Solid-State Circuits, vol. 39, no. 1, pp. 112-121, Jan. 2004.

[35] C. Li and S. Palermo, "A low-power 26-GHz transformer-based regulated cascode SiGe BiCMOS transimpedance amplifier," IEEE J. Solid-State Circuits, vol. 48, no. 5, pp. 1264-1275, May 2013.

[36] J. Proesel, C. Schow, and A. Rylyakov, "25Gb/s 3.6pJ/b and 15Gb/s 1.37pJ/b VCSEL-based optical links in 90nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 418-419.

[37] S. Palermo, A. Emami-Neyestanak, and M. Horowitz, "A 90 nm CMOS 16 Gb/s transceiver for optical interconnects," IEEE Journal of Solid-State Circuits, vol. 43, no. 5, pp. 1235-1246, May 2008.

[38] M. Georgas, J. Orcutt, R. J. Ram, and V. Stojanovic, "A monolithically-integrated optical receiver in standard 45-nm SOI," IEEE Journal of Solid-State Circuits, vol. 47, no. 7, pp. 1693-1702, July 2012.

[39] A. Ragab, Y. Liu, K. Hu, P. Chiang, and S. Palermo, "Receiver jitter tracking characteristics in high-speed source synchronous links," Journal of Electrical and Computer Engineering, vol. 2011, Article ID 982314, 2011.

[40] S. Parikh, T. Kao, Y. Hidaka, J. Jiang, A. Toda, S. Mcleod, W. Walker, Y. Koyanagi, T. Shibuya, and J. Yamada, "A 32Gb/s wireline receiver with a low-frequency equalizer, CTLE and 2-Tap DFE in 28nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2013, pp. 28-29.

[41] F. Zhong, S. Quan, W. Liu, P. Aziz, T. Jing, J. Dong, C. Desai, H. Gao, M. Garcia, G. Hom, T. Huynh, H. Kimura, R. Kothari, L. Li, C. Liu, S. Lowrie, K. Ling, A. Malipatil, R. Narayan, T. Prokop, C. Palusa, A. Rajashekara, A. Sinha, C. Zhong, and E. Zhang, "A 1.0625–14.025 Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 46, no. 12, pp. 3126-3139, Dec. 2011.

[42] M. Mansuri, J. Jaussi, J. Kennedy, T.-C. Hsueh, S. Shekhar, G. Balamurugan, F. O'Mahony, C. Roberts, R. Mooney, and B. Casper, "A scalable 0.128–1 Tb/s, 0.8–2.6 pJ/bit, 64-lane parallel I/O in 32-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3229-3242, Dec. 2013.

[43] B. Raghavan, D. Cui, U. Singh, H. Maarefi, D. Pi, A. Vasani, Z. Huang, B. Catli, A. Momtaz, and J. Cao, "A sub-2 W 39.8–44.6 Gb/s transmitter and receiver chipset With SFI-5.2 interface in 40 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3219-3228, Dec. 2013.

[44] S. S. Mohan, M. D. M. Hershenson, S. P. Boyd, and T. H. Lee, "Bandwidth extension in CMOS with optimized on-chip inductors," IEEE Journal of Solid-State Circuits, vol. 35, no. 3, pp. 346–355, 2000.

[45] M. Park, J. Bulzacchelli, M. Beakes, and D. Friedman, "A 7Gb/s 9.3mW 2-Tap Current-Integrating DFE Receiver," in IEEE ISSCC Dig. Tech. Papers, 2007, pp. 230–599.

[46] K.-L. J. Wong, A. Rylyakov, and C.-K. K. Yang, "A 5-mW 6-Gb/s Quarter-Rate Sampling Receiver With a 2-Tap DFE Using Soft Decisions," IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 881–888, Apr. 2007.

[47] Y. Lu and E. Alon, "Design Techniques for a 66 Gb/s 46 mW 3-Tap Decision Feedback Equalizer in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.

[48] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/Deserializer," IEEE Journal of Solid-State Circuits, vol. 48, no. 3, pp. 684–697, Mar. 2013.

[49] S.-H. W. Chiang, H. Sun, and B. Razavi, "A 10-Bit 800-MHz 19-mW CMOS ADC," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 935–949, Apr. 2014.

[50] T. Toifl, M. Ruegg, R. Inti, C. Menolfi, M. Brandli, M. Kossel, P. Buchmann, P. A. Francese, and T. Morf, "A 3.1mW/Gbps 30Gbps quarter-rate triple-speculation 15-tap SC-DFE RX data path in 32nm CMOS," in 2012 Symposium on VLSI Circuits (VLSIC), 2012, pp. 102–103.

[51] T. Toifl, C. Menolfi, M. Ruegg, R. Reutemann, P. Buchmann, M. Kossel, T. Morf, J. Weiss, and M. L. Schmatz, "A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology," IEEE Journal of Solid-State Circuits, vol. 41, no. 4, pp. 954–965, Apr. 2006.

[52] M. Jeeradit, J. Kim, B. Leibowitz, P. Nikaeen, V. Wang, B. Garlepp, and C. Werner, "Characterizing sampling aperture of clocked comparators," in 2008 IEEE Symposium on VLSI Circuits, 2008, pp. 68–69.

[53] A. Agrawal, J. F. Bulzacchelli, T. O. Dickson, Y. Liu, J. A. Tierno, and D. J. Friedman, "A 19-Gb/s Serial Link Receiver With Both 4-Tap FFE and 5-Tap DFE Functions in 45-nm SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 47, no. 12, pp. 3220–3231, Dec. 2012.

[54] M. Nazari and A. Emami-Neyestanak, "A 15Gb/s 0.5mW/Gb/s 2-Tap DFE Receiver with Far-End Crosstalk Cancellation," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 446-447.

[55] J. Proesel and T. Dickson, "A 20-Gb/s, 0.66-pJ/bit Serial Receiver with 2-Stage Continuous-Time Linear Equalizer and 1-Tap Decision Feedback Equalizer in 45nm SOI CMOS," Symposium on VLSI Circuits Dig. of Tech. Papers, pp. 206-207, June 2011.

[56] K. Kaviani et al., "A 0.4-mW/Gb/s Near-Ground Receiver Front-End With Replica Transconductance Termination Calibration for a 16-Gb/s Source-Series Terminated Transceiver," IEEE Journal of Solid-State Circuits, vol. 48, no. 3, pp. 636-648, Mar. 2013.