Articles  



IMPACT: IMPrecise adders for low-power Approximate Computing
Low-power is an imperative requirement for portable multimedia devices employing various signal processing algorithms and architectures. In most multimedia applications, the final output is interpreted by human senses, which are not perfect. This fact obviates the need to produce exactly correct numerical outputs. Previous research in this context exploits error-resiliency primarily through voltage overscaling, utilizing algorithmic and architectural techniques to mitigate the resulting errors. In this paper, we propose logic complexity reduction as an alternative approach to take advantage of the relaxation of numerical accuracy. We demonstrate this concept by proposing various imprecise or approximate Full Adder (FA) cells with reduced complexity at the transistor level, and utilize them to design approximate multi-bit adders. In addition to the inherent reduction in switched capacitance, our techniques result in significantly shorter critical paths, enabling voltage scaling. We design architectures for video and image compression algorithms using the proposed approximate arithmetic units, and evaluate them to demonstrate the efficacy of our approach. Post-layout simulations indicate power savings of up to 60% and area savings of up to 37% with an insignificant loss in output quality, when compared to existing implementations.
Jan 27, 2012
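A minimal Python sketch of building a multi-bit adder from approximate full adder cells. The approximate cell used here (sum copies B, carry-out copies A, carry-in ignored) is a hypothetical simplification chosen for illustration, not necessarily one of the transistor-level cells proposed in the paper. The number of approximated low-order positions, `approx_bits`, is likewise an assumed parameter.

```python
def exact_fa(a: int, b: int, cin: int) -> tuple[int, int]:
    """Accurate full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def approx_fa(a: int, b: int, cin: int) -> tuple[int, int]:
    """Hypothetical simplified cell: ignores cin; sum copies b, carry copies a."""
    return b, a


def approx_add(x: int, y: int, width: int, approx_bits: int) -> int:
    """Ripple-carry addition of two width-bit operands. The approx_bits least
    significant positions use approx_fa; the remaining positions use exact_fa."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        cell = approx_fa if i < approx_bits else exact_fa
        s, carry = cell(a, b, carry)
        result |= s << i
    return result | (carry << width)  # final carry becomes the extra MSB


if __name__ == "__main__":
    width, k = 8, 2
    errors = [approx_add(x, y, width, k) - (x + y)
              for x in range(1 << width) for y in range(1 << width)]
    print("max |error|:", max(abs(e) for e in errors))
    print("mean |error|:", sum(abs(e) for e in errors) / len(errors))
```

Because only the low-order positions are approximated, the error stays bounded. With this particular cell, approximating the k least significant bits changes the result by at most 2^(k-1) in magnitude, so at most 2 for k = 2.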

Fast and Compact Binary-to-BCD Conversion Circuits for Decimal Multiplication
Decimal arithmetic has received considerable attention recently due to its suitability for many financial and commercial applications. In particular, numerous algorithms have recently been proposed for decimal multiplication. A major approach to decimal multiplication shaped by these proposals performs the decimal digit-by-digit multiplication in binary, converts each binary partial product back to decimal, and then adds the decimal partial products as appropriate to form the final product in decimal. With this approach, the efficiency of binary-to-BCD partial product conversion is critical for the efficiency of the overall multiplication process. A recently proposed algorithm for this conversion splits the binary partial product into two parts (i.e., two groups of bits) and then computes the contributions of the two parts to the partial BCD result in parallel. This paper proposes two new algorithms based on this principle (Three-Four split and Four-Three split). We present architectures that implement these algorithms and compare them to existing algorithms. The synthesis results show that the Three-Four split algorithm runs 15% faster and occupies 26.1% less area than the best-performing equivalent circuit found in the literature. Furthermore, the Four-Three split algorithm occupies 37.5% less area than the state-of-the-art equivalent circuit.
Jan 27, 2012
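The two-part split can be sketched in Python. A digit-by-digit product is at most 9 × 9 = 81, so it fits in 7 bits. The sketch assumes a split into the 3 high-order bits and the 4 low-order bits. It looks up the BCD contribution of each group independently (the step that hardware can do in parallel) and then merges the two contributions with a single BCD addition. The table names and the choice of which group gets 3 bits are assumptions for illustration. This is not the paper's circuit.

```python
def to_bcd(n: int) -> int:
    """Reference conversion: pack the decimal digits of n into 4-bit nibbles."""
    result, shift = 0, 0
    while True:
        result |= (n % 10) << shift
        n //= 10
        shift += 4
        if n == 0:
            return result


def bcd_add(a: int, b: int) -> int:
    """Digit-serial BCD addition: a digit sum above 9 wraps and carries."""
    result, shift, carry = 0, 0, 0
    while a or b or carry:
        d = (a & 0xF) + (b & 0xF) + carry
        carry = 1 if d > 9 else 0
        if carry:
            d -= 10
        result |= d << shift
        a, b, shift = a >> 4, b >> 4, shift + 4
    return result


# Precomputed BCD contributions of each bit group (assumed 3 high / 4 low bits).
HIGH_TABLE = [to_bcd(h << 4) for h in range(8)]   # value h * 16
LOW_TABLE = [to_bcd(l) for l in range(16)]        # value l


def split_binary_to_bcd(p: int) -> int:
    """Convert a digit-by-digit product p (0..81, 7 bits) to BCD: look up the
    two groups independently, then combine them with one BCD addition."""
    if not 0 <= p <= 81:
        raise ValueError("product of two decimal digits must be in 0..81")
    return bcd_add(HIGH_TABLE[p >> 4], LOW_TABLE[p & 0xF])


if __name__ == "__main__":
    print(hex(split_binary_to_bcd(7 * 9)))  # 63 -> 0x48 + 0x15 in BCD
```

For p = 63 the high group (3) contributes BCD 48 and the low group (15) contributes BCD 15. The BCD addition produces 0x63.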

Methodology for Local Resonant Clock Synthesis using LC-Assisted Local Clock Buffers
Resonant clocking is a form of adiabatic clocking that retains much of the energy present in clock switching and recycles it into the following clock cycle. In this paper, we present the first automated methodology for generating local resonant clocks using LC-assisted local clock buffers (LCLCB). It uses a single-buffer, single-inductor sector topology applied to non-uniform trees such as those found in most ASIC designs. We show that this form of adiabatic clocking can achieve power savings of as much as 75% over traditional buffered clock networks.
Jan 27, 2012
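As a rough sizing aid for LC-assisted resonant clocking, the sketch below uses the ideal LC-tank relation f = 1/(2π√(LC)). It computes the inductance that resonates with a given clock load capacitance. The sketch assumes a lossless tank and ignores the decoupling capacitor, wire parasitics, and the buffer itself. The 1 pF and 1 GHz values are hypothetical.

```python
import math


def resonant_frequency(inductance_h: float, capacitance_f: float) -> float:
    """Resonant frequency of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def inductance_for(freq_hz: float, capacitance_f: float) -> float:
    """Inductance that resonates with capacitance_f at freq_hz:
    L = 1 / ((2*pi*f)^2 * C)."""
    return 1.0 / ((2.0 * math.pi * freq_hz) ** 2 * capacitance_f)


if __name__ == "__main__":
    # Hypothetical sector: 1 pF of clock load driven at 1 GHz.
    L = inductance_for(1e9, 1e-12)
    print(f"required inductance: {L * 1e9:.2f} nH")  # about 25.33 nH
```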

Approximating Pareto Optimal Compiler Optimization Sequences – a Trade-off between WCET, ACET and Code Size
With the growing complexity of embedded systems software, high code quality can only be achieved using a compiler. Sophisticated compilers provide a wide spectrum of optimizations to improve code aggressively with respect to different objective functions, e.g., average-case execution time (ACET) or code size. Owing to the complex interactions between the optimizations, choosing a promising sequence of code transformations is not trivial. Compiler developers address this problem by proposing standard optimization levels, e.g., O3 or Os. However, previous studies have shown that these standard levels often miss optimization potential or might even degrade performance. In this paper, we propose the first adaptive worst-case execution time (WCET)-aware compiler framework for automatically searching for compiler optimization sequences that yield highly optimized code. Besides the objective functions ACET and code size, we consider the WCET, which is a crucial parameter for real-time systems. To find suitable trade-offs between these objectives, we exploit stochastic evolutionary multi-objective algorithms that identify Pareto optimal solutions for the objective pairs (WCET, ACET) and (WCET, code size). A comparison based on statistical performance assessments helps to determine the most suitable multi-objective optimizer. The effectiveness of our approach is demonstrated on real-life benchmarks, which show that standard optimization levels can be significantly outperformed.
Jan 12, 2012
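The notion of Pareto optimality can be illustrated with a small Python filter. It keeps only non-dominated candidates when both WCET and ACET are minimized. The candidate labels and cycle counts are hypothetical. The paper uses stochastic evolutionary multi-objective algorithms to search the space of optimization sequences; this sketch only shows the dominance test applied to candidates that have already been measured.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    sequence: str   # hypothetical label for an optimization sequence
    wcet: int       # worst-case execution time, in cycles
    acet: int       # average-case execution time, in cycles


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if a is no worse in both objectives and better in at least one."""
    return (a.wcet <= b.wcet and a.acet <= b.acet
            and (a.wcet < b.wcet or a.acet < b.acet))


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep only the non-dominated candidates (both objectives minimized)."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


if __name__ == "__main__":
    measured = [
        Candidate("seq_a", 1200, 900),
        Candidate("seq_b", 1100, 950),
        Candidate("seq_c", 1150, 1000),   # dominated by seq_b
        Candidate("seq_d", 1300, 850),
    ]
    for c in pareto_front(measured):
        print(c)
```

The quadratic pairwise comparison is fine for a handful of points. Evolutionary optimizers typically maintain such a front incrementally across generations.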

WCET-aware Register Allocation based on Integer-Linear Programming
Current compilers lack precise timing models to guide their built-in optimizations. Hence, compilers apply ad hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without a clear understanding of the impact of such spill code on a program’s runtime. This paper presents an integer-linear programming (ILP) based register allocator that uses precise worst-case execution time (WCET) models. Using this WCET timing data, the compiler avoids generating spill code along the critical path that defines a program’s WCET. To the best of our knowledge, this paper is the first to present a WCET-aware ILP-based register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 55 realistic benchmarks, we reduced WCETs by 20.2% on average and ACETs by 14% compared to a standard graph coloring allocator. Furthermore, our ILP-based register allocator outperforms a WCET-aware graph coloring allocator by more than a factor of two for the considered benchmarks, while requiring less runtime.
Jan 12, 2012
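The sketch below illustrates what a WCET-aware spill decision optimizes by exhaustively solving a toy 0-1 model. The model has one binary spill variable per program variable and a register-pressure constraint at every program point. Its objective sums spill costs that are assumed to be weighted by execution counts on the WCET path. Exhaustive enumeration stands in for an ILP solver and only scales to tiny inputs. The model, names, and numbers are hypothetical and far simpler than the paper's allocator; for example, the sketch ignores that spilling can shift the WCET path.

```python
from itertools import product


def choose_spills(live_sets, spill_cost, num_regs):
    """Exhaustively solve a toy 0-1 spill-selection problem.

    live_sets:  list of sets of variables live at each program point
    spill_cost: dict var -> cost of spilling it (assumed: extra load/store
                cycles weighted by execution counts on the WCET path)
    num_regs:   number of physical registers
    Returns (min_cost, spilled_set).
    """
    variables = sorted(spill_cost)
    best = (float("inf"), None)
    # Each tuple assigns spill[v] in {0, 1}, like the binary ILP variables.
    for bits in product((0, 1), repeat=len(variables)):
        spilled = {v for v, s in zip(variables, bits) if s}
        # Register-pressure constraint: non-spilled live variables must fit.
        if all(len(live - spilled) <= num_regs for live in live_sets):
            cost = sum(spill_cost[v] for v in spilled)
            if cost < best[0]:
                best = (cost, spilled)
    return best


if __name__ == "__main__":
    live = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
    wcet_weighted_cost = {"a": 5, "b": 1, "c": 3}   # hypothetical numbers
    print(choose_spills(live, wcet_weighted_cost, num_regs=2))
```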

Towards Network Centric Development of Embedded Systems
Nowadays, the development of embedded system hardware and related system software is mostly carried out using virtual platform environments. The high level of modeling detail (hardware elements are partially modeled in a cycle-accurate fashion) is required for many core design tasks. At the same time, the high computational complexity caused by this detailed level of simulation hinders the use of virtual platforms for modeling large networks of embedded systems. In this paper, we propose integrating virtual platforms with network simulations, combining the accuracy of virtual platforms with the versatility and scalability of network simulation tools. Such a hybrid toolchain facilitates the detailed analysis of embedded network systems and of related design aspects, such as resource effectiveness, prior to their actual deployment.
Jan 12, 2012

A Scalable VLSI Architecture for Soft-Input Soft-Output Single Tree-Search Sphere Decoding
Multiple-input multiple-output (MIMO) wireless transmission poses huge challenges for the design of efficient hardware architectures for iterative receivers. A major challenge is soft-input soft-output (SISO) MIMO demapping, which is often approached by sphere decoding (SD). In this brief, we introduce, to the best of our knowledge, the first VLSI architecture for SISO SD that applies a single tree-search approach. We compare it with a soft-output-only base architecture similar to the one proposed by Studer et al. in IEEE J-SAC 2008. The architectural modifications for soft input still allow one-node-per-cycle execution. For a 4×4 antenna system using quadrature amplitude modulation (QAM) of order 16, the area increases by 57%, and the operating frequency degrades by only 34%.
Jan 12, 2012
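The tree search underlying sphere decoding can be sketched in Python for the simplest case. This case is a hard-output, depth-first search on a real-valued model. The channel matrix R is upper triangular, as obtained from a QR decomposition, and the alphabet is the PAM-4 alphabet of one 16-QAM component. This is an assumed simplification for illustration. It has neither soft inputs nor soft outputs, and it does not reflect the paper's single tree-search SISO architecture or its one-node-per-cycle schedule.

```python
PAM4 = (-3, -1, 1, 3)  # real-valued alphabet of one 16-QAM component


def sphere_decode(R, z, alphabet=PAM4):
    """Depth-first sphere search for argmin_s ||z - R s||^2.

    R is upper triangular (e.g. from H = QR, with z = Q^T y), so the metric
    splits into per-level increments accumulated from the last row upward.
    Any branch whose partial distance already reaches the best complete
    candidate found so far is pruned, which shrinks the search sphere.
    """
    n = len(z)
    s = [0] * n
    best = {"dist": float("inf"), "s": None}

    def search(level, partial):
        if level < 0:                       # leaf: complete candidate vector
            if partial < best["dist"]:
                best["dist"], best["s"] = partial, list(s)  # shrink radius
            return
        for sym in alphabet:
            s[level] = sym
            # Rows below `level` are already decided; include them in the residual.
            residual = z[level] - sum(R[level][j] * s[j] for j in range(level, n))
            new_partial = partial + residual * residual
            if new_partial < best["dist"]:  # prune outside the current sphere
                search(level - 1, new_partial)

    search(n - 1, 0.0)
    return best["s"], best["dist"]


if __name__ == "__main__":
    R = [[2.0, 0.5], [0.0, 1.5]]   # hypothetical 2-level example
    z = [0.6, -4.4]                # R @ [1, -3] plus a small perturbation
    print(sphere_decode(R, z))
```

For a 4×4 complex MIMO system, the equivalent real-valued model has 8 levels, so the same recursion runs 8 deep.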

A Transregional Model for Near-Threshold Circuits with Application to Minimum-Energy Operation
The most energy-efficient operating point for CMOS circuits is near the threshold voltage. Conventional models are difficult to use in this region because they are piecewise and/or discontinuous around threshold. This paper proposes a simple new model for the ON-current (I_on) that is valid in the near-threshold region. A propagation delay model is derived from the ON-current. The model is applied to determine the minimum energy point for inverter chains. The transregional model matches simulated data within 15 mV, while the conventional exponential subthreshold model underestimates the supply voltage by up to 80 mV.
Jan 12, 2012
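The minimum-energy-point idea can be explored numerically. The sketch below substitutes an EKV-style interpolation, I ∝ ln²(1 + exp((V_GS - V_T)/(2nφ_t))), for the paper's transregional model, which is not reproduced here. It assumes that stage delay is proportional to C·V_DD/I_on and that leakage is I_off = I(0). It then grid-searches for the supply voltage that minimizes dynamic plus leakage energy for a hypothetical inverter chain. All parameter values (V_T = 0.4 V, n = 1.5, depth 50, activity 0.1) are assumptions.

```python
import math

PHI_T = 0.026  # thermal voltage near room temperature, in volts


def drain_current(vgs, vt=0.4, n=1.5, i0=1.0):
    """EKV-style current interpolation, continuous from subthreshold to strong
    inversion. Stand-in model only; not the paper's transregional I_on model."""
    return i0 * math.log1p(math.exp((vgs - vt) / (2 * n * PHI_T))) ** 2


def energy_per_op(vdd, depth=50, activity=0.1):
    """Energy of a hypothetical `depth`-stage chain, normalized by depth * C.

    Dynamic energy:  activity * depth * C * vdd^2
    Leakage energy:  depth * I_off * vdd * T, with T = depth * C * vdd / I_on
    (stage delay assumed proportional to C * vdd / I_on).
    """
    i_on = drain_current(vdd)
    i_off = drain_current(0.0)
    return vdd ** 2 * (activity + depth * i_off / i_on)


def minimum_energy_point(v_lo=0.1, v_hi=1.0, step=0.005):
    """Grid search for the supply voltage that minimizes energy_per_op."""
    n_steps = int(round((v_hi - v_lo) / step))
    grid = [v_lo + i * step for i in range(n_steps + 1)]
    return min(grid, key=energy_per_op)


if __name__ == "__main__":
    v_opt = minimum_energy_point()
    print(f"minimum-energy supply: {v_opt:.3f} V")
```

With these assumed parameters, the minimum falls below the 0.4 V threshold. At low V_DD, leakage energy grows as I_on collapses; at high V_DD, dynamic energy grows as V_DD². As a result, neither end of the sweep is optimal.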

Development of Radical Kinetic Behavior Investigation Method and its Application for Sticking Coefficient Estimation
We have developed a new technique for investigating the kinetic behavior of radicals. Our approach is based on applying a parallel plate structure in conjunction with numerical analysis. Investigating radical behavior is especially critical for manufacturing devices with single-nanometer gate lengths, and in particular devices with three-dimensional gates. Super-fine etching for such devices requires precisely controlled plasma processes, which can be achieved only by controlling the plasma's internal parameters. In this study, we investigate the sticking coefficient (SC) as a measure of the kinetic behavior of the hydrogen radical (H*) during the processing of ArF 193-nm photoresist. Using this new technique, we obtained a preliminary SC value of 0.2 for H* radicals on the photoresist.
Dec 19, 2011

Investigation of Surface Reactions in ArF Photoresist by using Parallel Plate Structure in Conjunction with Numerical Analysis
Predictive modeling of resist profiles and etched-pattern profiles is one of the bottlenecks in semiconductor process simulation. Accurate reaction models for chemically amplified resists (CAR), which include surface modifications such as line edge roughness, are needed and must be capable of correctly predicting three-dimensional resist patterns. Photoresist patterns must be evaluated with respect to precise critical dimension (CD) and etch resistance. Because polymer-size effects on CD control, including line edge roughness and line width variation, are increasingly important, there is a growing need for resist studies based on mesoscopic models and stochastic modeling. Although the algorithms for modeling plasma etching seem to be mature, their capability for quantitative prediction strongly depends on fundamental physical and chemical data on surface reactions. The lack of radical kinetic data is an especially important gap. In our research, we investigated the kinetic behavior of radicals related to surface reactions in ArF photoresist (PR) by using a parallel plate structure supported by numerical analysis.
Dec 19, 2011

Radical Flux Modeling and Analysis for Sticking Coefficient Evaluation
In plasma processes, kinetic data for radicals are the most crucial ingredient for predictive simulation. The most neglected area in this regard is fundamental data for gas-surface or plasma-surface interactions, namely the radical sticking coefficient (SC), which depends on the material surface. An ab initio approach to surface interactions requires excessive computing resources. Experimental validation assisted by computational analysis is therefore not only cost-effective but also indispensable for identifying the principal reaction mechanism.
Dec 19, 2011

Performance Estimation of Carbon Nanowall-based Field Effect Transistor by 3D Simulation Study
We have developed techniques for manufacturing localized, vertically standing carbon nanowalls (CNWs), as well as methods for controlling their electrical properties. A CNW is a bundle of domains consisting of nanographenes, and its thickness is several nanometers. By incorporating this material into the channel, we have proposed a new device structure for future generations of CMOS technology. The electrical parameters of the device have been evaluated in a 3D TCAD simulation study. To ensure the reliability of the simulated data, we defined a new material file based on realistic parameters modified from the data for graphene nanoribbons (GNRs). The simulation results shown in this paper lead to the conclusion that device performance can be improved by widening the graphene energy bandgap.
Dec 19, 2011

Enhanced Electrostatic Integrity of Short Channel Junctionless Transistor with High-k Spacers
We propose, for the first time, the use of a high-κ spacer to improve the electrostatic integrity, and thereby the scalability, of silicon junctionless transistors (JLTs). Using extensive simulations of n-channel JLTs, we demonstrate that high-κ spacers improve the electrostatic integrity of JLTs at sub-22-nm gate lengths. In the OFF-state, the electric field that fringes through the high-κ spacer to the device layer on either side of the gate increases the effective electrical gate length. However, the effective gate length is unaffected in the ON-state. Hence, a high-κ spacer reduces the OFF-state leakage current by several orders of magnitude, with concomitant improvements in the subthreshold swing and drain-induced barrier lowering. A marginal improvement in the ON-state current is also observed with the high-κ spacer. This improvement is related to the reduction in parasitic resistance in the silicon under the spacer due to fringe fields.
Dec 19, 2011

Leveraging UPF-Extracted Assertions for Modeling and Formal Verification of Architectural Power Intent
Recent research has indicated ways of using UPF specifications for extracting valid low-level control sequences to express the transitions between the power states of individual domains. Today there is a disconnect between the high-level architectural power management strategy which relates multiple power domains and these low-level assertions for controlling individual power domains. In this paper we attempt to bridge this disconnect by leveraging the low-level per-domain assertions for translating architectural power intent properties into global assertions over low-level signals. We show that the inter-domain properties created in this manner can be formally verified over the global power management logic.
Dec 06, 2011
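Per-domain assertions can be combined with an architectural property in a simulation-style analogue, sketched here in Python over a recorded trace. Both properties and all signal names (`cpu_pwr`, `cpu_iso`, `mem_pwr`) are hypothetical. A trace check like this examines only one execution, whereas the paper verifies the global properties formally over the power management logic.

```python
def iso_before_off(trace, pwr, iso):
    """Per-domain assertion (hypothetical): when `pwr` falls, `iso` must
    already have been high in the preceding cycle.
    Returns the first violating cycle index, or None."""
    for t in range(1, len(trace)):
        fell = trace[t - 1][pwr] and not trace[t][pwr]
        if fell and not trace[t - 1][iso]:
            return t
    return None


def powered_only_with_parent(trace, child_pwr, parent_pwr):
    """Global architectural property (hypothetical): the child domain may be
    on only while its parent domain is on. Returns first violating cycle or None."""
    for t, cycle in enumerate(trace):
        if cycle[child_pwr] and not cycle[parent_pwr]:
            return t
    return None


def check_power_intent(trace):
    """Architectural intent expressed over low-level per-domain signals."""
    return {
        "cpu_iso_before_off": iso_before_off(trace, "cpu_pwr", "cpu_iso"),
        "cpu_needs_mem": powered_only_with_parent(trace, "cpu_pwr", "mem_pwr"),
    }


if __name__ == "__main__":
    good = [
        {"cpu_pwr": 1, "cpu_iso": 0, "mem_pwr": 1},
        {"cpu_pwr": 1, "cpu_iso": 1, "mem_pwr": 1},  # isolate first
        {"cpu_pwr": 0, "cpu_iso": 1, "mem_pwr": 1},  # then switch the CPU off
        {"cpu_pwr": 0, "cpu_iso": 1, "mem_pwr": 0},  # memory goes off last
    ]
    bad = [
        {"cpu_pwr": 1, "cpu_iso": 0, "mem_pwr": 1},
        {"cpu_pwr": 1, "cpu_iso": 0, "mem_pwr": 0},  # memory off under a live CPU
        {"cpu_pwr": 0, "cpu_iso": 0, "mem_pwr": 0},  # CPU off without isolation
    ]
    print(check_power_intent(good))
    print(check_power_intent(bad))
```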

Coverage Management with Inline Assertions and Formal Test Points
This paper studies the problem of coverage management with two emerging formalisms in simulation-based validation: the formal specification of test points and the use of inline temporal assertions. We present methods for checking whether a test bench with inline assertions covers a set of formal test points. This is particularly useful in developing verification IPs for standard on-chip protocols, where the development team must make sure that the test bench provided in the verification IP checks all the important aspects of the protocol. We demonstrate the efficacy of our approach on the ARM AMBA verification IP.
Dec 06, 2011

Inline Assertions – Embedding Formal Properties in a Test Bench
The scope of immediate assertions in SystemVerilog is restricted to Boolean properties, whereas temporal properties are specified as concurrent assertions. Concurrent assertion statements can also be embedded in a procedural block. These are known as procedural concurrent assertions and can be used only in restricted situations. This paper introduces the notion of inline assertions, which generalizes the embedding of temporal properties within the procedural code of a test bench. The paper proposes verification methodologies for inline assertions and compares them with the traditional approaches of formal property verification and dynamic assertion-based verification. The paper also addresses coverage-related issues that arise when the intent of a concurrent assertion is modeled as an inline assertion.
Dec 06, 2011
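The inline-assertion idea can be mimicked in a Python test bench. The temporal check "an acknowledge arrives within `max_wait` cycles of each request" is written inside the procedural stimulus code, at the point where it applies, rather than as a separate, always-active concurrent property. `ToyDut`, its fixed latency, and `max_wait` are hypothetical. This is an analogy only; it is neither SystemVerilog syntax nor the paper's methodology.

```python
class ToyDut:
    """Hypothetical device model: raises `ack` a fixed number of cycles after a request."""

    def __init__(self, latency):
        self.latency = latency
        self.countdown = None
        self.ack = False

    def request(self):
        self.countdown = self.latency
        self.ack = False

    def step(self):
        """Advance one clock cycle."""
        if self.countdown is not None:
            self.countdown -= 1
            if self.countdown <= 0:
                self.ack = True
                self.countdown = None


def run_testbench(dut, transactions=4, max_wait=3):
    """Procedural stimulus with an inline temporal check after each request."""
    completed = 0
    for txn in range(transactions):
        dut.request()
        # Inline assertion: placed at this point in the procedural flow, it
        # checks "ack within max_wait cycles of this request".
        for _ in range(max_wait):
            dut.step()
            if dut.ack:
                break
        else:
            raise AssertionError(f"transaction {txn}: no ack within {max_wait} cycles")
        completed += 1
    return completed


if __name__ == "__main__":
    print(run_testbench(ToyDut(latency=2)))  # prints 4
    try:
        run_testbench(ToyDut(latency=5))
    except AssertionError as err:
        print("inline assertion fired:", err)
```

Because the check is tied to a specific request in the procedural flow, it knows which transaction it belongs to. A free-running concurrent property would have to reconstruct that context.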

Cohesive Coverage Management for Simulation and Formal Property Verification
Relating formal verification coverage and simulation coverage is a challenge in pre-silicon validation. In this paper we propose the use of a test plan language as a formal basis for unifying the coverage goals for simulation and formal property verification. We present methods for computing the coverage of test points individually through simulation and formal property verification and for using the coverage due to one to ease the verification effort on the other. We demonstrate the efficiency of our approach through a study of the ARM AMBA pre-silicon verification plan.
Dec 06, 2011
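The sketch below computes which test points a set of simulation runs exercises. Each test point is modeled as a predicate over a trace and a cycle index, so it can inspect neighboring cycles. The AHB-flavoured signal names (`htrans`, `hwrite`) and the three test points are hypothetical. The paper's test plan language and its formal coverage computation are not reproduced here.

```python
def coverage(test_points, traces):
    """Split test-point names into (covered, uncovered) over all traces.

    test_points: dict name -> predicate(trace, t) evaluated at cycle t
    traces:      list of simulation traces, each a list of per-cycle signal dicts
    """
    covered = {
        name
        for name, hit in test_points.items()
        if any(hit(trace, t) for trace in traces for t in range(len(trace)))
    }
    return covered, set(test_points) - covered


# Hypothetical test points; each may look at neighboring cycles.
TEST_POINTS = {
    "write": lambda tr, t: tr[t]["hwrite"] == 1,
    "burst": lambda tr, t: (t + 1 < len(tr) and tr[t]["htrans"] == "NONSEQ"
                            and tr[t + 1]["htrans"] == "SEQ"),
    "busy_then_idle": lambda tr, t: (t > 0 and tr[t - 1]["htrans"] == "BUSY"
                                     and tr[t]["htrans"] == "IDLE"),
}

if __name__ == "__main__":
    runs = [[
        {"htrans": "NONSEQ", "hwrite": 1},
        {"htrans": "SEQ", "hwrite": 1},
        {"htrans": "IDLE", "hwrite": 0},
    ]]
    covered, missing = coverage(TEST_POINTS, runs)
    print("covered:", sorted(covered))
    print("missing:", sorted(missing))
```

An uncovered test point such as `busy_then_idle` tells the verification engineer where to add stimulus or, as the abstract suggests, where to hand the goal to formal property verification instead.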

A Novel Radiation Tolerant SRAM Design Based on Synergetic Functional Component Separation for Nanoscale
This paper presents a novel SRAM design for nanoscale CMOS. The new design addresses the low radiation tolerance and high instability of SRAM memories at the 32 nm feature size. The novelty of our approach lies in synergetic functional component separation, where each component serves its own operational function and has minimal effect on the performance of the others. The design consists of three components: the first stores the data, the second protects the data at the most vulnerable state, and the last extracts the data from the SRAM cell. We compared our design against conventional radiation-tolerant designs in terms of power consumption, radiation tolerance, performance, area, and stability. The benefits of our new design (high radiation tolerance, high stability, fast performance) were confirmed by extensive simulations in different 32 nm technology environments (low power, high performance, bulk).
Dec 06, 2011

Noise Margin, Critical Charge and Power-delay Tradeoffs for SRAM Design
Aggressive technology scaling has reduced the stability of classic SRAM designs. This is especially problematic for large integrated circuits. The stability of SRAM cells can be affected by noise during a read operation and by radiation during standby mode. In this paper, we address the gradual stability reduction in SRAM designs with a design-tradeoff approach that improves SRAM characteristics by modulating the transistor sizing ratio β. We test our approach on various SRAM designs in 32 nm technology, optimizing them with β under various constraints on power consumption, performance, radiation tolerance, and data stability. We discuss the design trends revealed by this extensive analysis.
Dec 06, 2011

A Recursive-Divide Architecture for Multiplication and Division
Multipliers have been key, critical components of most application-specific and general-purpose computer architectures. However, these architectures have been transitioning toward multiple cores that can process large amounts of data through parallel approaches to computation. Unfortunately, traditional arithmetic functional units that worked well for single-core architectures have the side effect of incurring large amounts of area and power. Consequently, multi-core computer architectures need new ways of increasing throughput to handle large amounts of data. This paper presents a recursive high-radix divide unit that is modified to handle both multiplication and division and is targeted at multi-core architectures. Results obtained in a 65 nm technology show a significant decrease in area and power. The unit still maintains low total latency by utilizing high-radix encoding within the functional unit. More importantly, because the datapath unit requires complex recoding, it does not increase its latency as the bit size increases. Therefore, these units can occupy little area while still providing high processing capability.
Dec 06, 2011
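A generic high-radix digit-recurrence divider can be sketched in Python. Each iteration shifts in one radix-2^k digit of the dividend and selects one quotient digit, so a w-bit division takes ⌈w/k⌉ iterations instead of w. This is a textbook restoring scheme with a simple comparison-based digit selection, assumed for illustration. It is not the paper's recursive architecture, and it does not show the multiplication mode.

```python
def high_radix_divide(dividend, divisor, width=16, radix_bits=2):
    """Restoring digit-recurrence division retiring radix_bits quotient bits
    per iteration (radix 2**radix_bits). Returns (quotient, remainder)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    radix = 1 << radix_bits
    steps = -(-width // radix_bits)          # ceil(width / radix_bits)
    remainder = quotient = 0
    for i in reversed(range(steps)):
        # Shift in the next radix-digit of the dividend.
        chunk = (dividend >> (i * radix_bits)) & (radix - 1)
        remainder = (remainder << radix_bits) | chunk
        # Quotient-digit selection: largest q in [0, radix) with q*divisor <= remainder.
        # Since remainder < divisor * radix here, q always fits in one digit.
        # Hardware would use parallel comparators or a lookup table, not a loop.
        q = 0
        while q + 1 < radix and (q + 1) * divisor <= remainder:
            q += 1
        remainder -= q * divisor
        quotient = (quotient << radix_bits) | q
    return quotient, remainder


if __name__ == "__main__":
    print(high_radix_divide(1000, 7, width=16, radix_bits=2))  # (142, 6)
```

Raising `radix_bits` cuts the iteration count but widens the digit-selection logic. That trade-off between latency and area is central to high-radix designs.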

Reliability Analysis of Power Gated SRAM under Combined Effects of NBTI and PBTI in Nano-Scale CMOS
Transistor aging effects (NBTI and PBTI) impact the reliability of SRAM in nanoscale CMOS technologies. In this research, the combined effect of NBTI and PBTI on power-gated SRAM is analyzed. Optimal source biasing in standby mode is presented as an effective method for guard-banding against the aging effects. Simulation results in a predictive 32 nm technology show a maximum reduction of 1.6% in standby SNM over a 5-year lifetime. For optimum operation, decreasing the standby source bias voltage by only 0.012 V safely margins the SNM for a 5-year lifetime. This guard-banding comes at an insignificant power overhead of 0.6% for the applied worst-case scenarios. Adaptive techniques, by contrast, carry non-negligible complexity and overhead. Given both points, it is concluded that adaptive tuning of the source biasing voltage is not required.
Nov 08, 2011

Incorporating Effects of Process, Voltage, and Temperature Variation in BTI Model for Circuit Design
Bias Temperature Instability (BTI) is a major reliability issue in nanoscale CMOS circuits. The BTI effect increases the threshold voltage of MOS devices over time. The BTI effect depends on Process, Voltage, and Temperature (PVT), and nanoscale CMOS exhibits significant PVT variations. We therefore propose a method of combining the effects of PVT variations and the BTI effect for circuit analysis. We investigate the PVT dependence of the BTI effect using a ring oscillator as a test bench for logic circuits and an SRAM cell as a test bench for memory circuits. The results show that low-threshold-voltage circuits at high temperature experience the worst impact of aging effects. However, the bias dependence of the aging impact may vary from circuit to circuit and from metric to metric. It depends on how the circuit's sensitivity to the BTI-induced threshold voltage shift changes with biasing.
Nov 08, 2011

New SRAM Design Using Body Bias Technique for Ultra Low Power Applications
A new SRAM design is proposed. Body biasing improves the static noise margin (SNM) by at least 15% compared to standard cells. This technique makes it possible to lower the supply voltage. The SRAM cell operates at a 0.3 V supply voltage, offering an SNM improvement of 22% for the read cycle. The write margin is not affected by the body biasing technique. Simulations use 65 nm ST models.
Nov 08, 2011

Full-Custom Design Project for Digital VLSI and IC Design Courses using Synopsys Generic 90nm CMOS Library
We have developed a full-custom IC design flow based on Synopsys custom design tools and the recently released Synopsys 90 nm generic library. The flow can be used for teaching VLSI and digital IC design courses. We have also developed a full-custom design project that was used as the course project in the “Digital VLSI Design” course at San Francisco State University. The project is to design a 4-bit ripple carry adder in a full-custom fashion, from schematic to layout, in the generic 90 nm CMOS technology. The design flow and the course project provide a very effective hands-on approach to teaching digital IC design and VLSI design in advanced CMOS technologies. The team project was run as a competition, which generated great enthusiasm and motivation among the students and enhanced their learning experience. The goal of the competition was to achieve the best design quality, defined as the product of the following design metrics for the 4-bit ripple carry adder: propagation delay, power dissipation, and layout area. The winning team achieved a delay of 82.2 ps, a power dissipation of 30.7 µW, and a layout area of 112.8 µm² for the 4-bit adder.
Nov 08, 2011
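The competition's quality metric is a plain product, so it can be computed directly. Smaller values are assumed to be better, since all three metrics are costs. The function name is hypothetical.

```python
def design_quality(delay_ps: float, power_uw: float, area_um2: float) -> float:
    """Delay x power x area, in ps*uW*um^2 (assumed: lower is better)."""
    return delay_ps * power_uw * area_um2


if __name__ == "__main__":
    winning = design_quality(82.2, 30.7, 112.8)
    print(f"{winning:.1f} ps*uW*um^2")
```

For the winning figures, this gives about 2.85 × 10^5 ps·µW·µm². The delay-power part alone, 82.2 ps × 30.7 µW ≈ 2.52 fJ, is an energy, so the metric can also be read as energy times area.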



