Managing the Sensor Data Deluge
Jamini Samantaray and Scott Bushman
Big data configurations for streamlined data analysis and predictive solutions.
Over the last few years, manufacturers have become increasingly challenged to collect more sensor data, retain it for longer periods of time and use it effectively. For example, tool sensor and fault-detection production databases in most state-of-the-art fabs require 20–30 terabytes of storage capacity to retain 1–3 months of data. This sensor data deluge is likely to become even more acute as we move to sub-20nm node manufacturing and new generations of tools.
While collecting and storing this data is critical to achieving necessary yields, cycle times and costs, it is just one aspect of the data problem. Another, more important one is the need to generate significant value from that data by analyzing it at a speed and cost that positively impacts both tool performance and factory yield. Timely data analysis is absolutely essential to identify optimization opportunities significantly faster than at present.
Fortunately, advances in data management, data analysis techniques and predictive technologies offer the semiconductor industry promising new solutions to do exactly that.
Problems in Traditional Storage of Sensor Data
The analytical software used with sensor and statistical data requires users to query the database using predicates (i.e., filter conditions) based on (i) one or more tools, (ii) specific time ranges and (iii) sets of sensors, statistics, recipes, lots, wafers and the like.
However, current data-storage strategies are not optimized for querying by such predicates on big data sets. As a result, the large-scale growth of data has caused critical problems in sensor data storage and in the efficient execution of these analytic queries.
The first major problem is that getting data from hundreds, or even thousands, of tools into a central storage system requires high-performance storage systems. But with current storage technology and price points, storing hundreds of terabytes of data would significantly increase the infrastructure cost of any fab-wide equipment engineering solution (EES).
For example, for a 400TB enterprise-class central storage system, the cost per terabyte can be up to four times the cost of equivalent locally attached storage with similar redundancy.
The second problem is that most sensor data is structured and stored in the traditional row-column manner of a relational format. But at these massive volumes, this technique does not scale at the performance level required by contemporary fault detection, prediction and yield analysis applications.
Thus, the use of traditional relational data processing technology to handle large-scale data is becoming cost-prohibitive. It can seriously impact the return on investment for new generations of applications.
Addressing the Problems
Advances in data management technology over the last few years in industries where large-scale information management is a core requirement—such as social media, retail and finance—have opened up the possibility of managing sensor and other semiconductor manufacturing data much more efficiently.
One compelling solution may be Apache Hadoop, an open-source software framework for storing and processing large amounts of data in a distributed fashion on clusters of commodity hardware. The idea is to enable both massive data storage and faster processing at lower costs.
This open-source software platform primarily consists of the Hadoop Distributed File System (HDFS) and a computing framework that parallelizes computing on the distributed file system. HDFS can scale from tens to thousands of commodity servers (also known as data nodes), enabling very large sets of data to be widely distributed with locally attached storage, which significantly reduces storage costs.
When data is queried, the computing framework processes data in parallel on a large number of data nodes, which minimizes the processing time required to scan large datasets. In addition, other auxiliary technologies available on the Hadoop platform help with efficient data ingestion and storage, the querying of data using structured query language (SQL), security and similar enterprise data-processing needs.
Hadoop data storage solves several problems that are critical in the manufacturing environment. First, HDFS cost is roughly a quarter of the data storage cost of centralized systems, because the data storage can be expanded with the addition of low-cost storage devices.
Second, a larger data store allows manufacturing operations to retain and query larger datasets over a longer time frame than a traditional central repository. Some Applied Materials EES system customers are now asking for the ability to query up to two years of data, as problems of interest move from simple excursion control to deeper data analysis.
Customers are also storing more data types, including event, metrology and image data, and are requesting that this data be available alongside the typical trace and summary statistical data.
Finally, several customers have multiple fabs using EES solutions and there is a need to share and communicate results across fabs. Customers are looking for a central storage location that can be queried and mined for diagnostic results by multiple fabs.
Hadoop in Semiconductor Manufacturing
Applied Materials is currently working on multiple applications based on predictive technologies and on near-real-time data analysis to improve yield and tool performance. However, although Hadoop may provide an attractive distributed data storing and processing framework for these applications, by itself it is not sufficient to support their needs.
Nonetheless, here is a quick look at how the Hadoop framework might apply in a semiconductor manufacturing environment. For redundancy and high availability of data, Hadoop distributes data files in pre-defined block sizes across tens of data nodes, as shown schematically in Figure 1. For example, if a data file size is 256MB (megabytes) and the Hadoop block size (i.e., the minimum unit of contiguous data) is 128MB, the data is divided into two blocks of 128MB each. Redundant copies of the blocks are distributed on multiple nodes on the Hadoop cluster.
In this example, a query that requires scanning the whole file can run in two parallel processes. As the number of days of data grows, and as the number of tools that need to be accessed in a query also grows, the degree of parallelism grows as well. This greatly increases the efficiency of data retrieval.
Figure 1. Sensor data distributed across multiple nodes on a Hadoop cluster. Query processing is also distributed/parallelized on multiple nodes.
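The block arithmetic in the 256MB example can be sketched in a few lines of Python. This is an illustration only, not Hadoop code: the node names, the replication factor of 3 and the round-robin placement are assumptions made for the sketch, and real HDFS uses its own placement policy.

```python
MB = 1024 * 1024

def split_into_blocks(file_size_bytes, block_size_bytes):
    """Return the sizes (in bytes) of the blocks a file would be divided into."""
    if file_size_bytes <= 0:
        return []
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the block size
    return sizes

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# The example from the text: a 256MB file with a 128MB block size.
blocks = split_into_blocks(256 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

print(len(blocks), "blocks")
for block, holders in placement.items():
    print("block", block, "->", holders)
```

Because each block lives on more than one node, a full-file scan can be served by as many parallel readers as there are blocks, and the loss of a single node does not make a block unavailable.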
Moreover, this framework also makes it possible to break up queries that require processing of large sets of data, so that parts of queries can be run in parallel across multiple nodes. As a result, processing time for the query is greatly reduced even as data size grows.
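A minimal sketch of this idea, assuming a simple mean over sensor readings: each partition is processed independently, and only the small partial results are combined at the end. The partitions and their values are hypothetical, and Python threads stand in for data nodes; this is not how Hadoop itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for one block of sensor
# readings held on a separate data node.
partitions = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
]

def partial_stats(readings):
    """Per-partition step: reduce a block to a small (sum, count) pair."""
    return sum(readings), len(readings)

def parallel_mean(parts):
    """Run the per-partition step concurrently, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_stats, parts))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

mean_value = parallel_mean(partitions)
print(mean_value)
```

The combine step touches only one (sum, count) pair per partition, so its cost stays small even as the number of partitions grows.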
For example, assume that a query predicate is based on certain time periods and a set of tools, i.e., filtered by two columns in the database (time, tool). In addition, assume that the data of interest is from only 10 sensors out of 100 sensors stored for the set of tools.
With the Hadoop framework, the query engine scans compressed and contiguous values of the two columns (time, tool) to filter data and retrieve input from the 10 sample sensors. This greatly reduces the amount of data scanned by the query engine because it does not process data from 90% (90 out of 100) of the sensors stored for the tool.
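The effect can be illustrated with a toy column-oriented table in Python. This is a sketch under stated assumptions, not the query engine described above: the table layout, column names and values are hypothetical, and compression is omitted.

```python
NUM_SENSORS = 100
NUM_ROWS = 6

# Hypothetical column-oriented table: each column is stored as its own list.
table = {
    "time": [0, 1, 2, 3, 4, 5],
    "tool": ["A", "B", "A", "B", "A", "B"],
}
for s in range(NUM_SENSORS):
    table[f"sensor_{s:03d}"] = [float(s * 10 + row) for row in range(NUM_ROWS)]

def query(table, t_start, t_end, tools, sensors):
    """Filter on (time, tool), then read only the requested sensor columns."""
    # Step 1: scan only the two predicate columns to find matching rows.
    rows = [
        i
        for i, (t, tool) in enumerate(zip(table["time"], table["tool"]))
        if t_start <= t <= t_end and tool in tools
    ]
    # Step 2: touch only the requested sensor columns, at the matching rows.
    result = {name: [table[name][i] for i in rows] for name in sensors}
    columns_read = 2 + len(sensors)
    return rows, result, columns_read

wanted = [f"sensor_{s:03d}" for s in range(10)]  # 10 of the 100 sensors
rows, result, columns_read = query(table, 1, 4, {"A"}, wanted)
skipped_fraction = 1 - len(wanted) / NUM_SENSORS

print("matching rows:", rows)
print("columns read:", columns_read, "of", len(table))
print("sensor columns skipped:", skipped_fraction)
```

In a row-oriented layout, the same query would have to read every sensor value in each candidate row before discarding the 90 columns it does not need.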
Although Hadoop is designed for large-scale storage and analysis, the vast majority of existing data processing requirements in fabs require optimal performance on small sets of data. Typical use cases for HDFS would include the traditional reporting, simulation and configuration capabilities found within contemporary EES systems.
Also, the addition of HDFS needs to be seamlessly integrated with other EES applications. Reporting and simulation environments need to query from short-term and long-term data storage locations, merge the data, and then report through standard interfaces. Customers are not looking for a new user interface for HDFS data.
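One way to picture the merge step is sketched below in Python. The Reading record, the two in-memory stores and the rule that the short-term copy wins where the stores overlap are all assumptions made for illustration; they do not describe any particular EES product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    time: int      # hypothetical timestamp, e.g., a day number
    tool: str
    sensor: str
    value: float

def query_store(store, t_start, t_end, tool):
    """Apply the same (time, tool) predicate to one store."""
    return [r for r in store if t_start <= r.time <= t_end and r.tool == tool]

def merged_query(short_term, long_term, t_start, t_end, tool):
    """Query both stores and return one de-duplicated, time-ordered result."""
    merged = {}
    for r in query_store(long_term, t_start, t_end, tool):
        merged[(r.time, r.tool, r.sensor)] = r
    # Assumption: where both stores hold the same reading, keep the short-term copy.
    for r in query_store(short_term, t_start, t_end, tool):
        merged[(r.time, r.tool, r.sensor)] = r
    return sorted(merged.values(), key=lambda r: (r.time, r.sensor))

long_term = [
    Reading(1, "A", "pressure", 10.0),
    Reading(2, "A", "pressure", 11.0),
    Reading(3, "A", "pressure", 12.0),
    Reading(2, "B", "pressure", 50.0),
]
short_term = [
    Reading(3, "A", "pressure", 12.0),  # overlaps with the long-term store
    Reading(4, "A", "pressure", 13.0),
]

report_rows = merged_query(short_term, long_term, 1, 4, "A")
for r in report_rows:
    print(r.time, r.tool, r.sensor, r.value)
```

The caller sees a single result set, which is what allows an existing reporting interface to stay unchanged while the data behind it comes from two locations.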
For customers who have access to larger datasets made possible by HDFS, it is anticipated there will be increased interest in performing advanced data analysis activities. Exciting solutions using larger datasets include the ability to perform chamber matching and fingerprinting across multiple maintenance events and across multiple tools.
Also, several cluster analysis techniques can be applied to trace and summary statistic data (e.g., good vs. bad), and trends can be observed. Generally, this will require more than one year of data to be effective.
Hadoop Brings Some Challenges
The addition of Hadoop infrastructure as a component of the EES system is not without challenges. In particular, the industry is relatively inexperienced with Hadoop infrastructures and does not have the same deep history it has with relational database systems. The performance of queries and reporting with the Hadoop system needs to be comparable to the experience of the relational systems to ensure adoption.
In addition, the Hadoop infrastructure has requirements that differ from those of the relational system. Capabilities such as the data security model and controlled data access will need to be resolved as customers adopt these solutions.
Large-scale analytical data processing using the Hadoop platform has great potential to address the explosive data growth in the semiconductor industry. While low-cost storage and data processing on Hadoop enable the collection of vast amounts of sensor data, putting this data to use requires the development of appropriate data formats, schemas and query engines that can enable semiconductor manufacturers to best take advantage of it.
For additional information, contact firstname.lastname@example.org.