Big Data is generally characterized in terms of the three “V’s” of volume, velocity and variety. IBM is building out a big data platform for both big data at rest and big data in motion. The core components of this platform include:
- IBM InfoSphere BigInsights, an enterprise-ready Hadoop distribution from IBM,
- IBM InfoSphere Streams, a complex event processing (CEP) engine that leverages massively parallel processing capabilities to analyze big data in motion, and
- IBM Netezza, a family of data warehouse appliances that support parallel in-database analytics for fast queries against very large datasets.
This blog discusses how IBM InfoSphere provides integration and governance capabilities as part of the IBM big data platform.
1. Big Data Integration
IBM InfoSphere DataStage Version 8.7 adds the new Big Data File stage, which reads and writes multiple files in parallel from and to Hadoop, simplifying how data is merged into a common transformation process. IBM InfoSphere Data Replication supports data replication, including change data capture, which can be used to integrate rapidly changing data such as utility smart meter readings. IBM InfoSphere Federation Server supports data virtualization.
2. Big Data Search and Discovery
IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery can be used to profile structured data as part of a big data governance project. Before building a CEP application with IBM InfoSphere Streams, big data teams need to understand the characteristics of the data. For example, to support real-time analysis of Twitter data, the big data team would use the Twitter API to download a sample set of Tweets for further analysis. Profiling streaming data is similar to profiling data in traditional projects with IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery: both types of projects need to understand the characteristics of the underlying data, such as the frequency of null values. IBM also recently announced the acquisition of Vivisimo for search and discovery of unstructured content. For example, an insurance carrier can reduce average handling time by providing call center agents with searchable access to multiple document repositories for customer care, alerts, policies and customer information files.
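The kind of profiling described above can be sketched in a few lines of Python. This is an illustrative sketch, not IBM InfoSphere Information Analyzer or Discovery, and the field names in the sample records are hypothetical:

```python
def profile_null_frequency(records):
    """Fraction of records in which each field is absent or None."""
    fields = set()
    for record in records:
        fields.update(record)
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in fields
    }

# Hypothetical sample of downloaded Tweets
sample_tweets = [
    {"user": "a", "text": "hello", "geo": None},
    {"user": "b", "text": "hi",    "geo": (40.7, -74.0)},
    {"user": "c", "text": None,    "geo": None},
]
print(profile_null_frequency(sample_tweets))
```

A profile like this would tell the team, for instance, that most Tweets in the sample carry no geolocation, before any streaming application is built on that assumption.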
3. Big Data Quality
IBM InfoSphere QualityStage can be used to cleanse structured data as part of a big data governance project. In addition, developers can write MapReduce jobs in IBM InfoSphere BigInsights to address data quality issues for unstructured and semi-structured data in Hadoop. Finally, IBM InfoSphere Streams can deal with streaming data quality issues such as the rate of arrival of data. If IBM InfoSphere Streams “knows” that a particular sensor creates events every second, it can generate an alert if it receives no event for 10 seconds.
4. Metadata for Big Data
IBM InfoSphere Business Glossary and IBM InfoSphere Metadata Workbench manage business and technical metadata. The information governance team needs to extend business metadata to cover big data types. For example, the term “unique visitor” is a fundamental building block of clickstream analytics and is used to measure the number of individual users of a website. However, two sites may measure unique visitors differently, with one site counting unique visitors within a week while another one may measure unique visitors within a month.
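The "unique visitor" ambiguity can be made concrete with a short sketch: the same clickstream yields different counts depending on the measurement window, which is exactly why the business glossary must pin the definition down. The visit data below is hypothetical:

```python
from datetime import date

# Hypothetical clickstream: (visitor, visit date)
visits = [
    ("alice", date(2012, 5, 1)),
    ("alice", date(2012, 5, 9)),   # same visitor, a different week
    ("bob",   date(2012, 5, 2)),
]

def unique_visitors(visits, window):
    """Count distinct (visitor, window bucket) pairs; `window` maps a
    date to a bucket key such as (year, week) or (year, month)."""
    return len({(user, window(d)) for user, d in visits})

weekly  = unique_visitors(visits, lambda d: d.isocalendar()[:2])  # (year, week)
monthly = unique_visitors(visits, lambda d: (d.year, d.month))
print(weekly, monthly)
```

Measured weekly, alice is counted once per week she appears in; measured monthly, she is counted once, so the two sites in the example above would report different numbers from identical traffic.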
5. Master Data Management
High quality master data is a critical enabler of a big data program. For example, the risk department at a bank can use text analytics on SEC 10-K and 10-Q financial filings to update customer risk management hierarchies in real time as ownership positions change. In another example, the operations department might use sensor data to identify defective equipment but needs consistent asset nomenclature if it wants to also replace similar equipment in other locations. Finally, organizations might embark on a “social MDM” program to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. IBM has taken the first step towards unifying its diverse MDM offerings including Initiate and MDM for Product Information Management under the banner of IBM InfoSphere Master Data Management V10.
6. Reference Data Management
The importance of high quality reference data for big data cannot be overstated. For example, an information services company used a number of data sources like product images from the web, point of sale transaction logs and store circulars to improve the quality of its internal product master data. The company used web content to validate manufacturer-provided Universal Product Codes, which are 12-digit codes represented as barcodes on products in North America. The company also validated product attributes such as size: a shampoo listed as 4 oz. on the web versus 3.8 oz. in the product master. To support its master data and data quality initiatives, the company maintained reference data including thousands of unique values for color. For example, it maintained reference data to indicate that “RED,” “RD,” and “ROUGE” referred to the same color. IBM InfoSphere MDM Reference Data Management Hub manages reference data such as codes for countries, states, industries, and currencies.
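One concrete piece of the UPC validation mentioned above is the check digit: the twelfth digit of a UPC-A code is computed from the first eleven, so malformed codes can be caught without any external lookup. A minimal sketch in Python:

```python
def valid_upc_a(code):
    """True if `code` is a 12-digit UPC-A string with a correct check digit."""
    if len(code) != 12 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    # Digits in odd positions (1st, 3rd, ...) are weighted 3, digits in even
    # positions weighted 1; the total, including the check digit, must be a
    # multiple of 10.
    total = sum(d * 3 for d in digits[0::2]) + sum(digits[1::2])
    return total % 10 == 0

print(valid_upc_a("036000291452"))  # True: a commonly cited example UPC-A code
```

A check like this flags transcription errors in manufacturer-provided codes before the more expensive step of reconciling them against web content.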
7. Big Data Security and Privacy
IBM offers a number of tools for big data security and privacy:
- Data Masking – IBM InfoSphere Optim Data Masking Solution applies a variety of data transformation techniques to mask sensitive data with contextually accurate and realistic information.
- Database activity monitoring – IBM InfoSphere Guardium creates a continuous, fine-grained audit trail of all database activities, including the “who,” “what,” “when,” “where” and “how” of each transaction. This audit trail is continuously analyzed and filtered in real time to identify unauthorized or suspicious activities, including by privileged users. This solution can be deployed in a number of situations that monitor access to sensitive big data. For example, telecommunications operators can use Guardium to monitor access to sensitive call detail records, which reveal subscribers’ calling patterns. In addition, utilities can use Guardium to monitor access to smart meter readings that reveal when consumers are in and out of their homes.
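The data masking idea in the first bullet, replacing sensitive values with contextually accurate, realistic substitutes, can be sketched generically. This is an illustrative technique, not IBM InfoSphere Optim; the salt value is hypothetical and a real deployment would manage it as a secret:

```python
import hashlib

SALT = "example-salt"  # hypothetical; keep secret and stable in practice

def mask_digits(value, salt=SALT):
    """Deterministically replace each digit of a sensitive identifier with
    one derived from a salted hash, keeping separators so the masked value
    stays format-realistic (e.g. a phone number still looks like one)."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    stream = (int(c, 16) % 10 for c in digest)
    return "".join(str(next(stream)) if ch.isdigit() else ch for ch in value)
```

Because the masking is deterministic for a given salt, the same customer identifier masks to the same value everywhere it appears, which preserves joins across masked tables, one reason contextually accurate masking is preferred over simple redaction.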
8. Big Data Lifecycle Management
IBM has developed a robust big data lifecycle management platform:
- Archiving – IBM Smart Archive includes software, hardware and services offerings from IBM. The solution includes IBM InfoSphere Optim and the IBM Content Collector family for multiple data types including email, file systems, Microsoft SharePoint, SAP applications and IBM Connections. It also includes the IBM Content Manager and IBM FileNet Content Manager repositories.
- eDiscovery – IBM eDiscovery Solution enables legal teams to define evidence obligations, coordinate with IT, records, and business teams, and reduce the cost of producing large volumes of evidence in legal matters. The solution includes IBM Atlas eDiscovery Process Management that enables legal professionals to manage a legal holds workflow.
- Records and retention management – IBM Records and Retention Management helps an organization manage records according to a retention schedule.
- Test Data Management – IBM InfoSphere Optim Test Data Management Solution streamlines the creation and management of test environments, subsets data to create realistic and right-sized test databases, masks sensitive data, automates test result comparisons, and eliminates the expense and effort of maintaining multiple database clones.
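The subsetting step in the last bullet, producing a smaller test database that is still referentially intact, can be sketched with in-memory tables. This is an illustrative sketch, not IBM InfoSphere Optim Test Data Management; the table and column names are hypothetical:

```python
# Hypothetical parent and child tables
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cal"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
]

def subset(customers, orders, keep_ids):
    """Keep the chosen customers and only the orders that reference them,
    so every foreign key in the subset still resolves."""
    kept_customers = [c for c in customers if c["id"] in keep_ids]
    kept_orders = [o for o in orders if o["customer_id"] in keep_ids]
    return kept_customers, kept_orders

small_customers, small_orders = subset(customers, orders, keep_ids={1, 3})
```

A production tool walks the full foreign-key graph in both directions and combines subsetting with masking of sensitive columns, but the invariant is the same: no row in the test set may reference a row that was left behind.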
I anticipate that IBM will continue to build out its integration and governance capabilities for big data. Hopefully, this blog provides a useful reference for companies that already have or are considering IBM InfoSphere for big data. As always, comments are welcome. You can also reach me at firstname.lastname@example.org or on Twitter at @sunilsoares1.