The Implications of Facebook’s Platform Policies on Master Data Management

Organizations need to review Facebook’s Platform Policies before attempting to integrate Facebook data with customer master data. While this blog post is not intended to provide legal advice, I have mapped the relevant Facebook Platform Policies as of August 8, 2012 to their implications for master data management.

Topic 1
Relevant Facebook Platform Policy as of August 8, 2012: A user’s friends’ data can only be used in the context of the user’s experience on your application.
Implications for Master Data Management: Organizations cannot use data on a person’s friends outside of the context of the Facebook application (e.g., using Facebook friends to add new relationships within MDM).

Topic 2
Relevant Facebook Platform Policy as of August 8, 2012: Subject to certain restrictions, including on transfer, users give you their basic account information when they connect with your application. For all other data obtained through use of the Facebook API, you must obtain explicit consent from the user who provided the data to us before using it for any purpose other than displaying it back to the user on your application.
Implications for Master Data Management: Organizations need to obtain explicit consent from the user before using any information other than basic account information (name, email, gender, birthday, current city, and the URL of the profile picture).

Topic 3
Relevant Facebook Platform Policy as of August 8, 2012: You will not use Facebook user IDs for any purpose outside your application (e.g., your infrastructure, code, or services necessary to build and run your application). Facebook user IDs may be used with external services that you use to build and run your application, such as a web infrastructure service or a distributed computing platform, but only if those services are necessary to running your application and the service has a contractual obligation with you to keep Facebook user IDs confidential.
Implications for Master Data Management: Organizations can use Facebook user IDs within MDM to power a Facebook app. However, organizations cannot use these Facebook IDs outside of the context of a Facebook app.

Topic 4
Relevant Facebook Platform Policy as of August 8, 2012: If you stop using Platform or we disable your application, you must delete all data you have received through use of the Facebook API unless: (a) it is basic account information; or (b) you have received explicit consent from the user to retain their data.
Implications for Master Data Management: Organizations need to be very careful about merging Facebook data with other data within their MDM environment. Consider a situation where an organization merged “married to” information from a user’s Facebook profile into their MDM system. If the organization stops using the Facebook Platform, it will need to obtain explicit permission from the user to retain this information. This can be problematic when the organization has merged Facebook data into a golden copy that has been propagated across the enterprise.

Topic 5
Relevant Facebook Platform Policy as of August 8, 2012: You cannot use a user’s friend list outside of your application, even if a user consents to such use, but you can use connections between users who have both connected to your application.
Implications for Master Data Management: Similar issues to topic 1 above.

Topic 6
Relevant Facebook Platform Policy as of August 8, 2012: You will delete all data you receive from us concerning a user if the user asks you to do so, and will provide an easily accessible mechanism for users to make such a request. We may require you to delete data you receive from the Facebook API if you violate our terms.
Implications for Master Data Management: Similar issues to topic 4 above.
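One way to make the deletion requirements in topics 4 and 6 manageable is to record the source of every attribute in the golden record, so that Facebook-sourced values can be located and removed later. The following is only a minimal sketch of that idea in Python. The class names, the attribute names, and the consent flag are all hypothetical and do not come from any MDM product or from the Facebook API.

```python
from dataclasses import dataclass, field

# Hypothetical attribute names for the "basic account information" listed in topic 2.
BASIC_ACCOUNT_FIELDS = {
    "name", "email", "gender", "birthday", "current_city", "profile_picture_url",
}


@dataclass
class Attribute:
    value: str
    source: str                      # e.g. "facebook" or "crm" (illustrative labels)
    consent_to_retain: bool = False  # explicit user consent to keep the value


@dataclass
class GoldenRecord:
    customer_id: str
    attributes: dict = field(default_factory=dict)

    def purge_facebook_data(self):
        """Remove Facebook-sourced attributes that are neither basic account
        information nor covered by explicit consent. Returns the removed names."""
        removed = []
        for name in list(self.attributes):
            attr = self.attributes[name]
            if (attr.source == "facebook"
                    and name not in BASIC_ACCOUNT_FIELDS
                    and not attr.consent_to_retain):
                removed.append(name)
                del self.attributes[name]
        return removed
```

Without this kind of attribute-level lineage, a value such as “married to” that has been merged into the golden copy and propagated downstream is very hard to find again.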



Big Data Reference Architecture

A Reference Architecture for Big Data must include a Focus on Governance and Integration with an Organization’s Existing Infrastructure

 Reference architecture for big data.

There is a lot of hype about technologies like Apache Hadoop and NoSQL because of their ability to help organizations gain insights from vast quantities of high velocity, semi-structured, and unstructured data in a cost-effective manner. However, big data does not give IT a license to “rip and replace,” and CIOs want to understand how these technologies will interact with the organization’s technical architecture. The figure above describes a reference architecture for big data. I will discuss each component of the reference architecture in this article.

I want to provide just one caveat before I get started. A number of vendors provide a dizzying array of offerings for big data. It is not possible for me to cover every vendor and every offering in this article.

1.      Big Data Sources

Big data types include web and social media, machine-to-machine, big transaction data, biometrics, and human generated data. This data may be in structured, unstructured, and semi-structured formats.

2.      Hadoop Distributions

Because it consists of a bewildering array of technologies with their own release schedules, Hadoop can be somewhat intimidating to the novice user. A number of vendors have created their own commercial distributions of Apache Hadoop that have undergone release testing, and bundle product support and training. Most enterprises that have deployed Hadoop for commercial use have selected one of the Hadoop distributions. Standalone vendors who offer Hadoop distributions include Cloudera, MapR, and Hortonworks. In addition, IBM offers a Hadoop distribution called InfoSphere BigInsights. Amazon Web Services offers a Hadoop framework that is part of a hosted web service called Amazon Elastic MapReduce. EMC offers a Hadoop distribution called Greenplum HD. Microsoft has also announced the availability of the community technology preview of its Hadoop distribution as a cloud-based service on Windows Azure as well as an on-premise version on Windows Server.

3.      Streaming Analytics

Hadoop is well suited to handle large volumes of data at rest. However, big data also involves high velocity data in motion. Streaming analytics, also known as complex event processing (CEP), refers to a class of technologies that leverage massively parallel processing capabilities to analyze data in motion as opposed to landing large volumes of data to disk. There are a number of open source and vendor tools in this space. For example, Apache Flume is an incubator effort that uses streaming data flows to collect, aggregate, and move large volumes of data into the Hadoop Distributed File System (HDFS). IBM offers a tool called InfoSphere Streams that grew out of early work with the United States government. StreamBase, SAP Sybase Event Stream Processor, and Informatica RulePoint also offer CEP engines.

4.      Databases

Enterprises have the ability to select from multiple database approaches:


NoSQL (“not only SQL”) databases are a category of database management systems that do not use SQL as their primary query language. These databases may not require fixed table schemas, and typically do not support join operations. These databases are optimized for highly scalable read-write operations rather than for consistency. NoSQL databases include a vast array of offerings such as Apache HBase, Apache Cassandra, MongoDB, Apache CouchDB, Couchbase, Riak, and Amazon DynamoDB, which forms part of the Amazon Web Services software as a service platform. DataStax offers an Enterprise edition that includes a Hadoop distribution and replaces HDFS with CassandraFS.


In-memory database management systems rely on main memory for data storage. Compared to traditional database management systems that store data to disk, in-memory databases are optimized for speed. In-memory databases will become increasingly important as organizations seek to process and analyze massive volumes of big data. SAP HANA, Oracle TimesTen In-Memory Database, and IBM solidDB are all examples of in-memory databases.

Apache Sqoop is a tool that allows bulk transfer of data between Hadoop and relational databases. In addition, software vendors are also upgrading their database offerings to co-exist with Hadoop as shown below:

  • Oracle – The Oracle Loader for Hadoop uses MapReduce jobs to create data sets that are optimized for loading and analytics within the Oracle relational databases. Oracle Loader for Hadoop uses the CPUs in the Hadoop cluster to format the data for Oracle relational databases. The Oracle Direct Connector for HDFS allows high-speed access to HDFS data from an Oracle database. The data stored in HDFS can then be queried via SQL in conjunction with data within the Oracle relational database.
  • IBM – IBM InfoSphere BigInsights includes a set of Java-based user-defined functions (UDFs) that enable integration with IBM DB2 using SQL.
  • Microsoft – Microsoft offers a bi-directional Hadoop connector for SQL Server.


Legacy database management systems rely on non-relational approaches to database management. Vendors will increasingly re-tool these systems to support big data. For example, the IBM DB2 Analytics Accelerator for z/OS leverages the IBM Netezza appliance to speed up queries issued against a mainframe-based data warehouse running IBM DB2 for z/OS.

5.      Big Data Integration

Big data integration technologies fall into a few different categories:

Bulk data movement

Bulk data movement includes technologies such as ETL to extract data from one or more data sources, transform the data, and load the data into a target database. IBM InfoSphere DataStage, Version 8.7 adds the new Big Data File stage that supports reading and writing multiple files in parallel from and to Hadoop. Informatica PowerCenter has connectors for Twitter, Facebook, and LinkedIn. Informatica PowerExchange has released a Hadoop adapter that moves data from source systems into HDFS and out of HDFS into business intelligence and data warehousing environments. Informatica HParser is a data transformation tool optimized for Hadoop. Informatica’s intention is to allow users to design data integration tasks in HParser and then run them natively on Hadoop without coding. Open source data integration vendors such as Pentaho and Talend are also capturing market share with customers who like their cost-effective offerings and Hadoop integration.

Data replication

Replication technologies like change data capture can capture big data, such as utility smart meter readings, in near real time with minimal impact on system performance. Replication tools include IBM InfoSphere Data Replication and Oracle GoldenGate. Informatica’s data replication tools, including Fast Clone and Data Replication, offer high-volume replication of data to and from Hadoop.

Data virtualization
Data virtualization is also known as data federation. Data virtualization allows an application to issue SQL queries against a virtual view of data in heterogeneous sources such as in relational databases, XML documents, and on the mainframe. Vendors include IBM (InfoSphere Federation Server), Informatica (Data Services), Denodo, and Composite Software.

6.      Text Analytics

Organizations increasingly want to derive insights from large volumes of unstructured content within call center agents’ notes, social media, IT logs, and medical records. Text analytics is a method for extracting usable knowledge from unstructured text data through the identification of core concepts, sentiments, and trends, and then using this knowledge to support decision-making. SAS Text Analytics and Oracle Endeca Information Discovery offer text analytics capabilities. IBM’s text analytics capabilities are embedded in a number of products including IBM SPSS Text Analytics for Surveys, IBM InfoSphere BigInsights, IBM InfoSphere Streams, IBM Cognos Consumer Insight, IBM Content Analytics, IBM Content and Predictive Analytics for Healthcare, and IBM eDiscovery Analyzer. Clarabridge is a standalone vendor that offers text analytics of surveys, emails, social media, and call center agents’ notes to support customer experience analytics.

7.      Big Data Discovery

Vendor tools such as IBM InfoSphere Discovery and Information Analyzer, Oracle Enterprise Data Quality Profile and Audit, Informatica Data Explorer, Trillium Software’s TS Discovery, SAP BusinessObjects Data Services, and SAS DataFlux Data Management Studio support traditional data profiling and discovery projects for structured data at rest. Informatica has also announced its intention to release native Hadoop capabilities for data discovery. We anticipate that other vendors will follow suit. Organizations also need to consider tools for search and discovery of unstructured data. These tools include Oracle Endeca Information Discovery, IBM Vivisimo, and the Google Search Appliance.

8.      Big Data Quality

Data quality management is a discipline that includes the methods to measure and improve the quality and integrity of an organization’s data. Traditional data quality tools include IBM InfoSphere QualityStage, Informatica Data Quality, Oracle Enterprise Data Quality, Trillium Software TS Quality, SAS DataFlux Data Management Studio, and SAP BusinessObjects Data Quality Management. However, big data quality will require radically different approaches from a technology perspective. For example, organizations may need to consider the following approaches:

  • Address data quality natively within Hadoop. Informatica has announced its intention to release native Hadoop capabilities for data quality. We anticipate that other vendors will follow suit.
  • Leverage unstructured content to improve the quality of sparse data. For example, a hospital used text analytics to improve the quality of structured data attributes such as “smoker” and “drug and alcohol abuse.” As a result, the hospital improved its ability to identify patients who were most likely to be readmitted within 30 days of treatment for congestive heart failure.
  • Use CEP to improve data quality in real-time without landing data to disk. For example, a telecommunications operator used CEP to de-duplicate telecommunications call detail records in real time, a process known as mediation.
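To make the last approach more concrete, here is a rough sketch in Python of de-duplicating call detail records as they stream past, without first landing them to disk. It is not how any particular CEP engine works. The field names (caller, callee, start_time) and the bounded cache of recently seen keys are assumptions chosen purely for illustration.

```python
from collections import OrderedDict


def deduplicate(records, key_fields=("caller", "callee", "start_time"), max_keys=100_000):
    """Yield each record the first time its key is seen.

    The field names and the cache size are hypothetical. A bounded cache keeps
    memory use flat on an unbounded stream, at the cost of missing duplicates
    that arrive after their key has been evicted.
    """
    seen = OrderedDict()
    for record in records:
        key = tuple(record[name] for name in key_fields)
        if key in seen:
            continue  # duplicate record: drop it
        seen[key] = True
        if len(seen) > max_keys:
            seen.popitem(last=False)  # evict the oldest key
        yield record
```

Because the function is a generator, it can sit between a record source and a downstream consumer and process one record at a time.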

9.      Metadata

Metadata is information that describes the characteristics of any data artifact, such as its name, location, perceived importance, quality, or value to the enterprise, and its relationships to other data artifacts that the enterprise deems worth managing. Big data expands the volume, velocity, and variety of information while adding new challenges in building and maintaining a coherent metadata infrastructure. The HCatalog project (formerly known as Howl) is now part of the Apache Incubator. HCatalog is built on top of the Hive metastore and aims to address the lack of metadata support within Hadoop. A number of vendors have metadata offerings including IBM InfoSphere Business Glossary and Metadata Workbench, Informatica Metadata Manager and Business Glossary, Adaptive Metadata Manager, and ASG-Rochade. Organizations need to add big data-related business terms to their business glossaries. As organizations store more of their data within Hadoop, they will need to address data lineage and impact analysis within this environment as well.

10.  Information Policy Management

Information governance is all about managing information policies. Whether they recognize it or not, organizations grapple with five important processes relating to information policies:

i.        Documenting policies relating to data quality, metadata, privacy, and information lifecycle management. For example, a big data policy might state that call center agents should not record social security numbers in their notes.

ii.      Assigning roles and responsibilities such as data stewards, data sponsors, and data custodians.

iii.    Monitoring compliance with the data policy. In the abovementioned example, the organization might use text analytics tools to identify instances where call center agents’ notes contain social security numbers.

iv.     Defining acceptable thresholds for data issues. In the example, the information governance team might determine that the acceptable threshold needs to be zero instances because of the potential privacy implications of having social security numbers in clear text.

v.       Managing issues especially those that are long-lived and affect multiple functions and lines of business. Taking the example further, the information governance team might create a number of trouble tickets so that the customer service team can eliminate any mentions of social security numbers within agents’ notes.
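The social security number example that runs through these five processes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the regular expression matches one common written format (three digits, two digits, four digits separated by hyphens), and the function names and note identifiers are hypothetical. A commercial text analytics tool would do considerably more.

```python
import re

# Matches the common NNN-NN-NNNN format only; other formats would be missed.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_violations(notes):
    """Return the ids of agent notes that appear to contain a social security number.

    `notes` is a dict of note id -> free text (a hypothetical structure).
    """
    return [note_id for note_id, text in notes.items() if SSN_PATTERN.search(text)]


def within_threshold(violations, threshold=0):
    """Check the count of violations against the acceptable threshold (zero here)."""
    return len(violations) <= threshold
```

Each note id returned by find_violations could then be turned into a trouble ticket for the customer service team.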

Most organizations have been approaching information governance policies in a manual fashion. However, vendors now offer tools to automate the process of managing policy for all types of information including big data. Tools in this space include Kalido Data Governance Director and SAP BusinessObjects Information Steward. Organizations that have made the investment in governance, risk and compliance platforms like IBM OpenPages and EMC RSA Archer eGRC may also elect to extend these tools to document operational controls and to monitor compliance with information policies. Finally, some organizations may also choose to use an existing issue management tool like BMC Remedy to handle data-related issues, although these tools are not specifically targeted at this problem domain.

11.  Master Data Management

Organizations may want to enrich their master data with additional insight from big data. For example, they might want to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. Major vendors’ offerings include IBM InfoSphere Master Data Management, Oracle Master Data Management, SAP NetWeaver Master Data Management, and Informatica Master Data Management. Informatica has built a compelling demo to highlight the integration of MDM with Facebook. We anticipate that other vendors will also support integration with social media as part of the so-called “social MDM.” Organizations will also need well-governed, clean reference data such as codes for gender, countries, states, currencies, and diseases, to support their big data projects. All the major MDM vendors also offer tools to manage reference data.

12.  Data Warehouses and Data Marts

Organizations have large investments in data warehouses and data marts that may be based on relational databases (such as Oracle Database 11g and IBM DB2), columnar databases (such as SAP Sybase IQ and ParAccel), and data warehousing appliances (such as Oracle Exalytics In-Memory Machine, IBM Netezza, HP Vertica, and EMC Greenplum). The Teradata Aster MapReduce Appliance offers the ability to use SQL with a MapReduce analytics engine on a Teradata hardware platform.

As organizations adopt big data, they will increasingly follow a blended approach to integrate Hadoop and other NoSQL technologies with their traditional data warehousing environments. For example, a large organization generated significant volumes of clickstream data from its web presence. The clickstream data had the following characteristics:

  • Data was in XML format.
  • Each user session generated large volumes of data.
  • The data was sparse and there was only a small amount of insight to be gained from vast quantities of information.
  • Licensing fees made it cost prohibitive to handle the raw clickstream data within the data warehouse.
  • The business intelligence team found it difficult to parse the XML data with their current ETL tool.

The business intelligence team used Hadoop to analyze user browsing patterns within the clickstream data. However, the team needed to marry the browsing data with the sales information in the limited number of cases where the user actually made a purchase. Because the sales information was in the data warehouse, the business intelligence team decided to use ETL to move the clickstream data for actual buyers from Hadoop into the data warehouse.
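The filtering step in this example, keeping only the sessions that belong to actual buyers before anything is loaded into the data warehouse, might look roughly like the Python sketch below. The XML layout (a session element with a user attribute and page children) is invented for illustration, since the original data format is not described beyond being XML. In practice this logic would run inside a Hadoop job rather than in a single process.

```python
import xml.etree.ElementTree as ET


def extract_buyer_sessions(xml_sessions, buyer_ids):
    """Parse clickstream sessions and keep a compact row for buyers only.

    The element and attribute names are hypothetical.
    """
    buyers = set(buyer_ids)
    rows = []
    for xml_text in xml_sessions:
        session = ET.fromstring(xml_text)
        user_id = session.get("user")
        if user_id not in buyers:
            continue  # non-buyers stay in Hadoop and are never loaded
        pages = [page.get("url") for page in session.findall("page")]
        rows.append({
            "user_id": user_id,
            "pages_viewed": len(pages),
            "last_page": pages[-1] if pages else None,
        })
    return rows
```

The rows that come out are small and tabular, which is the shape a warehouse load expects.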

13.  Big Data Analytics and Reporting

A number of open source and vendor tools can support big data analytics and reporting.

Visualization and Reporting

Vendors such as SAS, IBM (Cognos), SAP (BusinessObjects), Tableau, QlikView, and Pentaho have offerings that can visualize and analyze big data. Vendors’ product roadmaps increasingly offer the visualization and reporting of large datasets in Hadoop. For example, SAP has demonstrated the ability to display federated queries within BusinessObjects across Hadoop and HANA instances in the cloud.

Generalized predictive analytics tools

Analytics models will increasingly incorporate big data types. For example, a predictive model for insurance claims fraud might incorporate social media relationships. Vendors are starting to address this requirement within their product roadmaps. According to recent SAS updates, the SAS/Access Interface to Hadoop allows SAS users to treat Hive as just another data source similar to relational databases, data warehousing appliances, and hierarchical databases. SAS Hadoop support allows users to submit Pig, MapReduce, and HDFS commands from within the SAS environment. SAS also provides the ability to create UDFs that can be deployed within HDFS. This includes the ability to use SAS Enterprise Miner to take analytical scoring code and produce a UDF that can be deployed within HDFS and accessed by Hive, Pig, or MapReduce code. Microsoft has also been surprisingly aggressive with Hadoop support. Microsoft’s Hive ODBC driver enables users of Microsoft SQL Server Analysis Services, PowerPivot, and Power View to interact with Hadoop data. In addition, Microsoft’s Hive add-on for Excel enables users to interact with Hadoop data from a spreadsheet environment. Finally, R is an open source package that is often used to conduct statistical analyses of large datasets in Hadoop.

Social listening

A slew of vendors such as Attensity, Lithium, and Salesforce Radian6 offer tools to address so-called “social listening” requirements. In addition, mega vendors such as IBM with Cognos Consumer Insight, Oracle with Collective Intellect, and SAS with Social Media Analytics also have offerings in this space.

Specialized analytics

A number of vendors offer specialized tools for big data. One notable example is Splunk, which offers tools for analytics of machine-to-machine data from applications and network logs to reduce application downtime and improve network security.

14.  Big Data Security and Privacy

Much has already been said about the issues relating to Hadoop security. Hadoop is still an emerging technology and we anticipate that these issues will be resolved as large companies and vendors get involved. We discuss two important technologies relating to data security and privacy. To the best of our knowledge, these tools do not support Hadoop today. However, we anticipate that vendors will include Hadoop support in their product roadmaps.

Data Masking

These tools are critical to de-identify sensitive information, such as birth dates, bank account numbers, street addresses, and Social Security numbers. Tools in this space include IBM InfoSphere Optim Data Masking Solution and Informatica Data Masking solutions.
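As a very small illustration of de-identification, the Python sketch below masks Social Security numbers in free text while keeping the last four digits. It assumes the numbers appear in the hyphenated NNN-NN-NNNN format, and the function name is hypothetical. Commercial masking tools cover many more data types and preserve referential integrity across tables, which this sketch does not attempt.

```python
import re

# Three capture groups so the last four digits can be kept in the output.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def mask_ssn(text):
    """Replace the first five digits of each SSN-like value with X characters."""
    return SSN_PATTERN.sub(r"XXX-XX-\3", text)
```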

Database Monitoring

These tools enforce separation of duties and monitor access to sensitive big data by privileged users. For example, telecommunications operators can use database monitoring to monitor access to sensitive call detail records which reveal subscribers’ calling patterns. In addition, utilities can use these tools to monitor access to smart meter readings that reveal when consumers are in and out of their homes. The database monitoring functionality must have a minimal impact on database performance and should not require any changes to databases or applications. Vendors include IBM (InfoSphere Guardium) and Imperva.

15.  Big Data Lifecycle Management

Information lifecycle management (ILM) is a process and methodology for managing information through its lifecycle, from creation through disposal, including compliance with legal, regulatory, and privacy requirements. The components of a big data lifecycle management platform are listed below:

Information archiving

As big data volumes grow, organizations need solutions that enable efficient and timely archiving of structured and unstructured information while enabling its discovery for legal requirements, and its timely disposition when no longer needed by the business, legal, or records stakeholders. We discuss three types of big data below:

  • Social media – This big data type is subject to retention policies driven by e-Discovery and regulations from authorities such as FINRA in the United States. A recent blog indicated that this trend is driving an entirely new class of social media archiving tools from vendors such as Arkovi, Backupify, Cloud Preservation, Erado, Hanzo Archives, and PageFreezer.
  • Big transaction data and machine-to-machine data – RainStor uses data compression techniques to reduce the volume of big data. RainStor delivers two editions of its product to manage massive volumes of structured, semi-structured, and unstructured data such as telephone CDRs, utility smart meter readings, and log files.
  • Hadoop – Organizations are also discovering the value of Hadoop as a cost-effective archive for applications such as email.

Vendor offerings such as Symantec Enterprise Vault, HP Autonomy Consolidated Archive, IBM Smart Archive, and EMC SourceOne are positioned as unified archives for a variety of data types.

Records and retention management

Every ILM program must maintain a catalog of laws and regulations that apply to information in the jurisdictions in which a business operates. These laws, regulations, and business needs drive the need for a retention schedule that determines how long documents should be kept and when they should be destroyed. Records management solutions enforce a business process around document retention. Vendor tools include IBM Enterprise Records, EMC Documentum Records Manager, HP Autonomy Records Manager, and OpenText Records Management.
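At its simplest, a retention schedule is a lookup from record type to retention period, from which a disposal date can be computed. The Python sketch below shows that calculation. The record types and the periods of seven and three years are made-up values for illustration only and are not drawn from any law or regulation.

```python
from datetime import date

# Hypothetical retention periods in years, keyed by record type.
RETENTION_YEARS = {"invoice": 7, "email": 3}


def disposal_date(record_type, created):
    """Return the date on which a record becomes eligible for destruction."""
    years = RETENTION_YEARS[record_type]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on February 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, day=28)
```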

Legal Holds and Evidence Collection (eDiscovery)

Most corporations and entities are subject to litigation and governmental investigations that require them to preserve potential evidence. Large entities may have hundreds or thousands of open legal matters with varying obligations for data. Data sources include email, instant messages, Excel spreadsheets, PDF documents, audio, video, and social media. Vendor tools include Symantec Enterprise Vault, HP Autonomy eDiscovery, IBM eDiscovery Manager, Recommind Axcelerate eDiscovery Suite, Nuix eDiscovery, ZyLAB eDiscovery and Production System, and Guidance Software EnCase eDiscovery.

Test Data Management

The big data governance program needs tools to streamline the creation and management of test environments, subset and migrate data to build realistic and right-sized test databases, mask sensitive data, automate test result comparisons, and eliminate the expense and effort of maintaining multiple database clones. IBM InfoSphere Optim Test Data Management Solution and Informatica Data Subset streamline the creation and management of test environments.

16.  Cloud

Organizations are also turning to the cloud because of perceived flexibility, faster time-to-deployment, and reduced capital expenditure requirements. A number of vendors offer big data platforms in the cloud and we list a few examples below:

Mega cloud vendors

Amazon Web Services offers a Hadoop framework that is part of a hosted web service called Amazon Elastic MapReduce. The Google Cloud Platform allows organizations to build applications, store large volumes of data, and analyze massive datasets on Google’s computing infrastructure.

Data brokers
Data brokers include companies such as Acxiom, Reed Elsevier, Thomson Reuters, and, literally, thousands of others that specialize by dataset and industry. These companies offer many types of data enrichment and validation services to organizations.

Mega IT vendors
HP Converged Cloud enables organizations to move between private, hybrid, and public cloud services.

Information management software vendors
Offerings such as Trillium Software TS Quality on Demand and SAS DataFlux Marketplace provide validation, cleansing, and enrichment of name, email, and address as a service. Informatica Cloud provides data loading, synchronization, profiling, and quality services for Salesforce and other cloud applications.

In summary, big data has game changing potential with the advent of new data types and emerging technologies such as Hadoop, NoSQL, and streaming analytics. To take advantage of these developments, organizations need to create a reference architecture that integrates these emerging technologies into their existing infrastructure. As always, I would appreciate your feedback. Please feel free to leave a comment, send me an email at, or find me on Twitter at @sunilsoares1.


IBM InfoSphere for Big Data Integration and Governance

Big Data is generally characterized in terms of the three “V’s” of volume, velocity and variety. IBM is building out a big data platform for both big data at rest and big data in motion. The core components of this platform include:

  •  IBM InfoSphere BigInsights, an enterprise-ready Hadoop distribution from IBM,
  •  IBM InfoSphere Streams, a complex event processing (CEP) engine that leverages massively parallel processing capabilities to analyze big data in motion, and
  • IBM Netezza, a family of data warehouse appliances that support parallel in-database analytics for fast queries against very large datasets.

This blog discusses how IBM InfoSphere provides integration and governance capabilities as part of the IBM big data platform.

1. Big Data Integration

IBM InfoSphere DataStage, Version 8.7 adds the new Big Data File stage that supports reading and writing multiple files in parallel from and to Hadoop to simplify how data is merged into a common transformation process. IBM InfoSphere Data Replication supports data replication including change data capture, which can be used to integrate rapidly changing data such as utility smart meter readings. IBM InfoSphere Federation Server supports data virtualization.

2. Big Data Search and Discovery

IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery can be used to profile structured data as part of a big data governance project. Before building a CEP application using IBM InfoSphere Streams, big data teams need to understand the characteristics of the data. For example, to support the real time analysis of Twitter data, the big data team would use the Twitter API to download a sample set of Tweets for further analysis. The profiling of streaming data is similar to traditional data projects using IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery. Both types of projects need to understand the characteristics of the underlying data, such as the frequency of null values. IBM also recently announced the acquisition of Vivisimo for search and discovery of unstructured content. For example, an insurance carrier can reduce average handling time by providing call center agents with searchable access to multiple document repositories for customer care, alerts, policies and customer information files.

3. Big Data Quality

IBM InfoSphere QualityStage can be used to cleanse structured data as part of a big data governance project. In addition, developers can write MapReduce jobs in IBM InfoSphere BigInsights to address data quality issues for unstructured and semi-structured data in Hadoop. Finally, IBM InfoSphere Streams can deal with streaming data quality issues such as the rate of arrival of data. If IBM InfoSphere Streams “knows” that a particular sensor creates events every second, then it can generate an alert if it does not receive an event after 10 seconds.
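The sensor example can be illustrated with a short Python sketch that looks for gaps of more than 10 seconds between consecutive event timestamps. This is not InfoSphere Streams code. The function name is hypothetical, and unlike a real streaming engine, which would raise the alert from a timer as soon as the timeout expires, this version only notices a gap once the next event arrives.

```python
def find_gaps(event_times, timeout=10.0):
    """Return (last_seen, next_seen) pairs where the silence exceeded `timeout` seconds.

    `event_times` is an iterable of timestamps in seconds, in arrival order.
    """
    gaps = []
    previous = None
    for current in event_times:
        if previous is not None and current - previous > timeout:
            gaps.append((previous, current))
        previous = current
    return gaps
```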

4. Metadata for Big Data

IBM InfoSphere Business Glossary and IBM InfoSphere Metadata Workbench manage business and technical metadata. The information governance team needs to extend business metadata to cover big data types. For example, the term “unique visitor” is a fundamental building block of clickstream analytics and is used to measure the number of individual users of a website. However, two sites may measure unique visitors differently, with one site counting unique visitors within a week while another one may measure unique visitors within a month.
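The difference between the two definitions is easy to see with a small worked example. The Python sketch below counts unique visitors over the same visits using a weekly window and a monthly window. The visit data and the function name are invented for illustration.

```python
from datetime import date


def unique_visitors(visits, period):
    """Count distinct visitor ids per period.

    `visits` is an iterable of (visitor_id, date) pairs; `period` is "week"
    (ISO calendar week) or "month".
    """
    buckets = {}
    for visitor_id, day in visits:
        if period == "week":
            iso = day.isocalendar()
            key = (iso[0], iso[1])  # ISO year and ISO week number
        elif period == "month":
            key = (day.year, day.month)
        else:
            raise ValueError("period must be 'week' or 'month'")
        buckets.setdefault(key, set()).add(visitor_id)
    return {key: len(ids) for key, ids in buckets.items()}


# Hypothetical visits: visitor v1 returns a week after the first visit.
visits = [
    ("v1", date(2012, 8, 6)),
    ("v1", date(2012, 8, 13)),
    ("v2", date(2012, 8, 14)),
]
weekly = unique_visitors(visits, "week")
monthly = unique_visitors(visits, "month")
```

Adding up the weekly counts gives three unique visitors, because the returning visitor is counted once in each week, while the monthly definition gives two for the same traffic. This is why the business glossary needs to state which definition applies.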

5. Master Data Management

High quality master data is a critical enabler of a big data program. For example, the risk department at a bank can use text analytics on SEC 10-K and 10-Q financial filings to update customer risk management hierarchies in real time as ownership positions change. In another example, the operations department might use sensor data to identify defective equipment but needs consistent asset nomenclature if it wants to also replace similar equipment in other locations. Finally, organizations might embark on a “social MDM” program to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. IBM has taken the first step towards unifying its diverse MDM offerings including Initiate and MDM for Product Information Management under the banner of IBM InfoSphere Master Data Management V10.

6. Reference Data Management

The importance of high quality reference data for big data cannot be overstated. For example, an information services company used a number of data sources like product images from the web, point of sale transaction logs and store circulars to improve the quality of its internal product master data. The company used web content to validate manufacturer-provided Universal Product Codes, which are 12-digit codes represented as barcodes on products in North America. The company also validated product attributes; for example, a shampoo was listed as 4 oz. on the web versus 3.8 oz. in the product master. To support its master data and data quality initiatives, the company maintained reference data including thousands of unique values for color. For example, it maintained reference data to indicate that “RED,” “RD,” and “ROUGE” referred to the same color. IBM InfoSphere MDM Reference Data Management Hub manages reference data such as codes for countries, states, industries, and currencies.
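
As a rough illustration of these two checks, the sketch below validates a 12-digit code using the standard UPC-A check digit rule and maps color variants to a canonical value. It is not how the company or the IBM Reference Data Management Hub implements this; the synonym table is a tiny assumed sample of what would be thousands of values.

```python
def is_valid_upc_a(code):
    """Check a 12-digit UPC-A code against its check digit (the last digit)."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    odd_sum = sum(digits[0:11:2])   # 1st, 3rd, ..., 11th digits
    even_sum = sum(digits[1:10:2])  # 2nd, 4th, ..., 10th digits
    check = (10 - (3 * odd_sum + even_sum) % 10) % 10
    return check == digits[11]


# Assumed sample of color reference data: raw value -> canonical value.
COLOR_SYNONYMS = {"RED": "RED", "RD": "RED", "ROUGE": "RED"}


def canonical_color(raw):
    """Return the canonical color for a raw value, or None if it is unknown."""
    return COLOR_SYNONYMS.get(raw.strip().upper())
```

Values that return None from `canonical_color` would be routed to a data steward so that the reference data can be extended.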

7. Big Data Security and Privacy

IBM offers a number of tools for big data security and privacy:

  • Data Masking – IBM InfoSphere Optim Data Masking Solution applies a variety of data transformation techniques to mask sensitive data with contextually accurate and realistic information.
  • Database activity monitoring – IBM InfoSphere Guardium creates a continuous, fine-grained audit trail of all database activities, including the “who,” “what,” “when,” “where” and “how” of each transaction. This audit trail is continuously analyzed and filtered in real-time, to identify unauthorized or suspicious activities, including by privileged users. This solution can be deployed in a number of situations that monitor access to sensitive big data. For example, telecommunications operators can use Guardium to monitor access to sensitive call detail records which reveal subscribers’ calling patterns. In addition, utilities can use Guardium to monitor access to smart meter readings that reveal when consumers are in and out of their homes.

8. Big Data Lifecycle Management

IBM has developed a robust big data lifecycle management platform:

  • Archiving – IBM Smart Archive includes software, hardware and services offerings from IBM. The solution includes IBM InfoSphere Optim and the IBM Content Collector family for multiple data types including email, file systems, Microsoft SharePoint, SAP applications and IBM Connections. It also includes the IBM Content Manager and IBM FileNet Content Manager repositories.
  • eDiscovery – IBM eDiscovery Solution enables legal teams to define evidence obligations, coordinate with IT, records, and business teams, and reduce the cost of producing large volumes of evidence in legal matters. The solution includes IBM Atlas eDiscovery Process Management that enables legal professionals to manage a legal holds workflow.
  • Records and retention management – IBM Records and Retention Management helps an organization manage records according to a retention schedule.
  • Test Data Management – IBM InfoSphere Optim Test Data Management Solution streamlines the creation and management of test environments, subsets data to create realistic and right-sized test databases, masks sensitive data, automates test result comparisons, and eliminates the expense and effort of maintaining multiple database clones.

I anticipate that IBM will continue to build out its integration and governance capabilities for big data. Hopefully, this blog provides a useful reference for companies that already have or are considering IBM InfoSphere for big data. As always, comments are welcome. You can also reach me at or on Twitter at @sunilsoares1.


Big Data Governance Needs Robust Business Process Management

A number of clients have been talking about the importance of aligning key business processes with their information governance programs as well as emerging initiatives around big data. It is almost as if BPM, MDM, information governance, and big data exist in their own silos. In this blog, I will endeavor to provide a framework to align these discrete initiatives.

An organization is built around its business processes, so it makes sense to start there. My IBM colleague, Ken Jacquier @kenjacquier, introduced me to IBM BlueworksLive, a SaaS-based BPM offering.

 Overview of the business process
Using BlueworksLive, I was able to create a very simple process “snippet” that describes the management of social media at a retailer with physical store locations. You can view this process below:

 Detailed business processes – milestones and activities
We discuss each milestone and activity in the process below:

1. Customer sets up RFID card
Retailers can now leverage RFID cards to enable their customers to interact with social media in the off-line world. In the past, customers would have to stop what they were doing, and use a smartphone to post a message on Facebook or Twitter.

1.1  Customer obtains an RFID card
The customer obtains an RFID card from the retailer.

1.2  Customer links RFID card to Facebook page
The customer links the RFID card to their Facebook profile prior to the shopping experience.

1.3  Customer okays Facebook wall postings
The customer sets up their profile and allows the retailer to access their Facebook page and make short postings to document their experiences with the retailer.

2.  Customer completes store purchases
In this milestone, the customer completes her shopping experience at the retail store.

2.1  Customer places items in shopping basket
The customer places items in her shopping basket. Some of the merchandise may also contain RFID tags that help the retailer track inventory throughout the supply chain and on the store shelf.

2.2  Cashier asks customer for phone number
Some U.S. states allow retailers to ask for phone numbers at the point-of-sale. Retailers use this information to better understand their customers.

2.3  Cashier scans items
The cashier scans the merchandise in the customer’s shopping basket.

2.4  Cashier deactivates RFID merchandise tags
The cashier deactivates RFID-enabled merchandise at the point-of-sale. We discuss this topic further under big data governance policies.

2.5  Customer scans RFID card
The customer scans her RFID card, which is the trigger for the retailer to make a posting on her Facebook wall.

2.6  Facebook: “I got 30% off shoes at XYZ”
The retailer posts a message to the customer’s Facebook wall that says: “I got 30 percent off shoes at XYZ.”

3.  Data aggregated in marketing data warehouse
The marketing department now pulls all the relevant content into the data warehouse for analytics.

3.1  Reverse append customers from phone numbers
Even if the customer pays in cash, marketing can do a “reverse append” to obtain the customer identifier based on her phone number. Marketing can then use the identifier to add the customer’s purchases to her overall transaction history in the data warehouse.
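
A reverse append is essentially a lookup from a normalized phone number to a customer identifier. The sketch below assumes 10-digit U.S. phone numbers and a hypothetical in-memory lookup table; a real implementation would query the customer master or a third-party append service.

```python
import re


def normalize_phone(raw):
    """Reduce a U.S. phone number to its 10 digits, or return None if it is malformed."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits if len(digits) == 10 else None


def reverse_append(transactions, phone_to_customer):
    """Attach a customer_id to each transaction whose phone number is known."""
    matched, unmatched = [], []
    for txn in transactions:
        phone = normalize_phone(txn["phone"])
        customer_id = phone_to_customer.get(phone)
        if customer_id is None:
            unmatched.append(txn)
        else:
            matched.append({**txn, "customer_id": customer_id})
    return matched, unmatched


# Hypothetical lookup table and cash transactions.
phone_to_customer = {"2125550100": "C-1001"}
transactions = [
    {"txn_id": "T1", "phone": "(212) 555-0100", "amount": 59.99},
    {"txn_id": "T2", "phone": "555-0100", "amount": 12.50},
]

matched, unmatched = reverse_append(transactions, phone_to_customer)
```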

3.2  Pull in data from Facebook friends
Marketing pulls down a list of the customer’s Facebook friends from her profile.

3.3  Marketing analyzes friends, locations, etc.
Marketing uses information from the RFID card to understand the customer’s shopping behavior by location. Marketing can also determine if the customer’s Facebook postings resulted in incremental purchases by her friends.

 Mapping of big data governance policies to business processes
The next step is to map big data governance policies to the specific activities. We list the sequence number, the activity, and the associated big data governance policy below:

1.3  Customer okays Facebook wall postings
The retailer obtains informed consent from the customer to post messages to her Facebook wall and to use key information in her profile such as her friends.

2.2  Cashier asks customer for phone number
Marketing depends on high quality phone numbers to improve customer insight. Even if the customer pays in cash, marketing can do a “reverse append” to obtain the customer identifier based on their phone number. Marketing can then use the customer ID to add the customer’s purchases to their overall transaction history in the data warehouse. Store operations needs to train and incent the store associates to capture accurate phone numbers at the point of sale. Of course, there are several challenges with this approach. Some customers will decline to provide a phone number. In addition, some retailers have found that store associates try to meet their targets by entering the phone number of the store or a local hotel. Information governance needs to work with marketing and store operations to establish real-time validation rules to ensure that store associates enter appropriate phone numbers at the point of sale. It should be noted that phone numbers are not “big data” but they do affect the ability of marketing to associate social media to the customer profile.
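
A real-time validation rule of this kind might look like the sketch below. The specific checks, the reason strings, and the idea of a per-store blocked list (for example, the local hotel’s number) are assumptions for illustration, and the rule covers 10-digit U.S. numbers only.

```python
import re


def validate_phone(raw, store_phone, blocked_numbers=frozenset()):
    """Return (is_valid, reason) for a phone number keyed in at the point of sale."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return False, "not a 10-digit number"
    if len(set(digits)) == 1:
        return False, "repeated digit"
    if digits == store_phone:
        return False, "store's own number"
    if digits in blocked_numbers:
        return False, "blocked number"
    return True, "ok"


# Hypothetical store configuration.
STORE_PHONE = "2125550199"
BLOCKED = frozenset({"2125550142"})  # e.g., a local hotel's front desk

result = validate_phone("(212) 555-0100", STORE_PHONE, BLOCKED)
```

Returning a reason alongside the flag lets the point-of-sale system prompt the associate to re-enter the number, and lets the governance team report on why entries were rejected.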

2.4  Cashier deactivates RFID merchandise tags
RFID tags could potentially be used to profile and track individuals. Retailers who pass RFID tags on to customers without automatically deactivating or removing them at the checkout may unintentionally enable this risk.

3.1 Reverse append customers from phone numbers
This activity is closely tied to 2.2 above. The big data governance program also needs to establish metrics around the percentage of “poor” phone numbers, something that has a direct impact on the establishment of a customer identity that is important to marketing.
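
One way to express such a metric is the percentage of “poor” phone numbers per store. In the sketch below, what counts as “poor” is passed in as a function because that definition is a policy decision; the one shown, along with the store IDs and numbers, is only an assumed placeholder.

```python
from collections import defaultdict


def poor_phone_rate_by_store(entries, is_poor):
    """Return the percentage of poor phone numbers captured by each store."""
    totals = defaultdict(int)
    poor = defaultdict(int)
    for store_id, phone in entries:
        totals[store_id] += 1
        if is_poor(phone):
            poor[store_id] += 1
    return {store_id: 100.0 * poor[store_id] / totals[store_id] for store_id in totals}


def is_poor(phone):
    """Assumed placeholder: missing, or not exactly 10 digits."""
    return phone is None or len(phone) != 10 or not phone.isdigit()


# Hypothetical point-of-sale captures as (store id, phone number) pairs.
entries = [
    ("S1", "2125550100"),
    ("S1", None),
    ("S1", "000"),
    ("S1", "2125550133"),
    ("S2", "2125550111"),
    ("S2", "2125550122"),
]

rates = poor_phone_rate_by_store(entries, is_poor)
```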

3.2  Pull in data from Facebook friends
Marketing needs to relate a person’s Facebook profile with their internal record. This may not always be easy. For example, a customer may have a friend named Susie Smith on Facebook. Marketing needs to determine if “Susie Smith” is related to the six instances of “Susan Smith” in their internal systems. This entity resolution may be even more difficult for Twitter because users often have cryptic handles. The big data governance team needs to establish policies to relate Facebook identities with internal customer profiles using attributes such as relationships, name, and address.
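
A very simplified version of this entity resolution is sketched below. It treats nicknames as equivalent to formal names and then ranks candidate records by how many other attributes agree. The nickname table, attribute names, and scoring are assumptions; a production MDM matching engine would use far richer probabilistic rules.

```python
# Assumed nickname table; a real one would be much larger.
NICKNAMES = {"susie": "susan", "sue": "susan", "bill": "william"}


def canonical_name(full_name):
    """Lowercase a name and replace known nicknames with formal names."""
    parts = full_name.lower().split()
    return " ".join(NICKNAMES.get(part, part) for part in parts)


def match_candidates(social_profile, internal_records):
    """Rank internal records with the same canonical name by agreeing attributes."""
    target = canonical_name(social_profile["name"])
    scored = []
    for record in internal_records:
        if canonical_name(record["name"]) != target:
            continue
        score = 1  # the name agrees
        if record.get("city") and record.get("city") == social_profile.get("city"):
            score += 1  # the address (here, just the city) agrees
        shared = set(record.get("known_contacts", [])) & set(social_profile.get("friends", []))
        if shared:
            score += 1  # at least one relationship agrees
        scored.append((score, record["customer_id"]))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical social profile and internal customer records.
profile = {"name": "Susie Smith", "city": "Austin", "friends": ["C-7"]}
records = [
    {"customer_id": "C-1", "name": "Susan Smith", "city": "Austin", "known_contacts": ["C-7"]},
    {"customer_id": "C-2", "name": "Susan Smith", "city": "Boston"},
    {"customer_id": "C-3", "name": "Sue Smith", "city": "Austin"},
    {"customer_id": "C-4", "name": "Susan Smyth", "city": "Austin"},
]

ranked = match_candidates(profile, records)
```

The governance policy would then decide the score at which a match is accepted automatically and the score at which it is sent to a data steward for review.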

This example is for a simple process but the overall framework should also apply to more complex scenarios.


Big Data Governance

I am starting to see a convergence of two major trends in the marketplace: information governance and Big Data. We are coining the term “Big Data Governance” to reflect this emerging trend. I define Big Data Governance as the formulation of policy to optimize, secure, and leverage Big Data as an enterprise asset by aligning the objectives of multiple functions.

Here is the framework that I have developed to establish the scope of information governance:

  1. Master Data Governance
    This includes a single view of customers, materials, vendors, employees and chart of accounts. Each data domain has specific attributes that need to be fit for purpose. For example, phone number is an important attribute for the customer data domain, because an enterprise needs valid contact information when it has to reach a customer.
  2. Reference Data Governance
    This includes data that is relatively static such as codes for countries, states or provinces, currencies, industries and customer segments.
  3. Big Data Governance
    These include social media (Twitter feeds, blogs, Facebook pages, LinkedIn profiles), cell phone GPS data, sensor data, weather data, etc. These data tend to be operational in nature and meet the three “V” criteria – volume, velocity, and variety.

 Most of my clients are implementing information governance programs today. These programs focus on the governance of master data and, to a lesser extent, reference data. Based on my conversations, I expect that clients will increasingly focus on the governance of big data in the next 12-18 months.

 Big Data Governance programs need to focus on issues that are similar to other information governance initiatives. For example, these programs need to address the following:

  • Information Lifecycle Management – Big Data programs need to ensure that storage costs do not spiral out of control.
  • Data Quality – Organizations need to establish what level of data quality is “good enough” because of the high volume and velocity of Big Data.
  • Metadata – Big Data Governance needs to create sound metadata to avoid situations such as a company buying the same dataset twice because it was named differently in two repositories.
  • Privacy – Enterprises need to be very specific about how they address privacy concerns, such as those raised by social media analytics.

 All said and done, 2012 should be a breakout year for Big Data Governance programs.


The IBM Data Governance Unified Process

I recently published a book called the “IBM Data Governance Unified Process.”  As I state on the back cover, Data Governance can be like the blind men and the elephant.  Depending on which part of the elephant you touch, you will equate Data Governance with one or more of metadata, business glossaries, master data governance, analytics governance, security and privacy, and information life-cycle governance.

All these definitions are correct, but they are incomplete. Data Governance is about setting policy around data to treat it as an enterprise asset. Your enterprise needs to protect, leverage, and optimize its data in much the same way as it would any tangible asset such as a building or equipment.

The book walks through a series of 14 steps and 93 sub-steps to implement a Data Governance program.
