Global IDs Data Governance Product Suite

I had the opportunity to sit in on a presentation by Arka Mukherjee, CEO and Founder of Global IDs, at the Data Governance Winter Conference in Ft. Lauderdale in December 2012. The session had an intriguing title, “Merging Big Data with Enterprise Data.” Arka acknowledged that most of Global IDs’ use cases are around “small data” today. However, the company works with several Fortune 1000 companies, including a large telecommunications operator.

Arka provided some useful background information on big data from the perspective of Global IDs:

  • Volume – Most of the big data use cases deal with petabytes versus megabytes of data
  • Format – Big data is primarily in unstructured format
  • Sparsity – Big data has a very low signal-to-noise ratio, sort of like “searching for a needle in a needle stack”
  • NoSQL – Big data involves NoSQL database platforms like Cassandra, Hadoop, MongoDB, Neo4j, etc.

Arka went on to emphasize that big data requires some unique skills:

  • Data modeling with ontologies
  • Data analytics with R
  • Data movement in and out of NoSQL DBs
  • Data governance on Resource Description Framework (RDF) stores

Arka mentioned the following use cases around linking big data with enterprise master data:

  • Linking product master data with product safety websites
  • Linking enterprise reference data with Linked Open Data, a massive integration effort that links thousands of government databases using RDF
  • Linking supplier master data with news websites
  • Linking enterprise master data with third party data like market data, demographics, credit reports from Acxiom and D&B

Global IDs provides “extreme data governance” capabilities in large and complex data environments. They claim to take the problem away from humans and let machines handle these processing-intensive tasks. Arka mentioned that a recent client had a million databases, and that the average client will look at three to four million attributes.

An interesting company that I will definitely want to learn more about.


Pentaho Big Data Platform

I’ve been hearing a lot of buzz from clients about Pentaho. One large bank mentioned that they loaded their Hadoop environment in seconds with Pentaho versus several minutes with other leading ETL tools. So I decided to do some research on Pentaho to see what all the fuss was about.

It turns out that Pentaho offers a tightly integrated suite for business intelligence and data integration.

Pentaho Business Analytics offers reporting, dashboarding, and analytical capabilities based on in-memory technology. The tool offers strong visualization capabilities including support for scatter plots, heat grids, and geo-mapping.

Pentaho Business Analytics leverages Pentaho Data Integration to provide analytics for data residing in relational databases, NoSQL stores, Hadoop, and business intelligence appliances. Pentaho Data Integration is an ETL tool that supports connectivity to OLTP, analytical and NoSQL databases such as Oracle, Greenplum, Teradata, Netezza, Apache Cassandra, and MongoDB as well as to unstructured and semi-structured sources such as Hadoop, Excel, XML, and RSS feeds.

The Pentaho platform comes in two basic flavors: a community edition that is open source, and an enterprise edition that includes product support and advanced features.

Pentaho has been gaining traction with organizations due to its strong big data support, open source heritage, and cost effectiveness.


IBM Needs a Holistic Approach to Governing Big and Small Data on System z

IBM System z is the premier mainframe platform in the market today. Many of the largest enterprises in the world rely on System z for their business critical operations. At the same time, these organizations are also maturing their data governance and big data initiatives. However, in many cases, these organizations have not tied their data governance and big data initiatives back to the mainframe where most of their mission-critical data currently resides.

The light bulb went off for me when I was talking to the manager of data governance at a financial services institution. We were establishing information policies around email addresses. They had to quantify the number of missing email addresses but that data was on the mainframe. They wanted to set policies around customer duplicates but their customer information file was on the mainframe. Their CIO wanted to reduce storage costs. Most of that data was on the mainframe. Their Chief Information Security Officer (CISO) needed to set policies, but most of the data was (you guessed it) on the mainframe.

Now, let’s talk about big data. Yes, there’s lots of hype but I fully expect that some of those vast oceans of data will land on System z. An insurer recently mentioned that they were looking at a telematics program that placed sensors on automobiles. They also mentioned that this treasure trove of sensor data (big data) would probably end up in the System z environment.

IBM actually has a broad portfolio of tools for governing big and small data on System z. I have listed a few below:

  • Data Profiling – Assessing the current state of the data is often the first step in a data governance program. IBM InfoSphere Information Analyzer offers data profiling capabilities on System z.
  • Data Discovery – A large financial institution had thousands of VSAM files. They found that Social Security Numbers were hidden in a field called EMP_NUM. A data discovery tool like IBM InfoSphere Discovery helped them discover hidden sensitive data on the mainframe.
  • Business Glossary – IBM InfoSphere Business Glossary for Linux on System z enables large mainframe shops to deploy their business glossaries in a System z environment.
  • Data Archiving – I know of at least one large institution that had data from the 1950s still sitting on the mainframe. IBM InfoSphere Optim Data Growth helped rationalize their MIPS costs by moving older data to less expensive storage.
  • Database Monitoring – Many large financial institutions want to monitor access by privileged users like DBAs to sensitive data on the mainframe. IBM InfoSphere Guardium now offers S-TAP support for IMS, VSAM and DB2 on z/OS.
  • Big Data – The IBM DB2 Analytics Accelerator for z/OS allows companies to accelerate analytical queries by leveraging the power of a Netezza appliance.
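Under the hood, data discovery tools typically rely on pattern matching over column values rather than column names, which is how SSNs hiding in a field like EMP_NUM get caught. A minimal sketch of that approach, using a hypothetical table and a simplified US Social Security Number pattern:

```python
import re

# Pattern-based discovery: flag columns whose values look like US Social
# Security Numbers, regardless of how the column is named (hypothetical data).
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")

def find_hidden_ssn_columns(table, threshold=0.8):
    """Return column names where most values match an SSN pattern."""
    flagged = []
    for column, values in table.items():
        if not values:
            continue
        hits = sum(1 for v in values if SSN_PATTERN.match(str(v)))
        if hits / len(values) >= threshold:
            flagged.append(column)
    return flagged

# EMP_NUM is innocuously named but actually holds SSNs.
sample = {
    "EMP_NUM": ["123-45-6789", "987654321", "555-12-3456"],
    "DEPT": ["HR", "IT", "OPS"],
}
print(find_hidden_ssn_columns(sample))  # ['EMP_NUM']
```

A production tool would combine many such patterns with metadata heuristics and sampling, but the core idea is the same.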

I am not suggesting that every data governance program needs to now focus on System z. However, there are many large IBM mainframe shops that are building out their data governance and big data programs. I do believe that these organizations would benefit from a better alignment between System z and their data governance and big data initiatives.


Automating the Management of Information Policies

Information policies are a crucial deliverable from any data governance program. Whether they recognize it or not, organizations grapple with five important processes relating to information policies:

  1. Documenting policies relating to data quality, metadata, privacy, and information lifecycle management. For example, an information policy might state that call center agents need to search for a customer name before creating a new record.
  2. Assigning roles and responsibilities such as data stewards, data sponsors, and data custodians.
  3. Monitoring compliance with the information policy. In the abovementioned example, the organization might measure the number of duplicate customer records to monitor adherence by call center agents to the information policy.
  4. Defining acceptable thresholds for data issues. In the example, the data governance team may determine that three percent duplicates are an acceptable threshold for customer data quality because it is uneconomical to pursue issue resolution when the percentage of duplicates falls below that level.
  5. Managing issues, especially those that are long-lived and affect multiple functions and lines of business. Taking the example further, the data governance team may create a number of trouble tickets so that the customer data stewards can eliminate duplicate records.
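To make the monitoring and threshold steps concrete, here is a minimal Python sketch; the match key, the sample customer records, and the three percent threshold are purely illustrative:

```python
def duplicate_rate(records, key=lambda r: (r["name"].lower(), r["email"].lower())):
    """Fraction of records that duplicate an earlier record on the match key."""
    seen, dupes = set(), 0
    for r in records:
        k = key(r)
        if k in seen:
            dupes += 1
        else:
            seen.add(k)
    return dupes / len(records) if records else 0.0

THRESHOLD = 0.03  # 3% duplicates deemed acceptable by the governance team

customers = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann lee", "email": "ANN@example.com"},  # duplicate on the key
    {"name": "Bob Roy", "email": "bob@example.com"},
]
rate = duplicate_rate(customers)
if rate > THRESHOLD:
    print(f"Policy violation: {rate:.1%} duplicates exceeds {THRESHOLD:.0%}")
```

A real implementation would use probabilistic matching rather than an exact key, but the threshold check itself is this simple.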

Most organizations have a manual approach to information policy management. This approach may work well at the start but quickly becomes unsustainable as the number of information policies increases.

Several software vendors offer tools to automate the process of managing policies for all types of information. The functionality of these tools depends on their heritage. We classify these tools into the following categories:

  1. Standalone tools – Kalido Data Governance Director is noteworthy in this category.
  2. Tools with a business glossary heritage – Collibra Business Semantics Glossary and IBM InfoSphere Business Glossary are primarily focused on managing business terms. However, the functionality of these tools around stewardship, categories and workflows means that they can also be extended to managing information policies.
  3. Enterprise applications – Enterprise applications offer a mechanism to manage information policies in the context of key business processes. I have been writing extensively about this topic.
    As an example, SAP BusinessObjects Information Steward provides targeted capabilities for data stewards to manage and monitor data quality scorecards, data validation business rules, business definitions, and metadata. In addition, SAP Master Data Governance provides capabilities to enforce information policies in the context of business processes.
  4. GRC tools – Many organizations have made significant investments in Governance, Risk and Compliance (GRC) platforms like IBM OpenPages and EMC RSA Archer eGRC. These organizations may also elect to extend these tools to document operational controls and to monitor compliance with information policies.
  5. Issue management platforms – Organizations may also choose to use an existing issue management tool like BMC Remedy to track data-related issues, although these tools are not specifically targeted at this problem domain; I know of at least one organization that has done so.

I intend to write more about this topic. Have I missed anything? Please comment.


Collibra Business Semantics Glossary

I have been hearing some buzz about Collibra’s data governance software over the past year or so. The chatter came from sales reps, clients, and analysts. I’ve also run into the Collibra folks at different conferences. So I decided to do a bit of research on the company. They are VC-backed and based out of Belgium, with operations in the U.S. and Northern Europe. I had a couple of meetings with Stijn Christiaens, COO and Co-Founder, and Benny Verhaeghe, Director of Sales & Marketing. Stijn also gave a couple of demonstrations of their software.

I must say that I really like it. The software does a nice job of what it is meant to do. It has a clean, intuitive interface and provides a workflow to manage business terms and data governance policies. It also lets an administrator set up roles within the organization, such as the data governance council and data stewards. The whole area of information policy management is underserved by existing software vendors, and Collibra makes some strides in this regard.

Of course, Collibra does not try to be a full-fledged data governance platform. The tool does not yet offer support for technical metadata, data profiling, data quality, or master data management. But it is a nice entry-level tool for organizations that want to get started with data governance.

Posted in Data Governance, Sunil Soares, Uncategorized | Tagged | Leave a comment

First of a Kind Industry Training: Data Governance Fundamentals for Health Plans

I strongly believe that industry-orientation and verticalization is the next big step in the evolution of data governance. Most data governance training programs deal with best practices in a cross-industry manner. However, data governance in banking is completely different from data governance in health plans. I am already delivering a public course on Data Governance Fundamentals in Chicago on February 5-6, 2013.

So I got around to thinking that it would be valuable to put something like this together for a specific industry. Wouldn’t it be nice if health plans had a data governance template that was specific to their industry? A data governance charter for health plans. A sample data governance organization for health plans with roles and responsibilities. A sample data quality scorecard with member and provider KPIs. A sample business case.

My last two books “Selling Information Governance to the Business” and “Big Data Governance” both have chapters on healthcare. Based on these books and my own consulting experience, I have developed a two-day training class tailored to health plans. I have already been delivering this class over the past several months.

The topics for the two-day class are as follows:

  1. Overview of data governance in health plans
  2. Building the business case for data governance in health plans (e.g., Member 360)
  3. Member Data Governance
  4. Provider Data Governance
  5. Organizing for data governance (e.g., sample data governance charter for health plans, sample data governance organization including Medical Informatics, Member Services, Network Management, Marketing, Finance and Privacy)
  6. Data Stewardship Fundamentals
  7. Writing data governance policies (e.g., sharing claims data with external parties)
  8. Building a business glossary
  9. Creating a data quality scorecard
  10. Information lifecycle governance overview (e.g., defensible disposition, test data management, archiving and data retention)
  11. Aligning with Security and Privacy (e.g., HIPAA compliance, aligning with the chief information security officer and chief privacy officer, leveraging data discovery to discover sensitive data)
  12. Big data governance
  13. Reference architecture for data governance

At the end of the class, participants will have a binder of courseware that is specific to health plans. I have taken great pains to use only healthcare examples and content.

I will be emphasizing the industry-orientation of data governance during my workshop on Industry Best Practices at the Data Governance Winter Conference in Ft. Lauderdale in a few weeks.

These are truly exciting times.


The Implications of Facebook’s Platform Policies on Master Data Management

Organizations need to review Facebook’s Platform Policies before attempting to integrate Facebook data with customer master data. While this blog does not intend to provide legal advice, I have mapped the relevant Facebook Platform Policies as of August 8, 2012 with their implications for master data management.

  1. Policy: A user’s friends’ data can only be used in the context of the user’s experience on your application.
     Implication for MDM: Organizations cannot use data on a person’s friends outside of the context of the Facebook application (e.g., using Facebook friends to add new relationships within MDM).
  2. Policy: Subject to certain restrictions, including on transfer, users give you their basic account information when they connect with your application. For all other data obtained through use of the Facebook API, you must obtain explicit consent from the user who provided the data before using it for any purpose other than displaying it back to the user on your application.
     Implication for MDM: Organizations need to obtain explicit consent from the user before using any information other than basic account information (name, email, gender, birthday, current city, and the URL of the profile picture).
  3. Policy: You will not use Facebook user IDs for any purpose outside your application (e.g., your infrastructure, code, or services necessary to build and run your application). Facebook user IDs may be used with external services that you use to build and run your application, such as a web infrastructure service or a distributed computing platform, but only if those services are necessary to running your application and the service has a contractual obligation with you to keep Facebook user IDs confidential.
     Implication for MDM: Organizations can use Facebook user IDs within MDM to power a Facebook app. However, organizations cannot use these Facebook IDs outside of the context of a Facebook app.
  4. Policy: If you stop using Platform or we disable your application, you must delete all data you have received through use of the Facebook API unless: (a) it is basic account information; or (b) you have received explicit consent from the user to retain their data.
     Implication for MDM: Organizations need to be very careful about merging Facebook data with other data within their MDM environment. Consider a situation where an organization merged “married to” information from a user’s Facebook profile into its MDM system. If the organization stops using the Facebook Platform, it will need to obtain explicit permission from the user to retain this information. This can be problematic when the organization has merged Facebook data into a golden copy that has been propagated across the enterprise.
  5. Policy: You cannot use a user’s friend list outside of your application, even if a user consents to such use, but you can use connections between users who have both connected to your application.
     Implication for MDM: Similar issues to topic 1 above.
  6. Policy: You will delete all data you receive from us concerning a user if the user asks you to do so, and will provide an easily accessible mechanism for users to make such a request. We may require you to delete data you receive from the Facebook API if you violate our terms.
     Implication for MDM: Similar issues to topic 4 above.



Big Data Reference Architecture

A Reference Architecture for Big Data must include a Focus on Governance and Integration with an Organization’s Existing Infrastructure

Figure: Reference architecture for big data.

There is a lot of hype about technologies like Apache Hadoop and NoSQL because of their ability to help organizations gain insights from vast quantities of high velocity, semi-structured, and unstructured data in a cost-effective manner. However, big data does not give IT a license to “rip and replace,” and CIOs want to understand how these technologies will interact with the organization’s technical architecture. The figure above describes a reference architecture for big data. I will discuss each component of the reference architecture in this article.

I want to provide just one caveat before I get started. There are a number of vendors offering a dizzying array of offerings for big data. It is not possible for me to cover every vendor and every offering in this article.

1. Big Data Sources

Big data types include web and social media, machine-to-machine, big transaction data, biometrics, and human generated data. This data may be in structured, unstructured, and semi-structured formats.

2. Hadoop Distributions

Because it consists of a bewildering array of technologies with their own release schedules, Hadoop can be somewhat intimidating to the novice user. A number of vendors have created their own commercial distributions of Apache Hadoop that have undergone release testing, and bundle product support and training. Most enterprises that have deployed Hadoop for commercial use have selected one of the Hadoop distributions. Standalone vendors who offer Hadoop distributions include Cloudera, MapR, and Hortonworks. In addition, IBM offers a Hadoop distribution called InfoSphere BigInsights. Amazon Web Services offers a Hadoop framework that is part of a hosted web service called Amazon Elastic MapReduce. EMC offers a Hadoop distribution called Greenplum HD. Microsoft has also announced the availability of the community technology preview of its Hadoop distribution as a cloud-based service on Windows Azure as well as an on-premise version on Windows Server.

3. Streaming Analytics

Hadoop is well suited to handle large volumes of data at rest. However, big data also involves high velocity data in motion. Streaming analytics, also known as complex event processing (CEP), refers to a class of technologies that leverage massively parallel processing capabilities to analyze data in motion as opposed to landing large volumes of data to disk. There are a number of open source and vendor tools in this space. For example, Apache Flume is an incubator effort that uses streaming data flows to collect, aggregate, and move large volumes of data into the Hadoop Distributed File System (HDFS). IBM offers a tool called InfoSphere Streams that grew out of early work with the United States government. StreamBase, SAP Sybase Event Stream Processor, and Informatica RulePoint also offer CEP engines.

4. Databases

Enterprises have the ability to select from multiple database approaches:


NoSQL (“not only SQL”) databases are a category of database management systems that do not use SQL as their primary query language. These databases may not require fixed table schemas and typically do not support join operations. They are optimized for highly scalable read-write operations rather than for strict consistency. NoSQL databases include a vast array of offerings such as Apache HBase, Apache Cassandra, MongoDB, Apache CouchDB, Couchbase, Riak, and Amazon DynamoDB, which forms part of the Amazon Web Services platform. DataStax offers an Enterprise edition that includes a Hadoop distribution and replaces HDFS with CassandraFS.


In-memory database management systems rely on main memory for data storage. Compared to traditional database management systems that store data to disk, in-memory databases are optimized for speed. In-memory databases will become increasingly important as organizations seek to process and analyze massive volumes of big data. SAP HANA, Oracle TimesTen In-Memory Database, and IBM solidDB are all examples of in-memory databases.

Apache Sqoop is a tool that allows bulk transfer of data between Hadoop and relational databases. In addition, software vendors are also upgrading their database offerings to co-exist with Hadoop as shown below:

  • Oracle – The Oracle Loader for Hadoop uses MapReduce jobs to create data sets that are optimized for loading and analytics within Oracle relational databases. Oracle Loader for Hadoop uses the CPUs in the Hadoop cluster to format the data for Oracle relational databases. The Oracle Direct Connector for HDFS allows high-speed access to HDFS data from an Oracle database. The data stored in HDFS can then be queried via SQL in conjunction with data within the Oracle relational database.
  • IBM – IBM InfoSphere BigInsights includes a set of Java-based user-defined functions (UDFs) that enable integration with IBM DB2 using SQL.
  • Microsoft – Microsoft offers a bi-directional Hadoop connector for SQL Server.


Legacy database management systems rely on non-relational approaches to database management. Vendors will increasingly re-tool these systems to support big data. For example, the IBM DB2 Analytics Accelerator for z/OS leverages the IBM Netezza appliance to speed up queries issued against a mainframe-based data warehouse running IBM DB2 for z/OS.

5. Big Data Integration

Big data integration technologies fall into a few different categories:

Bulk data movement

Bulk data movement includes technologies such as ETL that extract data from one or more data sources, transform the data, and load the data into a target database. IBM InfoSphere DataStage version 8.7 adds the new Big Data File stage that supports reading and writing multiple files in parallel from and to Hadoop. Informatica PowerCenter has connectors for Twitter, Facebook, and LinkedIn. Informatica PowerExchange has released a Hadoop adapter that moves data from source systems into HDFS and out of HDFS into business intelligence and data warehousing environments. Informatica HParser is a data transformation tool optimized for Hadoop. Informatica’s intention is to allow users to design data integration tasks in HParser and then run them natively on Hadoop without coding. Open source data integration vendors such as Pentaho and Talend are also capturing market share with customers who like their cost effective offerings and Hadoop integration.
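To make the extract-transform-load pattern concrete, here is a toy ETL pass in Python; the CSV source, the sales schema, and the transformations are invented for illustration and bear no relation to any vendor’s tool:

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a (simulated) CSV source file.
source = io.StringIO("id,name,amount\n1, alice ,10.5\n2,BOB,3\n")

def extract(fh):
    return list(csv.DictReader(fh))

# Transform: coerce types and standardize the name field.
def transform(rows):
    return [(int(r["id"]), r["name"].strip().title(), float(r["amount"]))
            for r in rows]

# Load: insert the cleaned rows into a relational target.
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INT, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT name, amount FROM sales").fetchall())
# [('Alice', 10.5), ('Bob', 3.0)]
```

Commercial ETL tools add parallelism, connectors, and metadata on top of this same three-stage flow.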

Data replication

Replication technologies like change data capture can capture big data, such as utility smart meter readings, in near real time with minimal impact to system performance. Replication tools include IBM InfoSphere Data Replication and Oracle GoldenGate. Informatica’s data replication tools, including Fast Clone and Data Replication, offer high-volume replication of data to and from Hadoop.

Data virtualization

Data virtualization, also known as data federation, allows an application to issue SQL queries against a virtual view of data in heterogeneous sources such as relational databases, XML documents, and the mainframe. Vendors include IBM (InfoSphere Federation Server), Informatica (Data Services), Denodo, and Composite Software.

6. Text Analytics

Organizations increasingly want to derive insights from large volumes of unstructured content within call center agents’ notes, social media, IT logs, and medical records. Text analytics is a method for extracting usable knowledge from unstructured text data through the identification of core concepts, sentiments, and trends, and then using this knowledge to support decision-making. SAS Text Analytics and Oracle Endeca Information Discovery offer text analytics capabilities. IBM’s text analytics capabilities are embedded in a number of products including IBM SPSS Text Analytics for Surveys, IBM InfoSphere BigInsights, IBM InfoSphere Streams, IBM Cognos Consumer Insight, IBM Content Analytics, IBM Content and Predictive Analytics for Healthcare, and IBM eDiscovery Analyzer. Clarabridge is a standalone vendor that offers text analytics of surveys, emails, social media, and call center agents’ notes to support customer experience analytics.
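Commercial text analytics engines rely on full NLP pipelines; the following toy keyword tagger, with made-up concepts and notes, only illustrates the shape of the problem, mapping unstructured text to core concepts:

```python
# Toy concept extraction from call center agents' notes. The keyword-to-concept
# mapping and sample notes are hypothetical; real engines use linguistic
# parsing, entity extraction, and sentiment models rather than substring matching.
CONCEPTS = {
    "cancel": "churn risk",
    "refund": "billing dispute",
    "slow": "performance complaint",
}

def tag_note(note):
    """Return the sorted set of concepts detected in a free-text note."""
    text = note.lower()
    return sorted({concept for kw, concept in CONCEPTS.items() if kw in text})

notes = [
    "Customer wants to cancel after a slow month",
    "Asked for a refund on last invoice",
]
for n in notes:
    print(tag_note(n))
```

Even this crude structure (note in, concept list out) is enough to feed downstream analytics such as churn dashboards.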

7. Big Data Discovery

Vendor tools such as IBM InfoSphere Discovery and Information Analyzer, Oracle Enterprise Data Quality Profile and Audit, Informatica Data Explorer, Trillium Software’s TS Discovery, SAP BusinessObjects Data Services, and SAS DataFlux Data Management Studio support traditional data profiling and discovery of structured data at rest. Informatica has also announced its intention to release native Hadoop capabilities for data discovery. We anticipate that other vendors will follow suit. Organizations also need to consider tools for search and discovery of unstructured data. These tools include Oracle Endeca Information Discovery, IBM Vivisimo, and the Google Search Appliance.

8. Big Data Quality

Data quality management is a discipline that includes the methods to measure and improve the quality and integrity of an organization’s data. Traditional data quality tools include IBM InfoSphere QualityStage, Informatica Data Quality, Oracle Enterprise Data Quality, Trillium Software TS Quality, SAS DataFlux Data Management Studio, and SAP BusinessObjects Data Quality Management. However, big data quality will require radically different approaches from a technology perspective. For example, organizations may need to consider the following approaches:

  • Address data quality natively within Hadoop. Informatica has announced its intention to release native Hadoop capabilities for data quality. We anticipate that other vendors will follow suit.
  • Leverage unstructured content to improve the quality of sparse data. For example, a hospital used text analytics to improve the quality of structured data attributes such as “smoker” and “drug and alcohol abuse.” As a result, the hospital improved its ability to identify patients who were most likely to be readmitted within 30 days of treatment for congestive heart failure.
  • Use CEP to improve data quality in real-time without landing data to disk. For example, a telecommunications operator used CEP to de-duplicate telecommunications call detail records in real time, a process known as mediation.
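The third approach above (CEP-based de-duplication, or mediation) can be sketched in a few lines; the record fields and window size are hypothetical, and a production CEP engine would distribute this logic across many nodes:

```python
from collections import deque

# In-stream de-duplication of call detail records: suppress a record if an
# identical key was seen within a sliding window, without landing data to disk.
def dedupe_stream(records, window=1000):
    seen, order = set(), deque()
    for rec in records:
        key = (rec["caller"], rec["callee"], rec["start"])
        if key in seen:
            continue  # duplicate record from a second switch; drop it
        seen.add(key)
        order.append(key)
        if len(order) > window:
            seen.discard(order.popleft())  # expire oldest key from the window
        yield rec

cdrs = [
    {"caller": "A", "callee": "B", "start": 100},
    {"caller": "A", "callee": "B", "start": 100},  # duplicate
    {"caller": "C", "callee": "D", "start": 105},
]
print(len(list(dedupe_stream(cdrs))))  # 2
```

The bounded window is what keeps this feasible in memory at telecom volumes.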

9. Metadata

Metadata is information that describes the characteristics of any data artifact, such as its name, location, perceived importance, quality, or value to the enterprise, and its relationships to other data artifacts that the enterprise deems worth managing. Big data expands the volume, velocity, and variety of information while adding new challenges in building and maintaining a coherent metadata infrastructure. The HCatalog project (formerly known as Howl) is now part of the Apache Incubator. HCatalog is built on top of the Hive metastore and aims to address the lack of metadata support within Hadoop. A number of vendors have metadata offerings including IBM InfoSphere Business Glossary and Metadata Workbench, Informatica Metadata Manager and Business Glossary, Adaptive Metadata Manager, and ASG-Rochade. Organizations need to add big data-related business terms to their business glossaries. As organizations store more of their data within Hadoop, they will need to address data lineage and impact analysis within this environment as well.

10. Information Policy Management

Information governance is all about managing information policies. Whether they recognize it or not, organizations grapple with five important processes relating to information policies:

  i. Documenting policies relating to data quality, metadata, privacy, and information lifecycle management. For example, a big data policy might state that call center agents should not record Social Security numbers in their notes.
  ii. Assigning roles and responsibilities such as data stewards, data sponsors, and data custodians.
  iii. Monitoring compliance with the data policy. In the abovementioned example, the organization might use text analytics tools to identify instances where call center agents’ notes contain Social Security numbers.
  iv. Defining acceptable thresholds for data issues. In the example, the information governance team might determine that the acceptable threshold needs to be zero instances because of the potential privacy implications of having Social Security numbers in clear text.
  v. Managing issues, especially those that are long-lived and affect multiple functions and lines of business. Taking the example further, the information governance team might create a number of trouble tickets so that the customer service team can eliminate any mentions of Social Security numbers within agents’ notes.
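A minimal sketch of what automated monitoring against this zero-tolerance policy might look like; the regular expression and sample notes are illustrative:

```python
import re

# Automated policy monitoring: scan agents' notes for Social Security
# Numbers, with an acceptable threshold of zero violations.
SSN_IN_TEXT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def policy_violations(notes):
    """Return indexes of notes that contain an SSN-like string."""
    return [i for i, note in enumerate(notes) if SSN_IN_TEXT.search(note)]

notes = [
    "Verified account by phone.",
    "Customer gave SSN 123-45-6789 for verification.",  # violation
]
violations = policy_violations(notes)
if violations:  # threshold is zero: any hit should open a trouble ticket
    print(f"{len(violations)} note(s) violate the SSN policy: {violations}")
```

In practice the flagged note indexes would feed an issue-management workflow rather than a print statement.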

Most organizations have been approaching information governance policies in a manual fashion. However, vendors now offer tools to automate the process of managing policy for all types of information including big data. Tools in this space include Kalido Data Governance Director and SAP BusinessObjects Information Steward. Organizations that have made the investment in governance, risk and compliance platforms like IBM OpenPages and EMC RSA Archer eGRC may also elect to extend these tools to document operational controls and to monitor compliance with information policies. Finally, some organizations may also choose to use an existing issue management tool like BMC Remedy to handle data-related issues, although these tools are not specifically targeted at this problem domain.

11. Master Data Management

Organizations may want to enrich their master data with additional insight from big data. For example, they might want to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. Major vendors’ offerings include IBM InfoSphere Master Data Management, Oracle Master Data Management, SAP NetWeaver Master Data Management, and Informatica Master Data Management. Informatica has built a compelling demo to highlight the integration of MDM with Facebook. We anticipate that other vendors will also support integration with social media as part of the so-called “social MDM.” Organizations will also need well-governed, clean reference data such as codes for gender, countries, states, currencies, and diseases to support their big data projects. All the major MDM vendors also offer tools to manage reference data.

12.  Data Warehouses and Data Marts

Organizations have large investments in data warehouses and data marts that may be based on relational databases (such as Oracle Database 11g and IBM DB2), columnar databases (such as SAP Sybase IQ and ParAccel), and data warehousing appliances (such as Oracle Exalytics In-Memory Machine, IBM Netezza, HP Vertica, and EMC Greenplum). The Teradata Aster MapReduce Appliance offers the ability to use SQL with a MapReduce analytics engine on a Teradata hardware platform.

As organizations adopt big data, they will increasingly follow a blended approach that integrates Hadoop and other NoSQL technologies with their traditional data warehousing environments. For example, a large organization generated significant volumes of clickstream data from its web presence. The clickstream data had the following characteristics:

  • Data was in XML format.
  • Each user session generated large volumes of data.
  • The data was sparse and there was only a small amount of insight to be gained from vast quantities of information.
  • Licensing fees made it cost prohibitive to handle the raw clickstream data within the data warehouse.
  • The business intelligence team found it difficult to parse the XML data with their current ETL tool.

The business intelligence team used Hadoop to analyze user browsing patterns within the clickstream data. However, the team needed to marry the browsing data with the sales information in the limited number of cases where the user actually made a purchase. Because the sales information was in the data warehouse, the business intelligence team decided to use ETL to move the clickstream data for actual buyers from Hadoop into the data warehouse.
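The filtering step in that ETL flow can be sketched as follows — a hedged illustration (the session and event field names are hypothetical) of selecting only the buyer sessions for the move from Hadoop into the warehouse:

```python
def sessions_with_purchase(sessions):
    """Keep only the clickstream sessions that ended in a sale, so just
    this small slice is loaded from Hadoop into the data warehouse."""
    return [
        {
            "session_id": s["session_id"],
            "user_id": s["user_id"],
            "pages_viewed": len(s["events"]),
            "order_id": next(e["order_id"] for e in s["events"]
                             if e["type"] == "purchase"),
        }
        for s in sessions
        if any(e["type"] == "purchase" for e in s["events"])
    ]
```

In practice the same predicate would run as a MapReduce or ETL job over the raw XML sessions; only the small surviving slice incurs warehouse storage and licensing costs.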

13.  Big Data Analytics and Reporting

A number of open source and vendor tools can support big data analytics and reporting.

Visualization and Reporting

Vendors such as SAS, IBM (Cognos), SAP (BusinessObjects), Tableau, QlikView, and Pentaho have offerings that can visualize and analyze big data. Vendor product roadmaps increasingly include visualization and reporting of large datasets in Hadoop. For example, SAP has demonstrated the ability to display federated queries within BusinessObjects across Hadoop and HANA instances in the cloud.

Generalized predictive analytics tools

Analytics models will increasingly incorporate big data types. For example, a predictive model for insurance claims fraud might incorporate social media relationships. Vendors are starting to address this requirement within their product roadmaps. According to recent SAS updates, the SAS/Access Interface to Hadoop allows SAS users to treat Hive as just another data source similar to relational databases, data warehousing appliances, and hierarchical databases. SAS Hadoop support allows users to submit Pig, MapReduce, and HDFS commands from within the SAS environment. SAS also provides the ability to create UDFs that can be deployed within HDFS. This includes the ability to use SAS Enterprise Miner to take analytical scoring code and produce a UDF that can be deployed within HDFS and accessed by Hive, Pig, or MapReduce code. Microsoft has also been surprisingly aggressive with Hadoop support. Microsoft’s Hive ODBC driver enables users of Microsoft SQL Server Analysis Services, PowerPivot, and Power View to interact with Hadoop data. In addition, Microsoft’s Hive add-on for Excel enables users to interact with Hadoop data from a spreadsheet environment. Finally, R is an open source package that is often used to conduct statistical analyses of large datasets in Hadoop.
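As a rough sketch of what such a deployed scoring function computes — the coefficients and feature names below are purely illustrative, not taken from any real model or vendor UDF:

```python
import math

def fraud_score(claim_amount, num_suspect_links, days_since_policy_start):
    """Hypothetical logistic fraud-scoring function of the kind an
    analytics tool can export as a UDF callable from Hive, Pig, or
    MapReduce. All coefficients are made up for illustration."""
    z = (-4.0
         + 0.0004 * claim_amount
         + 0.3 * num_suspect_links        # e.g. social links to known fraudsters
         - 0.01 * days_since_policy_start)
    return 1.0 / (1.0 + math.exp(-z))     # probability-like score in (0, 1)
```

The point of deploying the scoring code inside HDFS is that the data never leaves the cluster: the model travels to the claims, not the other way around.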

Social listening

A slew of vendors such as Attensity, Lithium, and Salesforce Radian6 offer tools to address so-called “social listening” requirements. In addition, mega vendors such as IBM with Cognos Consumer Insight, Oracle with Collective Intellect, and SAS with Social Media Analytics also have offerings in this space.

Specialized analytics

A number of vendors offer specialized tools for big data. One notable example is Splunk, which offers tools that analyze machine-to-machine data from applications and network logs to reduce application downtime and improve network security.

14.  Big Data Security and Privacy

Much has already been said about the issues relating to Hadoop security. Hadoop is still an emerging technology and we anticipate that these issues will be resolved as large companies and vendors get involved. We discuss two important technologies relating to data security and privacy. To the best of our knowledge, these tools do not support Hadoop today. However, we anticipate that vendors will include Hadoop support in their product roadmaps.

Data Masking

These tools are critical to de-identify sensitive information, such as birth dates, bank account numbers, street addresses, and Social Security numbers. Tools in this space include IBM InfoSphere Optim Data Masking Solution and Informatica Data Masking solutions.
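A toy sketch of deterministic masking follows — not production-grade format-preserving encryption (real tools use vetted algorithms and proper key management), and the key and field format are assumptions:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative only; manage keys outside source code

def mask_digits(value, key=SECRET_KEY):
    """Deterministically replace each digit so the masked value keeps the
    original format (and can still join consistently across tables),
    while hiding the real number. Separators such as '-' stay in place."""
    out = []
    for i, ch in enumerate(value):
        if ch.isdigit():
            digest = hmac.new(key, f"{i}:{ch}".encode(), hashlib.sha256).digest()
            out.append(str(digest[0] % 10))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `mask_digits("123-45-6789")` yields a value with the same Social Security number shape, so downstream applications and tests keep working against the masked copy.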

Database Monitoring

These tools enforce separation of duties and monitor access to sensitive big data by privileged users. For example, telecommunications operators can use database monitoring to monitor access to sensitive call detail records which reveal subscribers’ calling patterns. In addition, utilities can use these tools to monitor access to smart meter readings that reveal when consumers are in and out of their homes. The database monitoring functionality must have a minimal impact on database performance and should not require any changes to databases or applications. Vendors include IBM (InfoSphere Guardium) and Imperva.

15.  Big Data Lifecycle Management

Information lifecycle management (ILM) is a process and methodology for managing information through its lifecycle, from creation through disposal, including compliance with legal, regulatory, and privacy requirements. The components of a big data lifecycle management platform are listed below:

Information archiving

As big data volumes grow, organizations need solutions that enable efficient and timely archiving of structured and unstructured information while enabling its discovery for legal requirements, and its timely disposition when no longer needed by the business, legal, or records stakeholders. We discuss three types of big data below:

  • Social media – This big data type is subject to retention policies driven by eDiscovery and regulations from authorities such as FINRA in the United States. A recent blog post indicated that this trend is driving an entirely new class of social media archiving tools from vendors such as Arkovi, Backupify, Cloud Preservation, Erado, Hanzo Archives, and PageFreezer.
  • Big transaction data and machine-to-machine data – RainStor uses data compression techniques to reduce the volume of big data. RainStor delivers two editions of its product to manage massive volumes of structured, semi-structured, and unstructured data such as telephone CDRs, utility smart meter readings, and log files.
  • Hadoop – Organizations are also discovering the value of Hadoop as a cost-effective archive for applications such as email.

Vendor offerings such as Symantec Enterprise Vault, HP Autonomy Consolidated Archive, IBM Smart Archive, and EMC SourceOne are positioned as unified archives for a variety of data types.

Records and retention management

Every ILM program must maintain a catalog of laws and regulations that apply to information in the jurisdictions in which a business operates. These laws, regulations, and business needs drive the need for a retention schedule that determines how long documents should be kept and when they should be destroyed. Records management solutions enforce a business process around document retention. Vendor tools include IBM Enterprise Records, EMC Documentum Records Manager, HP Autonomy Records Manager, and OpenText Records Management.

Legal Holds and Evidence Collection (eDiscovery)

Most corporations and entities are subject to litigation and governmental investigations that require them to preserve potential evidence. Large entities may have hundreds or thousands of open legal matters with varying obligations for data. Data sources include email, instant messages, Excel spreadsheets, PDF documents, audio, video, and social media. Vendor tools include Symantec Enterprise Vault, HP Autonomy eDiscovery, IBM eDiscovery Manager, Recommind Axcelerate eDiscovery Suite, Nuix eDiscovery, ZyLAB eDiscovery and Production System, and Guidance Software EnCase eDiscovery.

Test Data Management

The big data governance program needs tools to streamline the creation and management of test environments, subset and migrate data to build realistic and right-sized test databases, mask sensitive data, automate test result comparisons, and eliminate the expense and effort of maintaining multiple database clones. IBM InfoSphere Optim Test Data Management Solution and Informatica Data Subset streamline the creation and management of test environments.
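Subsetting with referential integrity is the heart of right-sizing a test database. A minimal sketch, assuming two hypothetical tables held as lists of dicts:

```python
def subset(customers, orders, sample_ids):
    """Build a referentially consistent test subset: the sampled
    customers plus only the orders that reference them, so no test
    order points at a customer that was left behind."""
    keep = set(sample_ids)
    sub_customers = [c for c in customers if c["id"] in keep]
    sub_orders = [o for o in orders if o["customer_id"] in keep]
    return sub_customers, sub_orders
```

A real tool walks the full foreign-key graph across many tables and applies masking on the way out; the principle is the same.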

16.  Cloud

Organizations are also turning to the cloud because of perceived flexibility, faster time-to-deployment, and reduced capital expenditure requirements. A number of vendors offer big data platforms in the cloud and we list a few examples below:

Mega cloud vendors

Amazon Web Services offers a hosted Hadoop framework as part of the Amazon Elastic MapReduce web service. The Google Cloud Platform allows organizations to build applications, store large volumes of data, and analyze massive datasets on Google’s computing infrastructure.

Data brokers
Data brokers include companies such as Acxiom, Reed Elsevier, Thomson Reuters, and, literally, thousands of others that specialize by dataset and industry. These companies offer many types of data enrichment and validation services to organizations.

Mega IT vendors
HP Converged Cloud enables organizations to move between private, hybrid, and public cloud services.

Information management software vendors
Offerings such as Trillium Software TS Quality on Demand and SAS DataFlux Marketplace provide validation, cleansing, and enrichment of name, email, and address as a service. Informatica Cloud provides data loading, synchronization, profiling, and quality services for Salesforce and other cloud applications.

In summary, big data has game-changing potential with the advent of new data types and emerging technologies such as Hadoop, NoSQL, and streaming analytics. To take advantage of these developments, organizations need to create a reference architecture that integrates these emerging technologies into their existing infrastructure. As always, I would appreciate your feedback. Please feel free to leave a comment, send me an email, or find me on Twitter at @sunilsoares1.

Posted in Big Data, Data Governance, Sunil Soares | 6 Comments

IBM InfoSphere for Big Data Integration and Governance

Big Data is generally characterized in terms of the three “V’s” of volume, velocity and variety. IBM is building out a big data platform for both big data at rest and big data in motion. The core components of this platform include:

  • IBM InfoSphere BigInsights, an enterprise-ready Hadoop distribution from IBM
  • IBM InfoSphere Streams, a complex event processing (CEP) engine that leverages massively parallel processing capabilities to analyze big data in motion
  • IBM Netezza, a family of data warehouse appliances that support parallel in-database analytics for fast queries against very large datasets

This blog discusses how IBM InfoSphere provides integration and governance capabilities as part of the IBM big data platform.

1. Big Data Integration

IBM InfoSphere DataStage Version 8.7 adds a new Big Data File stage that supports reading and writing multiple files in parallel from and to Hadoop, simplifying how data is merged into a common transformation process. IBM InfoSphere Data Replication supports data replication, including change data capture, which can be used to integrate rapidly changing data such as utility smart meter readings. IBM InfoSphere Federation Server supports data virtualization.

2. Big Data Search and Discovery

IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery can be used to profile structured data as part of a big data governance project. Before building a CEP application using IBM InfoSphere Streams, big data teams need to understand the characteristics of the data. For example, to support the real time analysis of Twitter data, the big data team would use the Twitter API to download a sample set of Tweets for further analysis. The profiling of streaming data is similar to traditional data projects using IBM InfoSphere Information Analyzer and IBM InfoSphere Discovery. Both types of projects need to understand the characteristics of the underlying data, such as the frequency of null values. IBM also recently announced the acquisition of Vivisimo for search and discovery of unstructured content. For example, an insurance carrier can reduce average handling time by providing call center agents with searchable access to multiple document repositories for customer care, alerts, policies and customer information files.
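Profiling a downloaded sample — for streaming and traditional data alike — often starts with null frequencies per field. A minimal sketch over semi-structured records (the field names below are arbitrary):

```python
def null_frequency(records):
    """Fraction of records in which each observed field is missing or
    None — the kind of basic profile a team builds before designing a
    streaming application."""
    fields = set()
    for r in records:
        fields.update(r)
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n
            for f in sorted(fields)}
```

Running this over a few thousand sampled tweets would show, for instance, how rarely the geolocation field is populated before the team commits to a design that depends on it.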

3. Big Data Quality

IBM InfoSphere QualityStage can be used to cleanse structured data as part of a big data governance project. In addition, developers can write MapReduce jobs in IBM InfoSphere BigInsights to address data quality issues for unstructured and semi-structured data in Hadoop. Finally, IBM InfoSphere Streams can deal with streaming data quality issues such as the rate of arrival of data. If IBM InfoSphere Streams “knows” that a particular sensor creates events every second, then it can generate an alert if it does not receive an event after 10 seconds.
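The heartbeat check described above amounts to flagging sensors whose last event is older than the expected interval. A batch-style sketch of the same rule (a CEP engine would evaluate it continuously over the stream):

```python
def silent_sensors(last_seen, now, max_gap=10.0):
    """Return sensors whose most recent event (epoch seconds in
    last_seen) is older than max_gap -- e.g. a once-per-second sensor
    that has been quiet for more than 10 seconds."""
    return sorted(s for s, t in last_seen.items() if now - t > max_gap)
```

The interesting design question is the threshold: 10x the expected interval, as in the example above, trades a little detection latency for far fewer false alarms from ordinary jitter.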

4. Metadata for Big Data

IBM InfoSphere Business Glossary and IBM InfoSphere Metadata Workbench manage business and technical metadata. The information governance team needs to extend business metadata to cover big data types. For example, the term “unique visitor” is a fundamental building block of clickstream analytics and is used to measure the number of individual users of a website. However, two sites may measure unique visitors differently, with one site counting unique visitors within a week while another one may measure unique visitors within a month.
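The "unique visitor" ambiguity is easy to see in code: the same traffic yields different counts depending on whether the window is an ISO week or a calendar month. A small sketch:

```python
from datetime import date

def unique_visitors(visits, period):
    """Count distinct visitor ids per window. visits is a list of
    (visitor_id, date) pairs; period is 'week' (ISO week) or 'month'."""
    buckets = {}
    for visitor_id, day in visits:
        if period == "week":
            iso = day.isocalendar()
            key = (iso[0], iso[1])          # (ISO year, ISO week)
        else:
            key = (day.year, day.month)
        buckets.setdefault(key, set()).add(visitor_id)
    return {k: len(v) for k, v in buckets.items()}

visits = [("u1", date(2012, 1, 2)), ("u1", date(2012, 1, 9)),
          ("u2", date(2012, 1, 3))]
# Weekly windows count u1 in both week 1 and week 2, while the monthly
# window sees only 2 unique visitors in January -- the two definitions
# disagree for the very same traffic.
```

This is exactly why the business glossary must pin down one definition before two sites' clickstream numbers are compared.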

5. Master Data Management

High quality master data is a critical enabler of a big data program. For example, the risk department at a bank can use text analytics on SEC 10-K and 10-Q financial filings to update customer risk management hierarchies in real time as ownership positions change. In another example, the operations department might use sensor data to identify defective equipment but needs consistent asset nomenclature if it wants to also replace similar equipment in other locations. Finally, organizations might embark on a “social MDM” program to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. IBM has taken the first step towards unifying its diverse MDM offerings including Initiate and MDM for Product Information Management under the banner of IBM InfoSphere Master Data Management V10.

6. Reference Data Management

The importance of high quality reference data for big data cannot be overstated. For example, an information services company used a number of data sources like product images from the web, point of sale transaction logs, and store circulars to improve the quality of its internal product master data. The company used web content to validate manufacturer-provided Universal Product Codes, which are 12-digit codes represented as barcodes on products in North America. The company also validated product attributes, such as a shampoo listed as 4 oz. on the web versus 3.8 oz. in the product master. To support its master data and data quality initiatives, the company maintained reference data including thousands of unique values for color. For example, it maintained reference data to indicate that “RED,” “RD,” and “ROUGE” referred to the same color. IBM InfoSphere MDM Reference Data Management Hub manages reference data such as codes for countries, states, industries, and currencies.
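Both validations in this example are straightforward to sketch. The UPC-A check digit follows the published GS1 rule, while the color table below is a tiny hypothetical slice of the thousands of values such a company maintains:

```python
# Tiny illustrative slice of a color reference table
COLOR_SYNONYMS = {"RED": "RED", "RD": "RED", "ROUGE": "RED"}

def canonical_color(value):
    """Map a raw color string to its canonical reference value."""
    return COLOR_SYNONYMS.get(value.strip().upper(), value.strip().upper())

def valid_upc(code):
    """Validate a 12-digit UPC-A using the GS1 check digit rule: 3x the
    digits in odd positions plus the digits in even positions, including
    the check digit itself, must sum to a multiple of 10."""
    if len(code) != 12 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2]) + digits[11]
    return total % 10 == 0
```

A code that fails the checksum is a strong signal that the manufacturer-provided value was mistyped rather than merely unusual, which is what makes this check so useful during product master cleanup.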

7. Big Data Security and Privacy

IBM offers a number of tools for big data security and privacy:

  • Data Masking – IBM InfoSphere Optim Data Masking Solution applies a variety of data transformation techniques to mask sensitive data with contextually accurate and realistic information.
  • Database activity monitoring – IBM InfoSphere Guardium creates a continuous, fine-grained audit trail of all database activities, including the “who,” “what,” “when,” “where” and “how” of each transaction. This audit trail is continuously analyzed and filtered in real-time, to identify unauthorized or suspicious activities, including by privileged users. This solution can be deployed in a number of situations that monitor access to sensitive big data. For example, telecommunications operators can use Guardium to monitor access to sensitive call detail records which reveal subscribers’ calling patterns. In addition, utilities can use Guardium to monitor access to smart meter readings that reveal when consumers are in and out of their homes.

8. Big Data Lifecycle Management

IBM has developed a robust big data lifecycle management platform:

  • Archiving – IBM Smart Archive includes software, hardware, and services offerings from IBM. The solution includes IBM InfoSphere Optim and the IBM Content Collector family for multiple data types including email, file systems, Microsoft SharePoint, SAP applications, and IBM Connections. It also includes the IBM Content Manager and IBM FileNet Content Manager repositories.
  • eDiscovery – IBM eDiscovery Solution enables legal teams to define evidence obligations, coordinate with IT, records, and business teams, and reduce the cost of producing large volumes of evidence in legal matters. The solution includes IBM Atlas eDiscovery Process Management that enables legal professionals to manage a legal holds workflow.
  • Records and retention management – IBM Records and Retention Management helps an organization manage records according to a retention schedule.
  • Test Data Management – IBM InfoSphere Optim Test Data Management Solution streamlines the creation and management of test environments, subsets data to create realistic and right-sized test databases, masks sensitive data, automates test result comparisons, and eliminates the expense and effort of maintaining multiple database clones.

I anticipate that IBM will continue to build out its integration and governance capabilities for big data. Hopefully, this blog provides a useful reference for companies that already have or are considering IBM InfoSphere for big data. As always, comments are welcome. You can also reach me on Twitter at @sunilsoares1.

Posted in Big Data, Data Governance, Sunil Soares | Leave a comment