I had the opportunity to sit in on a presentation by Arka Mukherjee, CEO and Founder of Global IDs, at the Data Governance Winter Conference in Ft. Lauderdale in December 2012. The session had an intriguing title that captured my interest: “Merging Big Data with Enterprise Data.” Arka acknowledged that most of Global IDs’ use cases revolve around “small data” today. However, the company works with several Fortune 1000 companies, including a large telecommunications operator.
Arka provided some useful background information on big data from the perspective of Global IDs:
- Volume – Most of the big data use cases deal with petabytes versus megabytes of data
- Format – Big data is primarily in unstructured format
- Sparsity – Big data has a very low signal-to-noise ratio, sort of like “searching for a needle in a needle stack”
- NoSQL – Big data involves NoSQL database platforms like Cassandra, Hadoop, MongoDB, Neo4j, etc.
Arka went on to emphasize that big data requires some unique skills:
- Data modeling with ontologies
- Data analytics with R
- Data movement in and out of NoSQL DBs
- Data governance on Resource Description Framework (RDF) stores
Arka mentioned the following use cases around linking big data with enterprise master data:
- Linking product master data with product safety websites
- Linking enterprise reference data with Linked Open Data, which is arguably the biggest integration project of its kind, linking thousands of government databases with RDF
- Linking supplier master data with news websites
- Linking enterprise master data with third-party data such as market data, demographics, and credit reports from Acxiom and D&B
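Linking master data to external sources like news websites ultimately comes down to record linkage: normalizing entity names and scoring their similarity. Here is a minimal, hypothetical sketch in Python using only the standard library — the supplier names, news mentions, normalization tokens, and the 0.8 threshold are all invented for illustration, and production tools use far more sophisticated matching:

```python
from difflib import SequenceMatcher

# Hypothetical supplier master records (illustrative names, not real data)
supplier_master = ["Acme Corporation", "Globex Industries", "Initech LLC"]

# Company mentions pulled from an external source such as a news feed
news_mentions = ["ACME Corp.", "Globex Industry", "Umbrella Group"]

def normalize(name: str) -> str:
    """Lowercase and strip punctuation and common legal suffixes before comparing."""
    name = name.lower()
    for token in (".", ",", "corporation", "corp", "industries", "industry", "llc"):
        name = name.replace(token, "")
    return " ".join(name.split())

def link_records(master, external, threshold=0.8):
    """Return (master, external, score) pairs whose similarity clears the threshold."""
    links = []
    for m in master:
        for e in external:
            score = SequenceMatcher(None, normalize(m), normalize(e)).ratio()
            if score >= threshold:
                links.append((m, e, round(score, 2)))
    return links

links = link_records(supplier_master, news_mentions)
```

The normalization step does most of the work here; the similarity score merely confirms that the cleaned names line up.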
Global IDs provides “extreme data governance” capabilities in large and complex data environments. They claim to take the problem away from humans and let machines handle these processing-intensive tasks. Arka mentioned that a recent client had a million databases, and that an average client will look at three to four million attributes.
An interesting company that I will definitely want to learn more about.
I’ve been hearing a lot of buzz from clients about Pentaho. One large bank mentioned that they loaded their Hadoop environment in seconds with Pentaho versus several minutes with other leading ETL tools. So I decided to do some research on Pentaho to see what all the fuss was about.
It turns out that Pentaho offers a tightly integrated suite for business intelligence and data integration.
Pentaho Business Analytics offers reporting, dashboarding, and analytical capabilities based on in-memory technology. The tool offers strong visualization capabilities including support for scatter plots, heat grids, and geo-mapping.
Pentaho Business Analytics leverages Pentaho Data Integration to provide analytics for data residing in relational databases, NoSQL stores, Hadoop, and business intelligence appliances. Pentaho Data Integration is an ETL tool that supports connectivity to OLTP, analytical and NoSQL databases such as Oracle, Greenplum, Teradata, Netezza, Apache Cassandra, and MongoDB as well as to unstructured and semi-structured sources such as Hadoop, Excel, XML, and RSS feeds.
The Pentaho platform comes in two basic flavors: a community edition that is open source, and an enterprise edition that includes product support and advanced features.
Pentaho has been gaining strong traction with organizations due to its big data support, open source heritage, and cost effectiveness.
IBM System z is the premier mainframe platform in the market today. Many of the largest enterprises in the world rely on System z for their business critical operations. At the same time, these organizations are also maturing their data governance and big data initiatives. However, in many cases, these organizations have not tied their data governance and big data initiatives back to the mainframe where most of their mission-critical data currently resides.
The light bulb went off for me when I was talking to the manager of data governance at a financial services institution. We were establishing information policies around email addresses. They had to quantify the number of missing email addresses but that data was on the mainframe. They wanted to set policies around customer duplicates but their customer information file was on the mainframe. Their CIO wanted to reduce storage costs. Most of that data was on the mainframe. Their Chief Information Security Officer (CISO) needed to set policies, but most of the data was (you guessed it) on the mainframe.
Now, let’s talk about big data. Yes, there’s lots of hype but I fully expect that some of those vast oceans of data will land on System z. An insurer recently mentioned that they were looking at a telematics program that placed sensors on automobiles. They also mentioned that this treasure trove of sensor data (big data) would probably end up in the System z environment.
IBM actually has a broad portfolio of tools for governing big and small data on System z. I have listed a few below:
- Data Profiling – Assessing the current state of the data is often the first step in a data governance program. IBM InfoSphere Information Analyzer offers data profiling capabilities on System z.
- Data Discovery – A large financial institution had thousands of VSAM files. They found that Social Security Numbers were hidden in a field called EMP_NUM. A data discovery tool like IBM InfoSphere Discovery helped them discover hidden sensitive data on the mainframe.
- Business Glossary – IBM InfoSphere Business Glossary for Linux on System z enables large mainframe shops to deploy their business glossaries in a System z environment.
- Data Archiving – I know of at least one large institution that had data from the 1950s still sitting on the mainframe. IBM InfoSphere Optim Data Growth helped rationalize their MIPS costs by moving older data to less expensive storage.
- Database Monitoring – Many large financial institutions want to monitor access by privileged users like DBAs to sensitive data on the mainframe. IBM InfoSphere Guardium now offers S-TAP support for IMS, VSAM and DB2 on z/OS.
- Big Data – The IBM DB2 Analytics Accelerator for z/OS allows companies to accelerate analytical queries by leveraging the power of a Netezza appliance.
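The data discovery scenario above — Social Security Numbers hiding in a field called EMP_NUM — can be illustrated with a toy pattern scan. This is a hypothetical sketch, not how IBM InfoSphere Discovery actually works; the field contents and the 50% flagging threshold are invented:

```python
import re

# SSN-shaped values: 9 digits, optionally dash-separated (a simplified pattern)
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")

# Hypothetical extract of a field named EMP_NUM from a legacy file
records = [
    {"EMP_NUM": "123-45-6789"},  # looks like an SSN despite the field name
    {"EMP_NUM": "987654321"},    # also SSN-shaped
    {"EMP_NUM": "E-1042"},       # an actual employee number
]

def classify_field(rows, field, pattern, threshold=0.5):
    """Flag a field as potentially sensitive when most of its values match the pattern."""
    values = [r[field] for r in rows if r.get(field)]
    hits = sum(1 for v in values if pattern.match(v))
    ratio = hits / len(values) if values else 0.0
    return ratio >= threshold, ratio

flagged, ratio = classify_field(records, "EMP_NUM", SSN_PATTERN)
```

The point of the sketch is that discovery tools classify fields by what the values look like, not by what the field name claims — which is exactly how sensitive data hiding under an innocuous name gets caught.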
I am not suggesting that every data governance program needs to now focus on System z. However, there are many large IBM mainframe shops that are building out their data governance and big data programs. I do believe that these organizations would benefit from a better alignment between System z and their data governance and big data initiatives.
Information policies are a crucial deliverable from any data governance program. Whether they recognize it or not, organizations grapple with five important processes relating to information policies:
- Documenting policies relating to data quality, metadata, privacy, and information lifecycle management. For example, an information policy might state that call center agents need to search for a customer name before creating a new record.
- Assigning roles and responsibilities such as data stewards, data sponsors, and data custodians.
- Monitoring compliance with the information policy. In the example above, the organization might measure the number of duplicate customer records to monitor adherence by call center agents to the information policy.
- Defining acceptable thresholds for data issues. In the example, the data governance team may determine that a three percent duplicate rate is an acceptable threshold for customer data quality because it is uneconomical to pursue issue resolution when the percentage of duplicates falls below that level.
- Managing issues especially those that are long-lived and affect multiple functions and lines of business. Taking the example further, the data governance team may create a number of trouble tickets so that the customer data stewards can eliminate duplicate records.
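The monitoring, threshold, and issue-management steps above can be sketched in a few lines of Python. This is a hypothetical illustration: the three percent threshold comes from the example, but the match keys and sample records are invented, and real duplicate detection relies on fuzzy matching rather than exact keys:

```python
# Hypothetical customer records; "key" is a normalized name+address match key
customers = [
    {"id": 1, "key": "jane doe|10 main st"},
    {"id": 2, "key": "jane doe|10 main st"},   # duplicate of id 1
    {"id": 3, "key": "john smith|22 oak ave"},
    {"id": 4, "key": "ana lopez|5 elm rd"},
]

DUPLICATE_THRESHOLD = 0.03  # the three percent tolerance set by the governance team

def duplicate_rate(records):
    """Share of records beyond the first occurrence of each match key."""
    seen, dupes = set(), 0
    for r in records:
        if r["key"] in seen:
            dupes += 1
        else:
            seen.add(r["key"])
    return dupes / len(records) if records else 0.0

rate = duplicate_rate(customers)
open_ticket = rate > DUPLICATE_THRESHOLD  # breach would feed issue management
```

When the measured rate exceeds the agreed threshold, the governance team would raise trouble tickets for the data stewards; below it, the policy deems remediation uneconomical.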
Most organizations take a manual approach to information policy management. This approach may work well at the start but quickly becomes unsustainable as the number of information policies increases.
Several software vendors now offer tools to automate the management of policies for all types of information. The functionality of these tools depends on their heritage. I classify these tools into the following categories:
- Standalone tools – Kalido Data Governance Director is noteworthy in this category.
- Tools with a business glossary heritage – Collibra Business Semantics Glossary and IBM InfoSphere Business Glossary are primarily focused on managing business terms. However, the functionality of these tools around stewardship, categories and workflows means that they can also be extended to managing information policies.
- Enterprise applications – Enterprise applications offer a mechanism to manage information policies in the context of key business processes. I have been writing extensively about this topic. http://bit.ly/QfnM5N
As an example, SAP BusinessObjects Information Steward provides targeted capabilities for data stewards to manage and monitor data quality scorecards, data validation business rules, business definitions, and metadata. In addition, SAP Master Data Governance provides capabilities to enforce information policies in the context of business processes.
- GRC tools – Many organizations have made significant investments in Governance, Risk and Compliance (GRC) platforms like IBM OpenPages and EMC RSA Archer eGRC. These organizations may also elect to extend these tools to document operational controls and to monitor compliance with information policies.
- Issue management platforms – I know of at least one organization that chose to use an existing issue management tool like BMC Remedy to track data-related issues, although these tools are not specifically targeted at this problem domain.
I intend to write more about this topic. Have I missed anything? Please comment.
I have been hearing some buzz about Collibra’s data governance software over the past year or so. The chatter came from sales reps, clients, and analysts. I’ve also run into the Collibra folks at different conferences. So I decided to do a bit of research on the company. They are VC-backed and based in Belgium, with operations in the U.S. and Northern Europe. I had a couple of meetings with Stijn Christiaens, COO and Co-Founder, and Benny Verhaeghe, Director of Sales & Marketing. Stijn also gave a couple of demonstrations of their software.
I must say that I really like it. The software does a nice job of what it sets out to do. It has an intuitive interface and provides a workflow to manage business terms and data governance policies. It also lets an administrator set up roles within the organization, such as the data governance council and data stewards. The whole area of information policy management is underserved by existing software vendors, and Collibra makes some strides in this regard.
To be clear, Collibra does not try to be a full-fledged data governance platform. The tool does not yet offer support for technical metadata, data profiling, data quality, or master data management. But it is a nice entry-level tool for organizations that want to get started with data governance.
I strongly believe that industry orientation and verticalization are the next big step in the evolution of data governance. Most data governance training programs deal with best practices in a cross-industry manner. However, data governance in banking is completely different from data governance in health plans. I am already doing a public course on Data Governance Fundamentals in Chicago on February 5-6, 2013.
So I got around to thinking that it would be valuable to put something like this together for a specific industry. Wouldn’t it be nice if health plans had a data governance template that was specific to their industry? A data governance charter for health plans. A sample data governance organization for health plans with roles and responsibilities. A sample data quality scorecard with member and provider KPIs. A sample business case.
My last two books “Selling Information Governance to the Business” and “Big Data Governance” both have chapters on healthcare. Based on these books and my own consulting experience, I have developed a two-day training class tailored to health plans. I have already been delivering this class over the past several months.
The topics for the two-day class are as follows:
- Overview of data governance in health plans
- Building the business case for data governance in health plans (e.g., Member 360)
- Member Data Governance
- Provider Data Governance
- Organizing for data governance (e.g., sample data governance charter for health plans, sample data governance organization including Medical Informatics, Member Services, Network Management, Marketing, Finance and Privacy)
- Data Stewardship Fundamentals
- Writing data governance policies (e.g., sharing claims data with external parties)
- Building a business glossary
- Creating a data quality scorecard
- Information lifecycle governance overview (e.g., defensible disposition, test data management, archiving and data retention)
- Aligning with Security and Privacy (e.g., HIPAA compliance, aligning with the chief information security officer and chief privacy officer, leveraging data discovery to discover sensitive data)
- Big data governance
- Reference architecture for data governance
At the end of the class, participants will have a binder of courseware that is specific to health plans. I have taken great pains to use only healthcare examples and content.
I will be emphasizing the industry-orientation of data governance during my workshop on Industry Best Practices at the Data Governance Winter Conference in Ft. Lauderdale in a few weeks.
These are truly exciting times.