A Reference Architecture for Big Data must include a Focus on Governance and Integration with an Organization’s Existing Infrastructure
Reference architecture for big data.
There is a lot of hype about technologies like Apache Hadoop and NoSQL because of their ability to help organizations gain insights from vast quantities of high velocity, semi-structured, and unstructured data in a cost-effective manner. However, big data does not give IT a license to “rip and replace,” and CIOs want to understand how these technologies will interact with the organization’s technical architecture. The figure above describes a reference architecture for big data. I will discuss each component of the reference architecture in this article.
I want to provide just one caveat before I get started. Vendors market a dizzying array of big data offerings, and it is not possible for me to cover every vendor and every offering in this article.
1. Big Data Sources
Big data types include web and social media, machine-to-machine, big transaction data, biometrics, and human generated data. This data may be in structured, unstructured, and semi-structured formats.
2. Hadoop Distributions
Because it consists of a bewildering array of technologies with their own release schedules, Hadoop can be somewhat intimidating to the novice user. A number of vendors have created their own commercial distributions of Apache Hadoop that have undergone release testing, and bundle product support and training. Most enterprises that have deployed Hadoop for commercial use have selected one of the Hadoop distributions. Standalone vendors who offer Hadoop distributions include Cloudera, MapR, and Hortonworks. In addition, IBM offers a Hadoop distribution called InfoSphere BigInsights. Amazon Web Services offers a Hadoop framework that is part of a hosted web service called Amazon Elastic MapReduce. EMC offers a Hadoop distribution called Greenplum HD. Microsoft has also announced the availability of the community technology preview of its Hadoop distribution as a cloud-based service on Windows Azure as well as an on-premise version on Windows Server.
3. Streaming Analytics
Hadoop is well suited to handle large volumes of data at rest. However, big data also involves high velocity data in motion. Streaming analytics, also known as complex event processing (CEP), refers to a class of technologies that leverage massively parallel processing capabilities to analyze data in motion as opposed to landing large volumes of data to disk. There are a number of open source and vendor tools in this space. For example, Apache Flume is an incubator effort that uses streaming data flows to collect, aggregate, and move large volumes of data into the Hadoop Distributed File System (HDFS). IBM offers a tool called InfoSphere Streams that grew out of early work with the United States government. StreamBase, SAP Sybase Event Stream Processor, and Informatica RulePoint also offer CEP engines.
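To make the idea of analyzing data in motion more concrete, here is a minimal Python sketch of a trailing-window average computed as events arrive. It is not how any of the products above are implemented; the function name, the (timestamp, value) event shape, and the assumption that events arrive in time order are all mine, chosen for illustration only.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, measured value) - assumed shape


def sliding_window_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Event]:
    """Yield (timestamp, average over the trailing window) for each event.

    Events are assumed to arrive in time order. Only the events inside the
    window are held in memory; nothing is landed to disk.
    """
    window: Deque[Event] = deque()
    running_sum = 0.0
    for timestamp, value in events:
        window.append((timestamp, value))
        running_sum += value
        # Evict events that have fallen out of the trailing window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            running_sum -= old_value
        yield timestamp, running_sum / len(window)
```

The point of the sketch is the memory profile: state is bounded by the window, not by the total volume of data, which is what distinguishes this style of processing from batch analysis of data at rest.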
4. Databases
Enterprises have the ability to select from multiple database approaches:
NoSQL (“not only SQL”) databases are a category of database management systems that do not use SQL as their primary query language. These databases may not require fixed table schemas, and typically do not support join operations. They are generally optimized for highly scalable read-write operations rather than for consistency. NoSQL databases include a vast array of offerings such as Apache HBase, Apache Cassandra, MongoDB, Apache CouchDB, Couchbase, Riak, and Amazon DynamoDB, which forms part of the Amazon Web Services platform. DataStax offers an Enterprise edition that includes a Hadoop distribution, and replaces HDFS with CassandraFS.
In-memory database management systems rely on main memory for data storage. Compared to traditional database management systems that store data to disk, in-memory databases are optimized for speed. In-memory databases will become increasingly important as organizations seek to process and analyze massive volumes of big data. SAP HANA, Oracle TimesTen In-Memory Database, and IBM solidDB are all examples of in-memory databases.
Apache Sqoop is a tool that allows bulk transfer of data between Hadoop and relational databases. In addition, software vendors are also upgrading their database offerings to co-exist with Hadoop as shown below:
- Oracle – The Oracle Loader for Hadoop uses MapReduce jobs to create data sets that are optimized for loading and analytics within the Oracle relational databases. Oracle Loader for Hadoop uses the CPUs in the Hadoop cluster to format the data for Oracle relational databases. The Oracle Direct Connector for HDFS allows high-speed access to HDFS data from an Oracle database. The data stored in HDFS can then be queried via SQL in conjunction with data within the Oracle relational database.
- IBM – IBM InfoSphere BigInsights includes a set of Java-based user-defined functions (UDFs) that enable integration with IBM DB2 using SQL.
- Microsoft – Microsoft offers a bi-directional Hadoop connector for SQL Server.
Legacy database management systems rely on non-relational approaches to database management. Vendors will increasingly re-tool these systems to support big data. For example, the IBM DB2 Analytics Accelerator for z/OS leverages the IBM Netezza appliance to speed up queries issued against a mainframe-based data warehouse running IBM DB2 for z/OS.
5. Big Data Integration
Big data integration technologies fall into a few different categories:
Bulk data movement
Bulk data movement includes technologies such as ETL to extract data from one or more data sources, transform the data, and load the data into a target database. IBM InfoSphere DataStage Version 8.7 adds the new Big Data File stage that supports reading and writing multiple files in parallel from and to Hadoop. Informatica PowerCenter has connectors for Twitter, Facebook, and LinkedIn. Informatica PowerExchange has released a Hadoop adapter that moves data from source systems into HDFS and out of HDFS into business intelligence and data warehousing environments. Informatica HParser is a data transformation tool optimized for Hadoop. Informatica’s intention is to allow users to design data integration tasks in HParser and then run them natively on Hadoop without coding. Open source data integration vendors such as Pentaho and Talend are also capturing market share with customers who like their cost-effective offerings and Hadoop integration.
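For readers less familiar with ETL, the three steps can be sketched in plain Python. This is a toy illustration and not a representation of any vendor tool; the CSV layout, the column names meter_id and kwh, the readings table, and the use of an in-memory SQLite database as the target are all assumptions of mine.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple


def extract(csv_text: str) -> List[Dict[str, str]]:
    """Extract: read raw rows from a CSV source (hypothetical layout)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, float]]:
    """Transform: normalize meter ids and drop rows with no reading."""
    cleaned = []
    for row in rows:
        reading = row["kwh"].strip()
        if not reading:
            continue  # skip rows where the reading is missing
        cleaned.append((row["meter_id"].strip().upper(), float(reading)))
    return cleaned


def load(conn: sqlite3.Connection, rows: List[Tuple[str, float]]) -> int:
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (meter_id TEXT, kwh REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

A run would chain the steps as load(conn, transform(extract(source_text))). Commercial ETL tools add parallelism, connectors, and metadata on top of this basic pattern.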
Replication technologies like change data capture can capture big data, such as utility smart meter readings, in near real time with minimal impact on system performance. Replication tools include IBM InfoSphere Data Replication and Oracle GoldenGate. Informatica’s data replication tools, including Fast Clone and Data Replication, offer high-volume replication of data to and from Hadoop.
Data virtualization is also known as data federation. Data virtualization allows an application to issue SQL queries against a virtual view of data in heterogeneous sources such as in relational databases, XML documents, and on the mainframe. Vendors include IBM (InfoSphere Federation Server), Informatica (Data Services), Denodo, and Composite Software.
6. Text Analytics
Organizations increasingly want to derive insights from large volumes of unstructured content within call center agents’ notes, social media, IT logs, and medical records. Text analytics is a method for extracting usable knowledge from unstructured text data through the identification of core concepts, sentiments, and trends, and then using this knowledge to support decision-making. SAS Text Analytics and Oracle Endeca Information Discovery offer text analytics capabilities. IBM’s text analytics capabilities are embedded in a number of products including IBM SPSS Text Analytics for Surveys, IBM InfoSphere BigInsights, IBM InfoSphere Streams, IBM Cognos Consumer Insight, IBM Content Analytics, IBM Content and Predictive Analytics for Healthcare, and IBM eDiscovery Analyzer. Clarabridge is a standalone vendor that offers text analytics of surveys, emails, social media, and call center agents’ notes to support customer experience analytics.
7. Big Data Discovery
Vendor tools such as IBM InfoSphere Discovery and Information Analyzer, Oracle Enterprise Data Quality Profile and Audit, Informatica Data Explorer, Trillium Software’s TS Discovery, SAP BusinessObjects Data Services, and SAS DataFlux Data Management Studio support traditional data profiling and discovery projects for structured data at rest. Informatica has also announced its intention to release native Hadoop capabilities for data discovery. We anticipate that other vendors will follow suit. Organizations also need to consider tools for search and discovery of unstructured data. These tools include Oracle Endeca Information Discovery, IBM Vivisimo, and the Google Search Appliance.
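At its simplest, data profiling means computing summary statistics for each column so that a data steward can spot gaps and anomalies. The sketch below shows the kind of output a profiling pass produces; the function name and the choice of statistics are my own assumptions and do not reflect the behavior of any specific product.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Return basic profiling statistics for one column of data.

    None and empty strings are both treated as missing values.
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
```

For instance, profiling a gender column would quickly reveal how many rows are unpopulated and whether unexpected codes are in use.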
8. Big Data Quality
Data quality management is a discipline that includes the methods to measure and improve the quality and integrity of an organization’s data. Traditional data quality tools include IBM InfoSphere QualityStage, Informatica Data Quality, Oracle Enterprise Data Quality, Trillium Software TS Quality, SAS DataFlux Data Management Studio, and SAP BusinessObjects Data Quality Management. However, big data quality will require radically different approaches from a technology perspective. For example, organizations may need to consider the following approaches:
- Address data quality natively within Hadoop. Informatica has announced its intention to release native Hadoop capabilities for data quality. We anticipate that other vendors will follow suit.
- Leverage unstructured content to improve the quality of sparse data. For example, a hospital used text analytics to improve the quality of structured data attributes such as “smoker” and “drug and alcohol abuse.” As a result, the hospital improved its ability to identify patients who were most likely to be readmitted within 30 days of treatment for congestive heart failure.
- Use CEP to improve data quality in real-time without landing data to disk. For example, a telecommunications operator used CEP to de-duplicate telecommunications call detail records in real time, a process known as mediation.
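The call detail record example can be sketched as a streaming de-duplication step. This is only an illustration under stated assumptions: I assume a record is a (caller, callee, start_time) tuple, that records arrive roughly in start-time order, and that a duplicate always arrives within a fixed horizon of the original. Real mediation systems use richer keys and rules.

```python
from collections import OrderedDict
from typing import Iterable, Iterator, Tuple

CallRecord = Tuple[str, str, float]  # (caller, callee, start_time) - assumed shape


def deduplicate(
    records: Iterable[CallRecord], horizon_seconds: float
) -> Iterator[CallRecord]:
    """Yield each call record once, dropping duplicates seen within the horizon.

    Records older than the horizon are forgotten, so memory stays bounded
    and nothing has to be landed to disk.
    """
    seen: "OrderedDict[CallRecord, None]" = OrderedDict()  # oldest arrival first
    for record in records:
        start_time = record[2]
        # Forget records too old to be duplicated again.
        while seen and next(iter(seen))[2] < start_time - horizon_seconds:
            seen.popitem(last=False)
        if record in seen:
            continue  # duplicate: drop it
        seen[record] = None
        yield record
```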
9. Metadata
Metadata is information that describes the characteristics of any data artifact, such as its name, location, perceived importance, quality, or value to the enterprise, and its relationships to other data artifacts that the enterprise deems worth managing. Big data expands the volume, velocity, and variety of information while adding new challenges in building and maintaining a coherent metadata infrastructure. The HCatalog project (formerly known as Howl) is now part of the Apache Incubator. HCatalog is built on top of the Hive metastore and aims to address the lack of metadata support within Hadoop. A number of vendors have metadata offerings including IBM InfoSphere Business Glossary and Metadata Workbench, Informatica Metadata Manager and Business Glossary, Adaptive Metadata Manager, and ASG-Rochade. Organizations need to add big data-related business terms to their business glossaries. As organizations store more of their data within Hadoop, they will need to address data lineage and impact analysis within this environment as well.
10. Information Policy Management
Information governance is all about managing information policies. Whether they recognize it or not, organizations grapple with five important processes relating to information policies:
i. Documenting policies relating to data quality, metadata, privacy, and information lifecycle management. For example, a big data policy might state that call center agents should not record social security numbers in their notes.
ii. Assigning roles and responsibilities such as data stewards, data sponsors, and data custodians.
iii. Monitoring compliance with the data policy. In the above-mentioned example, the organization might use text analytics tools to identify instances where call center agents’ notes contain social security numbers.
iv. Defining acceptable thresholds for data issues. In the example, the information governance team might determine that the acceptable threshold needs to be zero instances because of the potential privacy implications of having social security numbers in clear text.
v. Managing issues, especially those that are long-lived and affect multiple functions and lines of business. Taking the example further, the information governance team might create a number of trouble tickets so that the customer service team can eliminate any mentions of social security numbers within agents’ notes.
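The social security number example that runs through these five processes can be sketched in a few lines of Python. This is a simplified stand-in for a text analytics tool: the regular expression only matches the dashed NNN-NN-NNNN format, and the note ids, function names, and redaction marker are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: only the dashed NNN-NN-NNNN format is detected.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Step iv: the acceptable threshold for this policy is zero instances.
ACCEPTABLE_THRESHOLD = 0


def find_violations(notes: Dict[str, str]) -> List[str]:
    """Step iii: return the ids of notes that contain a social security number."""
    return [note_id for note_id, text in notes.items() if SSN_PATTERN.search(text)]


def is_compliant(notes: Dict[str, str]) -> bool:
    """Compare the number of violations with the acceptable threshold."""
    return len(find_violations(notes)) <= ACCEPTABLE_THRESHOLD


def redact(text: str) -> str:
    """Step v: remove the sensitive value once an issue has been raised."""
    return SSN_PATTERN.sub("[REDACTED]", text)
```

In practice the list returned by find_violations would feed the issue management process, with one ticket per offending note.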
Most organizations have been approaching information governance policies in a manual fashion. However, vendors now offer tools to automate the process of managing policy for all types of information including big data. Tools in this space include Kalido Data Governance Director and SAP BusinessObjects Information Steward. Organizations that have made the investment in governance, risk and compliance platforms like IBM OpenPages and EMC RSA Archer eGRC may also elect to extend these tools to document operational controls and to monitor compliance with information policies. Finally, some organizations may also choose to use an existing issue management tool like BMC Remedy to handle data-related issues, although these tools are not specifically targeted at this problem domain.
11. Master Data Management
Organizations may want to enrich their master data with additional insight from big data. For example, they might want to link social media sentiment analysis with master data to understand if a certain customer demographic is more favorably disposed to the company’s products. Major vendors’ offerings include IBM InfoSphere Master Data Management, Oracle Master Data Management, SAP NetWeaver Master Data Management, and Informatica Master Data Management. Informatica has built a compelling demo to highlight the integration of MDM with Facebook. We anticipate that other vendors will also support integration with social media as part of the so-called “social MDM.” Organizations will also need well-governed, clean reference data, such as codes for gender, countries, states, currencies, and diseases, to support their big data projects. All the major MDM vendors also offer tools to manage reference data.
12. Data Warehouses and Data Marts
Organizations have large investments in data warehouses and data marts that may be based on relational databases (such as Oracle Database 11g and IBM DB2), columnar databases (such as SAP Sybase IQ and ParAccel), and data warehousing appliances (such as Oracle Exalytics In-Memory Machine, IBM Netezza, HP Vertica, and EMC Greenplum). The Teradata Aster MapReduce Appliance offers the ability to use SQL with a MapReduce analytics engine on a Teradata hardware platform.
As organizations adopt big data, they will increasingly follow a blended approach to integrate Hadoop and other NoSQL technologies with their traditional data warehousing environments. For example, a large organization generated significant volumes of clickstream data from its web presence. The clickstream data had the following characteristics:
- Data was in XML format.
- Each user session generated large volumes of data.
- The data was sparse and there was only a small amount of insight to be gained from vast quantities of information.
- Licensing fees made it cost prohibitive to handle the raw clickstream data within the data warehouse.
- The business intelligence team found it difficult to parse the XML data with their current ETL tool.
The business intelligence team used Hadoop to analyze user browsing patterns within the clickstream data. However, the team needed to marry the browsing data with the sales information in the limited number of cases where the user actually made a purchase. Because the sales information was in the data warehouse, the business intelligence team decided to use ETL to move the clickstream data for actual buyers from Hadoop into the data warehouse.
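The filtering step in this case study, keeping only the clickstream data for users who actually made a purchase, might look like the following in miniature. The XML element and attribute names (session, user, page, url) are invented for illustration, since the organization’s actual format is not described, and the standard-library XML parser stands in for whatever ran on Hadoop.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set


def buyer_sessions(clickstream_xml: str, buyer_ids: Set[str]) -> List[Dict[str, str]]:
    """Flatten clickstream XML into rows, keeping only sessions of known buyers.

    buyer_ids would come from the sales data held in the data warehouse.
    """
    root = ET.fromstring(clickstream_xml)
    rows = []
    for session in root.iter("session"):
        user_id = session.get("user")
        if user_id not in buyer_ids:
            continue  # the vast majority of sessions never lead to a purchase
        for page in session.iter("page"):
            rows.append({"user": user_id, "url": page.get("url")})
    return rows
```

The resulting rows are small and tabular, which is what makes them practical to load into the data warehouse alongside the sales information.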
13. Big Data Analytics and Reporting
A number of open source and vendor tools can support big data analytics and reporting.
Visualization and Reporting
Vendors such as SAS, IBM (Cognos), SAP (BusinessObjects), Tableau, QlikView, and Pentaho have offerings that can visualize and analyze big data. Vendors’ product roadmaps increasingly offer the visualization and reporting of large datasets in Hadoop. For example, SAP has demonstrated the ability to display federated queries within BusinessObjects across Hadoop and HANA instances in the cloud.
Generalized predictive analytics tools
Analytics models will increasingly incorporate big data types. For example, a predictive model for insurance claims fraud might incorporate social media relationships. Vendors are starting to address this requirement within their product roadmaps. According to recent SAS updates, the SAS/Access Interface to Hadoop allows SAS users to treat Hive as just another data source similar to relational databases, data warehousing appliances, and hierarchical databases. SAS Hadoop support allows users to submit Pig, MapReduce, and HDFS commands from within the SAS environment. SAS also provides the ability to create UDFs that can be deployed within HDFS. This includes the ability to use SAS Enterprise Miner to take analytical scoring code and produce a UDF that can be deployed within HDFS and accessed by Hive, Pig, or MapReduce code. Microsoft has also been surprisingly aggressive with Hadoop support. Microsoft’s Hive ODBC driver enables users of Microsoft SQL Server Analysis Services, PowerPivot, and Power View to interact with Hadoop data. In addition, Microsoft’s Hive add-on for Excel enables users to interact with Hadoop data from a spreadsheet environment. Finally, R is an open source package that is often used to conduct statistical analyses of large datasets in Hadoop.
A slew of vendors such as Attensity, Lithium, and Salesforce Radian6 offer tools to address so-called “social listening” requirements. In addition, mega vendors such as IBM with Cognos Consumer Insight, Oracle with Collective Intellect, and SAS with Social Media Analytics also have offerings in this space.
A number of vendors offer specialized tools for big data. One notable example is Splunk, which offers tools for analytics of machine-to-machine data from applications and network logs to reduce application downtime and improve network security.
14. Big Data Security and Privacy
Much has already been said about the issues relating to Hadoop security. Hadoop is still an emerging technology and we anticipate that these issues will be resolved as large companies and vendors get involved. We discuss two important technologies relating to data security and privacy. To the best of our knowledge, these tools do not support Hadoop today. However, we anticipate that vendors will include Hadoop support in their product roadmaps.
Data masking tools are critical to de-identify sensitive information, such as birth dates, bank account numbers, street addresses, and Social Security numbers. Tools in this space include IBM InfoSphere Optim Data Masking Solution and Informatica Data Masking solutions.
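As a rough illustration of de-identification, the sketch below masks all but the last four digits of an identifier and reduces a birth date to its year. These two rules are simplifications I chose for the example; commercial masking tools offer many more techniques, and the function names here are hypothetical.

```python
def mask_digits(value: str, keep_last: int = 4) -> str:
    """Replace every digit except the last `keep_last` with 'X'.

    Separators such as dashes are preserved so the masked value keeps
    the format of the original.
    """
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - keep_last, 0)
    masked = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            masked.append("X")
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


def generalize_birth_date(iso_date: str) -> str:
    """Keep only the year of a date given as YYYY-MM-DD (assumed format)."""
    return iso_date[:4]
```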
Database monitoring tools enforce separation of duties and monitor access to sensitive big data by privileged users. For example, telecommunications operators can use database monitoring to monitor access to sensitive call detail records, which reveal subscribers’ calling patterns. In addition, utilities can use these tools to monitor access to smart meter readings that reveal when consumers are in and out of their homes. The database monitoring functionality must have a minimal impact on database performance and should not require any changes to databases or applications. Vendors include IBM (InfoSphere Guardium) and Imperva.
15. Big Data Lifecycle Management
Information lifecycle management (ILM) is a process and methodology for managing information through its lifecycle, from creation through disposal, including compliance with legal, regulatory, and privacy requirements. The components of a big data lifecycle management platform are listed below:
As big data volumes grow, organizations need solutions that enable efficient and timely archiving of structured and unstructured information while enabling its discovery for legal requirements, and its timely disposition when no longer needed by the business, legal, or records stakeholders. We discuss three types of big data below:
- Social media – This big data type is subject to retention policies driven by e-Discovery and regulations from authorities such as FINRA in the United States. A recent blog indicated that this trend is driving an entirely new class of social media archiving tools from vendors such as Arkovi, Backupify, Cloud Preservation, Erado, Hanzo Archives, and PageFreezer.
- Big transaction data and machine-to-machine data – RainStor uses data compression techniques to reduce the volume of big data. RainStor delivers two editions of its product to manage massive volumes of structured, semi-structured, and unstructured data such as telephone CDRs, utility smart meter readings, and log files.
- Hadoop – Organizations are also discovering the value of Hadoop as a cost-effective archive for applications such as email.
Vendor offerings such as Symantec Enterprise Vault, HP Autonomy Consolidated Archive, IBM Smart Archive, and EMC SourceOne are positioned as unified archives for a variety of data types.
Records and retention management
Every ILM program must maintain a catalog of laws and regulations that apply to information in the jurisdictions in which a business operates. These laws, regulations, and business needs drive the need for a retention schedule that determines how long documents should be kept and when they should be destroyed. Records management solutions enforce a business process around document retention. Vendor tools include IBM Enterprise Records, EMC Documentum Records Manager, HP Autonomy Records Manager, and OpenText Records Management.
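A retention schedule is, at bottom, a lookup from record type to retention period plus a rule for when disposal is allowed. The sketch below shows that logic. The record types and retention periods are placeholder values I made up, not legal guidance, and the legal hold flag is a simplification of the preservation obligations that can override a schedule.

```python
from datetime import date

# Hypothetical retention periods in years; real values come from the
# organization's catalog of laws, regulations, and business needs.
RETENTION_YEARS = {"email": 3, "invoice": 7}


def disposal_date(record_type: str, created: date) -> date:
    """Return the earliest date on which a record may be destroyed."""
    years = RETENTION_YEARS[record_type]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on February 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, day=28)


def is_due_for_disposal(
    record_type: str, created: date, today: date, on_legal_hold: bool = False
) -> bool:
    """A record is due once its retention period ends, unless it is on hold."""
    if on_legal_hold:
        return False
    return today >= disposal_date(record_type, created)
```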
Legal Holds and Evidence Collection (eDiscovery)
Most corporations and entities are subject to litigation and governmental investigations that require them to preserve potential evidence. Large entities may have hundreds or thousands of open legal matters with varying obligations for data. Data sources include email, instant messages, Excel spreadsheets, PDF documents, audio, video, and social media. Vendor tools include Symantec Enterprise Vault, HP Autonomy eDiscovery, IBM eDiscovery Manager, Recommind Axcelerate eDiscovery Suite, Nuix eDiscovery, ZyLAB eDiscovery and Production System, and Guidance Software EnCase eDiscovery.
Test Data Management
The big data governance program needs tools to streamline the creation and management of test environments, subset and migrate data to build realistic and right-sized test databases, mask sensitive data, automate test result comparisons, and eliminate the expense and effort of maintaining multiple database clones. IBM InfoSphere Optim Test Data Management Solution and Informatica Data Subset streamline the creation and management of test environments.
16. Cloud
Organizations are also turning to the cloud because of perceived flexibility, faster time-to-deployment, and reduced capital expenditure requirements. A number of vendors offer big data platforms in the cloud and we list a few examples below:
Mega cloud vendors
Amazon Web Services offers a Hadoop framework that is part of a hosted web service called Amazon Elastic MapReduce. The Google Cloud Platform allows organizations to build applications, store large volumes of data, and analyze massive datasets on Google’s computing infrastructure.
Data brokers include companies such as Acxiom, Reed Elsevier, Thomson Reuters, and, literally, thousands of others that specialize by dataset and industry. These companies offer many types of data enrichment and validation services to organizations.
Mega IT vendors
HP Converged Cloud enables organizations to move between private, hybrid, and public cloud services.
Information management software vendors
Offerings such as Trillium Software TS Quality on Demand and SAS DataFlux Marketplace provide validation, cleansing, and enrichment of name, email, and address as a service. Informatica Cloud provides data loading, synchronization, profiling, and quality services for Salesforce and other cloud applications.
In summary, big data has game-changing potential with the advent of new data types and emerging technologies such as Hadoop, NoSQL, and streaming analytics. To take advantage of these developments, organizations need to create a reference architecture that integrates these emerging technologies into their existing infrastructure. As always, I would appreciate your feedback. Please feel free to leave a comment, send me an email at firstname.lastname@example.org, or find me on Twitter at @sunilsoares1.