Taming unstructured data

Darin L. Stewart

The age of information overload is slowly drawing to a close. The enterprise is finally getting comfortable with managing massive amounts of data, content and information. The pace of information creation continues to accelerate, but the ability of infrastructure and information management to keep pace is coming within sight. Big Data is now considered a blessing rather than a curse.

Even so, managing information is not the same as fully exploiting information. While Big Data technologies and techniques are unlocking secrets previously hidden in enterprise data, the largest source of potential insight remains largely untapped. Unstructured content represents as much as 80 percent of an organisation’s total information assets. Although the same technologies and techniques are well suited to exploring unstructured information, this ‘Big Content’ remains grossly underutilised and its potential largely unexplored.

Gartner defines unstructured data as content that does not conform to a specific, pre-defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly into database tables. Within the enterprise, unstructured content takes many forms, chief among which are business documents (reports, presentations, spreadsheets and the like), email and Web content. Each of these content sources has a mature discipline supporting it. Business documents are shepherded through their lifecycle by enterprise content management (ECM) platforms. Email is managed, monitored and archived along with other text-based communication channels. Ever more sophisticated Web content is matched by equally sophisticated Web content management tools. But each of these platforms is focused on management and retention rather than analysis and exploration. They can, however, provide a robust foundation that will support a Big Content infrastructure.

Enterprise-owned and operated information is only part of the Big Content equation. The potential for insight and intelligence expands dramatically when enterprise information is augmented and enhanced with public information. Content from the social stream can be a direct line into the hearts and minds of customers. Blogs, tweets, comments and ratings are a reflection of the current state of public sentiment. More traditional Web content such as news articles, product information and simple corporate informational Web pages become an extension of internal research when tamed. More formal data sources are emerging in the public realm in the form of smart disclosure information from various areas of government in the U.S. and Linked Open Data across the globe. All of these unstructured (and semi-structured) information sources become valuable extensions to enterprise information resources when approached in a Big Content manner.

This approach combines the technologies and techniques of Big Data with the unique capabilities of advanced content management and enterprise search. This is a powerful combination that facilitates knowledge discovery in ways not previously possible. Internal documents, email and collaboration artifacts can be combined with public Web and social content to uncover product issues before they become an embarrassment. Purpose-built, unified search indexes and applications can illuminate intellectual property holdings by bringing together innovation indicators across the enterprise while also providing a unified view of how those holdings fit into the broader market and patent landscape.
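As a rough illustration of what a unified index across internal and public content might look like, the following minimal Python sketch builds a single inverted index over a handful of records and groups the hits for a term by source. The sample records, the source labels and the function names are all hypothetical, invented for this example; a production system would rely on an enterprise search platform rather than anything this simple.

```python
import re
from collections import defaultdict

# Hypothetical sample records; the ids, sources and text are invented
# purely for illustration and do not come from any real system.
DOCUMENTS = [
    {"id": "doc-1", "source": "ecm",
     "text": "Quarterly report notes battery overheating in returns."},
    {"id": "mail-1", "source": "email",
     "text": "Customer support escalated a battery complaint."},
    {"id": "web-1", "source": "social",
     "text": "My new phone battery gets hot while charging."},
    {"id": "web-2", "source": "news",
     "text": "Company announces new product line."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build one inverted index (token -> set of document ids) over all sources."""
    index = defaultdict(set)
    for doc in documents:
        for token in tokenize(doc["text"]):
            index[token].add(doc["id"])
    return index


def search(index, documents, term):
    """Return the ids of documents containing the term, grouped by source."""
    by_id = {doc["id"]: doc for doc in documents}
    hits = defaultdict(list)
    for doc_id in sorted(index.get(term.lower(), set())):
        hits[by_id[doc_id]["source"]].append(doc_id)
    return dict(hits)


index = build_index(DOCUMENTS)
result = search(index, DOCUMENTS, "battery")
print(result)
```

In this toy data set, a search for "battery" surfaces matches from the document repository, email and the social stream side by side, which is the kind of cross-source signal the product-issue scenario above depends on.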

In most cases, the raw materials facilitating these use cases already exist in the enterprise. Many organisations have solutions in production and are answering previously unanswerable questions as a result.

Big Data and Big Content answers and insights come neither easily nor inexpensively. Unstructured content will be plentiful across the enterprise, but it tends to be isolated, unorganised and unmanaged. Even when a content management system (CMS) has been employed to support content and an enterprise search engine is in place, they have usually not been deployed or maintained with an eye to analysis. They are necessary, but not sufficient, components of a Big Content solution. Big Data technologies and techniques can be employed to bridge the gap. Even where a mature Big Data practice exists, however, factors unique to unstructured content must be taken into account.

The answers to many critical business questions are often contained within the unstructured content scattered across the enterprise. By refining, reconciling and integrating those content resources with enhanced Big Data technologies and techniques, Big Content enables the sort of deep, non-obvious analysis previously reserved for structured data resources.
