Unstructured data - from such sources as forms, e-mail or documents - contains a great deal of information that can be usefully employed in a business intelligence system. Some of this information is vitally important. The Enron and related scandals have made the location and retrieval of unstructured data - in the form of e-mails and corporate documents - a business process that can determine the very survival of a corporation.
The Sarbanes-Oxley Act has greatly spurred the extension of structured database management systems into the realm of content management - the addition of unstructured data objects into the safe harbor and support of an enterprise database. Sarbanes-Oxley requires corporations to be able to retrieve any data (especially documents and e-mail) that pertains to the reliability of a corporation's financial statements. Database vendors are scrambling to deploy enterprise-class content management solutions that ensure compliance with this politically sensitive and highly visible legislation.
To date, tools used to build business intelligence systems have been developed independently of those designed to solve content management problems. However, the two are beginning to merge. This article discusses the challenges of merging the technologies and provides an example where the two can be merged to solve an organizational problem.
The Nature of the Problem
Finding data in an unstructured document is a challenge. Several examples illustrate this challenge:
- The source document is paper, not electronic. Insurance, medical and human resource forms are often paper-based. Note that the data of interest, in this case, is reasonably structured. Its position on the source document can be spatially located.
- The source document is structured - as in example 1 - but it is already in an electronic form. A Web technology such as XML may be used.
- The source document is electronic, but the data of interest is not structured. E-mail and word processing documents fall into this category.
- The source document is paper, and the data is unstructured. There is no electronic representation of the document - perhaps a historical document prepared before the advent of word processing.
- The source is a "blob," not a document - such as pictures, voice or video.
It is useful to classify unstructured data as one of two types: "structured/unstructured" (types 1 and 2 above) and "unstructured/unstructured" (types 3, 4 and 5).
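The five source types and the two broader classes can be written down as a small lookup. This is only an illustrative sketch: the `Source` record and the `source_type` and `source_class` functions are hypothetical names introduced here, not part of any product discussed in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a data source."""
    medium: str       # "paper", "electronic" or "blob"
    structured: bool  # True when the data of interest sits at a known position


def source_type(src: Source) -> int:
    """Map a source onto the five numbered types listed above."""
    if src.medium == "blob":
        return 5
    if src.structured:
        return 1 if src.medium == "paper" else 2
    return 3 if src.medium == "electronic" else 4


def source_class(src: Source) -> str:
    """Collapse the five types into the two broader classes."""
    if source_type(src) in (1, 2):
        return "structured/unstructured"
    return "unstructured/unstructured"
```

For example, a paper insurance form would be `Source("paper", True)`, which maps to type 1 and the "structured/unstructured" class.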
In type 1, the location of the data of interest to our business intelligence system is known - if we can only bridge the gap from paper form to electronic form.
Type 2 is the easiest, from a data extraction point of view. The data is really no different from a traditional, record-oriented source of a business intelligence system. It is electronic and it is form-based (structured).
In types 3 and 4, we have a greater challenge. Even if the source document is already electronic (type 3), we cannot spatially locate the data of interest in the document - that is, the data we want is not always on a certain line and at a certain position within that line. Type 4 has all the problems of type 3, with the additional complication of being on paper.
Types 3 and 4 are text-based, implying that some form of text processing or text mining might help us find the data of interest.
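As a deliberately simple illustration of text processing, pattern matching can find a value by its shape rather than its position. The sketch below assumes the value of interest is a Social Security number written as `NNN-NN-NNNN`; the function name and sample text are made up for the example, and real text mining goes well beyond a single regular expression.

```python
import re

# Assumption: SSNs appear in the common NNN-NN-NNNN written form.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text: str) -> list[str]:
    """Return every SSN-shaped string in free text, wherever it occurs."""
    return SSN_PATTERN.findall(text)


email_body = (
    "Please update my records. My SSN is 123-45-6789 "
    "and I was hired in 1998."
)
matches = find_ssns(email_body)
```

Here the match is found regardless of which line or column it occupies, which is exactly the property that positional extraction lacks for these document types.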
Type 5 seems the most difficult of all, since the source is impervious to common data- or text-processing techniques. This type of data is undoubtedly of great interest to intelligence agencies and is probably the focus of intense research efforts.
Until recently, content management and business intelligence capabilities have been developing in parallel. There has been relatively little effort - or motivation - to integrate these two areas before now.
How Can This Problem Be Solved?
The electronic sources of data for this enterprise database present a well-understood exercise of integrating structured data from multiple sources into an integrated database. Of course, saying that the problem is well understood does not mean that it is "easy." There are a myriad of technical and business challenges in the extract, transform and load (ETL) process that have to be addressed to ensure the resulting database meets enterprise objectives.
The harder problem is capturing data from the forms - there are millions of them - and converting the image of data into digital data. However, once this problem is solved, the problem becomes "easy" again - the paper-based data can now be integrated into the enterprise database using a variety of ETL techniques.
The solution to this problem requires the involvement of a third information technology discipline, which has also been developing in parallel to the other two (ETL and content management) - scanning technologies.
Paper management has long been a problem for large enterprises. Insurance companies are probably at the forefront of this process, due to the paper- and document-intensive nature of many of their business processes. Companies such as Xerox and Kodak have long served this market. These vendors have developed scanning technologies that can quickly turn paper into an image and, equally important from the business intelligence perspective, produce meta data that enables one to find a document easily. Examples of commonly used meta data include SSN, name and form ID.
These "image vendors" also provide data conversion technologies - for example, a data entry operator can use a light pen and inform the computer of the coordinates of the SSN field on the form. The conversion technology can read the field and convert it to data.
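The idea of reading a field from known coordinates can be sketched as follows. To keep the example self-contained, it assumes the page has already been turned into lines of text, so a "zone" is a line number plus a character range rather than pixel coordinates on an image. The `Zone` type, the `FORM_ZONES` table and the sample page are all hypothetical.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    """Hypothetical field location: a line and a character range on it."""
    line: int
    start: int
    end: int


# Assumption: an operator has recorded where each field sits on the form.
FORM_ZONES = {
    "ssn": Zone(line=0, start=5, end=16),
    "name": Zone(line=1, start=6, end=26),
}


def extract_fields(page_lines: list[str], zones: dict[str, Zone]) -> dict[str, str]:
    """Read each field from its recorded zone and trim padding."""
    return {
        field: page_lines[zone.line][zone.start:zone.end].strip()
        for field, zone in zones.items()
    }


page = [
    "SSN: 123-45-6789",
    "Name: Jane Q. Public",
]
record = extract_fields(page, FORM_ZONES)
```

The extraction logic knows nothing about what an SSN or a name looks like; it relies entirely on the recorded positions being right for the form in hand.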
Much of this data conversion process is manual, however. Depending on the nature and business complexity of the conversion, it may be cheaper or more reliable to employ two technicians to transcribe the same field and then compare the results.
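The two-technician check can be sketched as a field-by-field comparison of the two transcriptions. The function name and the sample values below are illustrative assumptions; how disputed fields are then resolved (a third operator, a supervisor, the scanned image) is left open.

```python
def compare_transcriptions(
    first: dict[str, str], second: dict[str, str]
) -> tuple[dict[str, str], dict[str, tuple]]:
    """Accept fields both technicians keyed identically; flag the rest."""
    accepted: dict[str, str] = {}
    disputed: dict[str, tuple] = {}
    for field in first.keys() | second.keys():
        a, b = first.get(field), second.get(field)
        if a is not None and a == b:
            accepted[field] = a
        else:
            # Keep both versions so a reviewer can check them against the image.
            disputed[field] = (a, b)
    return accepted, disputed


tech_a = {"ssn": "123-45-6789", "name": "Jane Q. Public"}
tech_b = {"ssn": "123-45-6789", "name": "Jane O. Public"}
accepted, disputed = compare_transcriptions(tech_a, tech_b)
```

In this sample the SSN is accepted because both entries agree, while the name is flagged because the middle initial differs.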
There are usually two outputs from this conversion process - the scanned image and the converted data. Frequently, the scanned image needs to be retained - often for evidentiary purposes in case the form becomes involved in litigation. The content management components of the database are responsible for storing this data, indexing it with the appropriate meta data, and - perhaps most importantly of all - retaining a paper trail. Sarbanes-Oxley and DOD 5015.2 impose strict requirements on this document management process. Database vendors are continuously upgrading content management functionality - particularly the records management components - to meet these requirements.
The second output - the converted data - is typically a flat file. For seasoned designers of a business intelligence system, this flat file becomes an input for an ETL process.
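A minimal sketch of that hand-off follows, assuming a pipe-delimited flat file with a header row and using an in-memory SQLite database as a stand-in for the enterprise database. The file layout, the `converted_form` table and the column names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Assumption: the conversion step emits a pipe-delimited file with a header.
FLAT_FILE = "form_id|ssn|name\nHR-101|123-45-6789| Jane Q. Public \n"


def load_flat_file(text: str, conn: sqlite3.Connection) -> int:
    """Extract rows from the flat file, clean them, and load them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS converted_form "
        "(form_id TEXT, ssn TEXT, name TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = [
        (
            row["form_id"].strip(),
            row["ssn"].replace("-", ""),  # transform: store digits only
            row["name"].strip(),          # transform: trim padding
        )
        for row in reader
    ]
    conn.executemany("INSERT INTO converted_form VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_flat_file(FLAT_FILE, conn)
stored = conn.execute("SELECT form_id, ssn, name FROM converted_form").fetchall()
```

Once the data is in this shape, nothing downstream needs to know that it started life on paper.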
Conceptually, at least, we are done. We have a solution to the type 1 and type 2 problems. We have used three previously unrelated technologies - structured data management, unstructured content management and image scanning/conversion technologies - to bring structured and unstructured data together in a single enterprise database.
Practically speaking, however, there are a number of issues that need to be addressed. One in particular, in the case of this client, is the issue of data primacy. Retirement data obtained from a paper document may conflict with that obtained from an electronic source.
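One common way to handle primacy is an explicit precedence rule that also records whether the sources disagreed. The sketch below is only an assumption about how such a rule might look: the article does not say which source should win, so the ordering in `SOURCE_PRIORITY` and the sample dates are purely illustrative.

```python
# Assumption: the business has decided which source is trusted first.
SOURCE_PRIORITY = ["electronic", "paper"]


def resolve(values_by_source: dict, priority=SOURCE_PRIORITY):
    """Pick the value from the highest-priority source and flag conflicts."""
    for source in priority:
        chosen = values_by_source.get(source)
        if chosen is not None:
            conflict = any(
                value is not None and value != chosen
                for value in values_by_source.values()
            )
            return chosen, conflict
    return None, False


retirement_date, conflicted = resolve(
    {"paper": "2031-06-30", "electronic": "2032-06-30"}
)
```

Returning the conflict flag alongside the chosen value lets the ETL process route disagreements to a review queue instead of silently overwriting one source with the other.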
A second significant problem is that the structure of a form changes over time. For example, a form to capture the particulars of a new employee hire will change, as the enterprise needs to know something new or different about new employees. Affirmative action data is but one of many examples. The addition of new data means that the position of the new data has to be captured - and possibly causes a position shift of other data on the form. Data positionality is almost transparent in relational data sources, but a significant and ongoing concern in document capture and conversion.
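One way to cope with shifting positions is to key the field layout on the form revision as well as the form ID, so that each scanned page is read with the layout that was in force when it was printed. The sketch below assumes pages are available as lines of text and that the revision is known from the scan's meta data; the `LAYOUTS` table, the form ID and the added `start_date` field are hypothetical.

```python
# Assumption: one layout per (form ID, revision); each field maps to
# (line, start column, end column) on the converted page.
LAYOUTS = {
    ("HR-101", 1): {"name": (0, 6, 26), "ssn": (1, 5, 16)},
    # Revision 2 adds a field, which pushes the SSN down one line.
    ("HR-101", 2): {
        "name": (0, 6, 26),
        "start_date": (1, 12, 22),
        "ssn": (2, 5, 16),
    },
}


def extract(page_lines: list[str], form_id: str, revision: int) -> dict[str, str]:
    """Read fields using the layout recorded for this form revision."""
    layout = LAYOUTS[(form_id, revision)]
    return {
        field: page_lines[line][start:end].strip()
        for field, (line, start, end) in layout.items()
    }


page_v1 = ["Name: Jane Q. Public", "SSN: 123-45-6789"]
page_v2 = ["Name: Jane Q. Public", "Start date: 2004-03-01", "SSN: 123-45-6789"]
old = extract(page_v1, "HR-101", 1)
new = extract(page_v2, "HR-101", 2)
```

Both revisions yield the same SSN even though it has moved, because the position is looked up per revision rather than hard-coded once.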
Type 3 and type 4 problems may be solved by integrating text mining technologies into an architecture similar to the one described in this article.