Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
T. S. Eliot [1]
The Information Age is in full swing. Governments, software companies, and the media have declared this fact for years. For the first time in the history of man, information is seen as a commodity rather than as a vague, ethereal concept.
As we enter this Information Age, the methods of information gathering, storage, and retrieval have remained largely unchanged. Information is gathered for the moment, not for posterity. Little thought is given to the format and media in which the data is stored. As a result, this information (the commodity on which we are slowly but surely basing our global economy) is piling up as a non-standard, convoluted, disorganized mess.
We stand at the dawn of great advances in technology and knowledge that will benefit humankind in ways we can't comprehend; yet the darkness of the past stands ready to pollute our future and slow our advances.
The cause of the problem at hand is rooted in pride and compounded by shortsightedness.
In virtually every instance today, when content is produced, it is targeted to one particular format or medium. That is, when a project is conceived, it is conceived for delivery by a particular medium to a particular audience. Once this target format or medium is produced, the content is relegated to spend its days bound by the restrictions of a native, proprietary, formatting-based solution.
But why does this occur? Why is data bound by certain formats to spend a life of relative inactivity once an initial, short-sighted purpose has been served? This occurs because the computing public is trained to work in this manner, and tools of the day enforce this methodology upon those who use them.
The primary cause of this tragedy is the common practice of enhancing data by encoding formatting instructions within the text stream. This practice has been in vogue since the earliest days of word processors because people are concerned with quality of output. This concern is justified; however, the solution of simply adding "formatting codes" within the text stream has dire consequences in today's Information Age. Formatting codes pollute data because they are, in essence, devoid of meaning and non-standard. They stand in the way of utilizing the data fully and pollute the ever-growing information repository of humanity.
Perhaps clarification is necessary: Enhancing data by applying formatting is not problematic; however, applying formatting geared simply towards beautifying output is. First we must ask ourselves a serious question, namely: What meaningful information is encoded when reproducing a particular format is the goal? The answer to this query is mind-numbingly simple: nothing meaningful is preserved! Perhaps some font face and size information, along with basic positioning instructions, may be extracted, but these items cannot stand by themselves. These formats have virtually no definition of what particular combinations of formatting attributes might semantically imply. Instead, they force the user of the data to use a particular application (perhaps even a particular version of that application), with particular fonts and other niceties, in order to even display the data. Those who have no such application must search exhaustively for a conversion tool, or they will never see the data in question.
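The difference can be made concrete with a small sketch. The two fragments below are hypothetical encodings of the same two strings, invented purely for illustration: one records only appearance, the other records what each string is. The helper names element_names and find_author are likewise assumptions, and Python's standard xml.etree.ElementTree module is used only as a convenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical presentational encoding: the tags record appearance only.
PRESENTATIONAL = (
    '<p><font face="Times" size="5"><b>Moby-Dick</b></font>'
    '<br/><i>Herman Melville</i></p>'
)

# Hypothetical structural encoding: the tags record what each string is.
STRUCTURAL = (
    '<book><title>Moby-Dick</title>'
    '<author>Herman Melville</author></book>'
)


def element_names(xml_text):
    """Return the tag names used in a fragment, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def find_author(xml_text):
    """Return the text of the first <author> element, or None if absent."""
    return ET.fromstring(xml_text).findtext(".//author")


if __name__ == "__main__":
    print(element_names(PRESENTATIONAL))  # only font, bold, italic, breaks
    print(element_names(STRUCTURAL))      # book, title, author
    print(find_author(PRESENTATIONAL))    # nothing says who the author is
    print(find_author(STRUCTURAL))
```

A program reading the first fragment can learn that one string is large and bold and another is italic, but it can only guess which is the title and which is the author. The second fragment answers the question directly, with no knowledge of fonts or applications required.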
The conventional view serves to protect us from the painful job of thinking.
John Kenneth Galbraith [2]
Formatting-based solutions are shortsighted because they assume that the particular format the solution chooses is the only format that the data will ever need to be provided in. In other words, the basic assumption is that once the data is created, it will remain static forever. Modifications may, at some point, be made, but they will require a particular application and operating system.
Formatting-based solutions enslave users to applications that understand the particular set of formatting codes within the text stream in order to display the data in a meaningful way. This must cease! If the promise of the Information Age at hand is to be fully realized, data must be encoded in a universally meaningful way. While convenience at design time serves the programming bourgeoisie's desire to rapidly release a solution, this convenience of design, storage, and display should not be confused with convenience in using, updating, and manipulating the data once encoded! Again, while basic text strings may be extracted from such formats (when formats are not of a compiled binary nature), what does this provide? It simply provides character strings devoid of meaning.
Let us examine this problem further.
It is assumed among the computing public that collected data is equivalent to knowledge. That is, the more information one gathers, the more knowledge one has. According to this incredibly fallacious logic, all one must do in this breech-birthed information economy is, simply, gather information in any form. As one gathers this generic, structureless information, one's knowledge grows in direct proportion. Therefore, any information gathered in any format is good, and there is no bad information.
The premise central to this theory is utterly false. The assumption that any information in any format directly equates with knowledge is wrong. Knowledge is not simply the acquisition of information; rather, knowledge is the storage and timely retrieval of information when required. Knowledge is the ability to apply existing information to new situations; it involves the ability to think laterally and to make connections when no implicit connections exist. Thus, to simply hoard information in a non-structured environment is a meaningless exercise, for such information will never again see the light of day. Information frittered away in random and unstructured archives is devoid of meaning and virtually useless to the computing public. The only conceivable manner in which to access this supposed "knowledge" is with a straight text search to retrieve matching character string combinations. Experience soon shows such methodology futile, for the chaff returned by such queries vastly outweighs whatever wheat one might find.
The fact that information does not directly equate with knowledge is blatantly obvious to even the simplest of minds that broach the subject. Promoting information gathering methods and storage systems that ignore the need for encoding meaningful structure is shortsighted and detrimental to the future. Such promotion is designed to serve the immediate monetary needs of the promoter. Instead, information must carry structure to be meaningful to the computing public, regardless of application or operating system.
This fundamental principle of structuralism is not a new concept, but has rather been proven throughout the history of man. The primary and most effective method of information dissemination to date has been the printed text. From inscriptions on stone tablets to mass-marketed paperback books, these methods of conveying knowledge have proven themselves effective over time, from culture to culture. The printed text transcends culture and time, providing effective means of communication to a large audience.
There are six factors that contribute to the success of printed text.
Cheap and commonly available media. Throughout the globe, paper and ink are available to virtually all classes and cultures. The ability to write and have that writing understood is shared across culture.
Entrenched distribution channels. Books are published and sold throughout the world, through distribution channels that have been developed and refined over time. When someone needs a particular text on a particular subject, he need not look too hard.
Non-proprietary format. Printed books appeal to the lowest common technological denominator. They are tangible and available for all cultures, despite the technological prowess (or lack of technology) of any given culture.
History of development. Written communication has been in development for thousands of years, from stone tablet, to papyri, to scroll, to folio, to bound and printed text. The printed text has been under development for over 500 years, and it has been improved and refined constantly over that time. The structural development of the printed text has taken place in such a way that today we simply take the inherent structure of the printed text (the aspect that provides portability and universality) for granted. In other words, the architecture of the text is virtually transparent to the user of the printed text, and simply leaves the content of the text to be delivered as intended.
Science of cataloguing. A science of cataloguing printed texts has grown and matured to the point where virtually any printed text is catalogued in some standardized manner. In addition, indices of periodicals are commonly available. Strangely, these indices are in the same format as the text they represent: printed text. So, printed text has a sort of multidimensional quality in this particular aspect.
Central stores of information. Not only are texts catalogued, but central stores of printed texts, including catalogues, are commonly available from culture to culture. These information stores, called libraries, collect and catalogue printed information to ensure effective and efficient usage.
Based on the above criteria, it is obvious that data gathering and content generation have a long way to go; this point is not disputed. The current state of information gathering and retrieval falls far short of the above criteria. Sadly, the groundwork has not been laid. Rather than concern ourselves with creating a solid base on which to build the future of information storage and retrieval, we have lapsed into pursuing quantity of data over quality of data.
Surely there has been the obligatory nod to the idea that a comprehensive and complete foundation must be built. Standards organizations work tirelessly towards this cause, striving to produce and codify standards that will enable the information economy for the long term. In the meantime, proprietary systems and formats have enacted their own seriously flawed solutions, solutions which render data virtually useless to the vast majority of the computing public. The short-run desire to release some sort of product (flawed though that product must be) to stimulate income has been satisfied, and a deadly cycle has been started.
The cycle continues when dissatisfied members of the computing public vent their frustrations to the purveyors of these "solutions." Sadly, rather than addressing the structural shortsightedness of the "solution", software producers opt for quick "fixes" that only serve to worsen the problem in the long run by entrenching proprietary display-based formats among a growing user base.
Much like the heroin addict craves the next high, these short-sighted propagators of data impurity seek to release newer versions with increasing speed in order to satisfy their cash-hungry bottom line. This short-sightedness, if left unchecked, will surely pollute the global information economy while still in its infancy.
In the 20th century, specialisation has become the counterfeit of brilliance.
Not only are formatting-based solutions shortsighted, but formatting-based solutions targeted for the computing public are inherently based in pride. That is, formatting-based solutions prey upon the pride of the computing public in an effort to entrench themselves in the daily usage patterns of the computing public.
At this point it will be valuable to step back from the problem at hand and examine the degree to which pride has become a part of society today. The idea that one needs a specialized tool to perform a specialized function is exceedingly prevalent in today's society. This idea has been fed to us by advertisers and marketers throughout the world, and the public has bought the lie hook, line, and sinker.
Consider the exercise industry, which has grown by leaps and bounds over the past few years. The entire premise of this industry is that everyone, from Joe Six-Pack to Michael Jordan, must exercise and recreate with specialized tools for specific tasks. If a person plays a pick-up game of basketball, they must have the proper basketball shoes. However, if that same person then decides to play a game of racquetball, they must go and get their racquetball shoes, a racquet, and a racquetball. Heaven forbid that this person then wants to go on a bicycle ride, because they'll have to get yet another pair of specialized shoes, not to mention other equipment, such as bicycle shorts, a helmet, and other accoutrements.
Structuralists propose that there is a need for some specialized items (such as the racquetball equipment of a racquet, ball, and eye protection), but the other items are superfluous marketing piffle for which there is no real need, because other tasks share the same common needs and the majority of situations need no such specialized equipment. In other words, the vast majority of the public that exercises and recreates will receive little if any additional benefit from specialized tools such as specialized shoes for a particular type of task.
Westerners (Americans, in particular) insist on using "specialized" equipment even when they're not "specialists" because they want people to think they're specialists. Ability is no longer primary; rather, the perception that one is able (a perception promoted by the appearance of a specialist) has become primary. The transition from an objective society (one that bases decisions and feelings on proven performance and ability) into a subjective society (one that bases decisions and feelings on perception, even in the midst of proven shortcomings) has solidified.
But how does this apply to structuralism? The idea of specialized tools for specialized tasks is at the heart of it. The problem is that content producers are targeting a single medium for their data, so they decide that in order to produce the best product, they must obtain a solution that specializes in producing output for the chosen medium. If a content producer is designing web content, they decide that they must get a web editor of some sort to assist in the production. People are stuck thinking that to create one output format, they need one output tool. These tools (which are seen as mutually exclusive) produce output (which is seen as mutually exclusive) for specific situations (which are seen as mutually exclusive). While the tool (in this case an HTML editor) may allow the novice to appear as some sort of specialist, the reality is that in most situations, the data produced has been polluted by the inclusion of proprietary extensions, coding reliant on server-based processing extensions, and documents that contain little or no structure, no matter what their appearance. The pride of the consumer has been met, and the purity of their data (the only truly valuable commodity in this newly forming information economy) has been compromised.
In the scope of computing history, when each computer was essentially an island and information was rarely shared in electronic format, the formatting-based solution was adequate. Content was created for a particular need, to be printed out and disseminated on paper to those who might be interested. Attractive presentation enhanced these documents, allowing content producers to effectively communicate to their desired audience.
However, the paradigm is in the process of changing. The computing public no longer desires only the ability to receive or deliver content on the medium of paper; increasingly, it desires to create and distribute information electronically, over local area networks, wide area networks, intranets, and even the internet. In addition, businesses (which are part of the computing public) desire not only to share information electronically across networks, but also to publish the same content in a variety of different forms (from pamphlets to databases to books) without needing to seriously overhaul the organization of the data. In short, the computing public requires data that can be repurposed. Today, content in a mutable format that allows for the efficient use of data to generate multiple output formats for delivery in multiple media is a growing need.
The software industry as it stands today has done a dreadful job of attempting to meet this ever-growing need. Solutions are inherently proprietary, demanding that content providers and their customers use the same software for optimal results. A prime example of the proprietary nature of these solutions is the ever-present HTML extensions of Microsoft and Netscape. Rather than work to extend the standard to meet the needs of the computing public, both Microsoft and Netscape have chosen to incorporate extensions outside of the W3C HTML standard. Neither browser will accurately render content designed for the opposing browser. Data is normally produced with one of these two browsers in mind, and hence held captive to these extensions, which makes conversion into other formats or media all the more difficult.
As this need to render existing data in multiple output formats increases, the shortcomings of the current proprietary, formatting-based systems will come to light. Now is the time for structuralists to be clear in describing the problem, and now is the time for structuralists to be even more clear in specifying a solution.
First things first: Structure is markup. Markup is structure.
Structuralism, to be effective in the real world, must contain some degree of pragmatism. That is, Structuralists must always be conscious that solutions should be designed not for some utopian world, but for effective use by the computing public. A solution that cannot be implemented is not a solution at all.
The Structuralist is presented with several problems in the realm of implementation. The truth is, documents and data may not be nearly as structured as we'd like them to be. In Theodor Holm Nelson's article "Embedded Markup Considered Harmful" [5], this argument is plainly and truthfully made; however, it assumes a sort of "Structural Utopianism" exists among those who endorse Structuralism. Let the truth be known: Document-based formatting is not evil. However, document-based formatting is evil if it is used to archive documents, or if it is used as a base format for conversions to alternate media or formats.
Structuralism assumes that content providers desire to exploit the full value of their content. In other words, for the value of information to be fully exploited, it must be released in multiple forms across multiple media over a variable amount of time. No longer will a one-shot single output format allow a content provider to fully exploit their data. Formatting-based solutions (HTML, RTF, PDF, QuarkXPress, etc.) are still valid for presenting information in the traditional manner. The information, however, should be stored with the structure marked up, with stylesheets used to generate the proper output format for the desired media. This allows anything from form letters (paper or email) to web pages to press releases to be generated from a single source of data, in a format expected by the recipient.
Note the element of pragmatism here. The Structuralist doesn't pretend to advocate that structurally marked up documents are desirable in every situation, for every application and user. The Structuralist shouldn't foolishly advocate that every application everywhere understand and display SGML or XML with a stylesheet. Instead, the Structuralist advocates that valuable information be stored in a structure-based format (SGML or XML) that allows for easy repurposing of data to fit the problem du jour.
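The single-source idea can be sketched in a few lines. In the sketch below, a structurally marked-up source is rendered twice, once as HTML and once as plain text, by swapping the "stylesheet". Everything here is an illustrative assumption rather than a prescribed design: the release, title, and para element names, the sample text, the HTML_STYLE and TEXT_STYLE tables, and the render function are all invented, and a simple Python dictionary stands in for a real stylesheet language.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured source: the markup says what each part is,
# not how it should look.
SOURCE = (
    "<release>"
    "<title>New Catalogue Available</title>"
    "<para>The autumn catalogue ships this week.</para>"
    "<para>Contact the sales desk for a copy.</para>"
    "</release>"
)

# Each "stylesheet" maps an element name to a formatting rule.
# These dictionaries are a stand-in for a real stylesheet language.
HTML_STYLE = {
    "title": lambda text: "<h1>" + escape(text) + "</h1>",
    "para": lambda text: "<p>" + escape(text) + "</p>",
}
TEXT_STYLE = {
    "title": lambda text: text.upper(),
    "para": lambda text: text,
}


def render(xml_text, style):
    """Apply a style table to each top-level element of the source."""
    lines = []
    for child in ET.fromstring(xml_text):
        rule = style.get(child.tag)
        if rule is not None:  # elements without a rule are skipped
            lines.append(rule((child.text or "").strip()))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(SOURCE, HTML_STYLE))  # web page fragment
    print(render(SOURCE, TEXT_STYLE))  # plain-text email body
```

The stored data never changes; only the style table does. Adding a third output format means writing a third table, not converting the archive.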
This idea and definition of Structuralism is still in its infancy. Ideas are still being developed. Problems are still being dealt with. Solutions are still being sought. This document does not pretend to present the position flawlessly. This document does not anticipate and answer all questions inherent to the discussion.
As questions arise, problems develop, and solutions are posed, these ideas will need to be revisited and re-examined.
www.W3.org: The World Wide Web Consortium (W3C) was founded in October 1994 to lead the World Wide Web to its full potential by developing common protocols that promote its evolution and ensure its interoperability.
www.XML.com: The mission of XML.com is to help you discover XML and learn how this new Internet technology can solve real-world problems in information management and electronic commerce.
www.WebStandards.org: The Web Standards Project (WaSP) is a collective effort of web developers, tool developers, and end users. Our mission is to stop the fragmentation of the web, by persuading the browser makers that common standards are in everyone's best interest.
www.ifi.uio.no/%7Epaalso/artikler/styles/new-version/pap.html: On the problems of migrating from formatting-based documents to structure-based (SGML) documents.
Copyright 1998. All Rights Reserved.
This page was last updated 09/03/1998 at 07:04:46 PM.