Publication | Legaltech News
Nervous System: The Context of Keywords
In November 1958, information scientists and engineers convened at an international conference to share their latest discoveries and inventions. One inventor stood out. While other computer scientists of the day prized their ability to create machines that could sift and sort numbers, Hans Peter Luhn created a machine with the seemingly magical ability to process human language. His machines efficiently sifted and sorted words. Luhn’s Key Words In Context system indexed not only individual words but also their surrounding words, making it possible not only to find a given term but to find it in a given context. Because Luhn’s inventions have become commonplace features of modern eDiscovery, appreciating their revolutionary nature requires stepping back into the mindset of the attendees at that conference some sixty-five years ago.
The central concern of information science is the optimal organization of information to best facilitate its use. Historically, this meant finding the ideal organization of physical documents on shared premises, on the assumption that a user interested in a certain publication likely would be interested in something similar. The advent of the age of the computer and electronic storage of documents turned this concept on its head.
Luhn and his contemporaries were grappling with the implications of electronic storage: it allowed for the aggregation of massive collections of data, for which the most useful organizing principle was less the physical location of any document than its informational content. As it happened, this was the very question that most animated Luhn.
For example, in 1953, in his role as manager of the Information Retrieval Research Division at IBM, Luhn invented the concept of “hash coding.” The concept involved applying an algorithm to a piece of electronic data to derive a numerical representation that could be used to catalog it. The hash value would serve as an index entry to speed up searching. Instead of scanning a vast collection sequentially for a specific record, the system could compute the record’s hash value and use it to locate the corresponding entry far more quickly.
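The idea can be illustrated with a toy sketch in Python. This is not Luhn’s actual implementation; the hash function, table size, and record keys below are invented for the example, but the principle is his: derive a number from the data itself and use that number as the catalog address.

```python
def hash_code(key: str, table_size: int) -> int:
    """Derive a numeric bucket from a record's key: a toy
    stand-in for Luhn's idea of hashing data into catalog entries."""
    value = 0
    for ch in key:
        value = (value * 31 + ord(ch)) % table_size
    return value

# Build a hash index: bucket number -> list of record keys
table_size = 8
index = {bucket: [] for bucket in range(table_size)}
records = ["smith-1953", "jones-1954", "luhn-1958"]
for record in records:
    index[hash_code(record, table_size)].append(record)

def lookup(key: str) -> bool:
    """Check one bucket instead of scanning every record."""
    return key in index[hash_code(key, table_size)]
```

A lookup computes the key’s hash and inspects only that bucket, rather than walking the entire collection, which is precisely the speedup Luhn was after.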
The following year, Luhn patented a “Computer For Verifying Numbers.” Although what Luhn patented was the handheld device itself, the concept underlying the physical contraption was what mattered—and what has lasted. American society was increasingly oriented around various identifiers, such as Social Security Numbers, bank account numbers, and so on. In turn, Luhn sought a way for an authority to verify whether any given identifier was valid.
The idea was to perform a series of calculations on the identifier to generate a single digit that could be appended to the end of the identifier. To validate the integrity of the ID number, one simply needed to take the sequence of leading digits and re-run the same algorithm; if the output matched the last digit, the sequence was accepted as valid. The concept of a sum used to check the overall integrity of the ID, or a checksum for short, remains in use today throughout information technology systems as a mechanism for detecting data errors.
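The scheme survives today as the mod-10 “Luhn algorithm” used, among other places, to validate payment card numbers. A minimal Python version (the presentation is modern; the patent’s own arithmetic was expressed differently):

```python
def luhn_check_digit(identifier: str) -> int:
    """Compute the mod-10 check digit for a numeric identifier.
    Every second digit, counted from the right end of the
    identifier, is doubled; doubled values above 9 have 9
    subtracted; the check digit brings the total to a multiple
    of 10."""
    total = 0
    for i, ch in enumerate(reversed(identifier)):
        d = int(ch)
        if i % 2 == 0:       # positions that will sit in "even"
            d *= 2           # slots once the check digit is added
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def is_valid(number: str) -> bool:
    """Validate a full number: identifier plus trailing check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])
```

Re-running the calculation on the leading digits and comparing against the final digit is exactly the verification step the article describes: a single transposed or mistyped digit almost always changes the result and exposes the error.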
At the International Conference on Scientific Information in Washington, DC in November 1958, Luhn unveiled the Key Words In Context system. As originally conceived, the system was designed to develop multiple index entries for academic papers for the most important—that is, key—words in their titles.
By way of illustration, consider the paper in which Luhn described this new approach. Its title was “Bibliography and index: Literature on information retrieval and machine translation.” Ignoring those title words that serve only a linguistic function (in this case, “and” and “on”), the keywords of the title are “bibliography,” “index,” “literature,” “information,” “retrieval,” “machine,” and “translation.” Each of those seven keywords would get its own index entry, in which the immediately surrounding words of the title would be included as context. Luhn’s system generated these index entries automatically from the keywords of the title, without relying on any human’s subjective reading of the paper’s content.
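A simplified sketch of that index generation, using Luhn’s own title as input (the stop-word list and the formatting of entries here are invented for illustration; real KWIC indexes used various typographic conventions to align the keyword):

```python
# A toy stop-word list; real systems used longer curated lists.
STOP_WORDS = {"and", "on", "the", "of", "in", "a", "for"}

def kwic_entries(title: str):
    """Generate one index entry per keyword: the keyword plus the
    full title as context, with the keyword's occurrence marked
    in upper case. Returns entries sorted alphabetically."""
    words = title.lower().replace(":", "").split()
    entries = []
    for i, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        context = " ".join(
            w.upper() if j == i else w for j, w in enumerate(words)
        )
        entries.append((word, context))
    return sorted(entries)
```

Each significant word becomes its own alphabetized entry carrying its context along with it, so a reader scanning the index under “retrieval” immediately sees that the paper concerns *information* retrieval rather than, say, document retrieval from a warehouse.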
Shortly after unveiling Key Words In Context, Luhn published a related article—“A Business Intelligence System”—in which he proposed using a variation of the automated indexing system to algorithmically generate executive summaries of documents. Importantly, in light of modern uses of artificial intelligence (AI), the intention was to allow machines to process text-based information on their own without human readers. Luhn supposed that large organizations could use the technique to swiftly distribute information to the right stakeholders. In a proof-of-concept demonstration, Luhn fed a long scientific article into a computer, which quickly extracted a four-sentence summary. The machine had seemingly understood the written text, when in fact it had merely performed a statistical analysis of its use of language.
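Luhn’s auto-abstracting method ranked sentences by how densely they used the document’s high-frequency significant words. A toy sketch in that spirit (the stop-word list, scoring, and sample text are simplified inventions; Luhn’s published method also weighted clusters of significant words within a sentence):

```python
import re
from collections import Counter

# A toy stop-word list standing in for Luhn's "common word" filter.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on",
              "is", "was", "that"}

def summarize(text: str, n_sentences: int = 2):
    """Extractive summary: score each sentence by the corpus-wide
    frequency of its significant words, then return the top-scoring
    sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOP_WORDS)

    def score(sentence: str) -> int:
        return sum(freq[w]
                   for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOP_WORDS)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

No comprehension is involved: sentences rich in the document’s most frequent significant words float to the top, which is why Luhn’s demonstration could look like understanding while being, as the article notes, merely a statistical analysis of language use.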
As “A Business Intelligence System” demonstrated, automated keyword indexing had applications beyond parsing the titles of academic papers. The concept has since been applied to indexing the entire contents of texts while excluding nonsignificant “stop words.” With relatively few revisions, the Key Words In Context technique remains in everyday use for indexing electronic texts, including those handled by eDiscovery review platforms.
The views and opinions expressed in this article are those of the author and do not necessarily reflect the opinions, position, or policy of Berkeley Research Group, LLC or its other employees and affiliates.