The tables have not yet turned
I just spent two great days at the Text Mining Summit. But before I post on all the good stuff, there's one stat -- used in almost every talk but seemingly very bogus -- that I want to vent about.
This is the claim that corporate data is 80% text (or 80% unstructured?) and only 20% structured/tabular. I didn't actually catch a source being cited, but I have to think that this stat comes via some kind of a byte count -- 4X as many bytes being used to store text (on disk or paper?) as are used to store transactional information, or something like that.
Problem: Numeric data contains a lot more information per byte than typical text does. Storing the number 24,237 takes a lot fewer bytes than storing the text string "twenty-four thousand, two hundred and thirty-seven." And that's even before we account for the way my 2209 characters of raw, unformatted notes from the conference (2649 counting spaces) have turned into a 24K Microsoft Word document.
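For the skeptical, here is a rough sketch of that byte arithmetic in Python. The function name `byte_counts` and the choice of a 4-byte integer and ASCII encoding are my own assumptions for illustration, and I'm treating "24K" as 24 x 1024 bytes; real databases and file formats will differ in the details.

```python
import struct

def byte_counts(number=24237,
                words="twenty-four thousand, two hundred and thirty-seven"):
    """Compare the bytes needed for one number in three representations."""
    return {
        # Fixed-width 32-bit integer, roughly how a database column might hold it
        "int32": len(struct.pack("<i", number)),
        # The same number written as plain decimal digits
        "digits": len(str(number).encode("ascii")),
        # The same number spelled out as English text
        "spelled_out": len(words.encode("ascii")),
    }

def word_doc_overhead(raw_chars=2649, doc_kilobytes=24):
    """Ratio of the saved document size to the raw characters (spaces included).

    Assumes 1K = 1024 bytes and one byte per raw character.
    """
    return (doc_kilobytes * 1024) / raw_chars

if __name__ == "__main__":
    for name, size in byte_counts().items():
        print(f"{name}: {size} bytes")
    print(f"Word document overhead: about {word_doc_overhead():.1f}x")
```

The spelled-out form costs more than ten times the bytes of the binary integer while carrying exactly the same information, and the document formatting multiplies the text side again, which is why a raw byte count overstates how much of the information is text.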
Text processing is a wonderful thing, and enterprises don't do nearly enough of it. But let's not get ridiculous in our enthusiasm; from an IT perspective, a large fraction of an enterprise's information is indeed the stuff stored in rows and columns, in alphanumeric form.

