The tables have not yet turned

I just spent two great days at the Text Mining Summit.  But before I post on all the good stuff, there's one stat -- used in almost every talk but seemingly very bogus -- that I want to vent about.

This is the claim that corporate data is 80% text (or 80% unstructured?) and only 20% structured/tabular.   I didn't actually catch a source being cited, but I have to think that this stat comes via some kind of a byte count -- 4X as many bytes being used to store text (on disk or paper?) as are used to store transactional information, or something like that.

Problem:  Numeric data contains a lot more information per byte than typical text does.  Storing the number 24,237 takes far fewer bytes than storing the text string "twenty-four thousand, two hundred and thirty-seven."  And that's even before we account for the way my 2209 characters of raw, unformatted notes from the conference (2649 counting spaces) have turned into a 24K Microsoft Word document.
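To make the byte-count point concrete, here is a rough Python sketch comparing three ways of storing that same number. The storage choices are my assumptions for illustration (a fixed-width 32-bit integer, and plain ASCII for the text forms); real databases and document formats will differ in the details.

```python
import struct

number = 24237
spelled_out = "twenty-four thousand, two hundred and thirty-seven"

# Assumption: the number is stored as a fixed-width 32-bit signed integer,
# roughly what a typical database integer column would use.
binary_bytes = len(struct.pack("<i", number))

# The same number written as decimal digits in a text field (ASCII assumed).
digit_bytes = len(str(number).encode("ascii"))

# The same number spelled out in English prose (ASCII assumed).
text_bytes = len(spelled_out.encode("ascii"))

print("32-bit integer:", binary_bytes, "bytes")
print("decimal digits:", digit_bytes, "bytes")
print("spelled out:   ", text_bytes, "bytes")
```

The information content is identical in all three cases; only the byte count changes, which is why a raw byte comparison overstates how much of the information lives in text.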

Text processing is a wonderful thing, and enterprises don't do nearly enough of it.  But let's not get ridiculous in our enthusiasm; from an IT perspective, a large fraction of an enterprise's information is indeed the stuff stored in rows and columns, in alphanumeric form.
