(Long) Notes from the Text Mining Summit

The Text Mining Summit held earlier this week was one of the best conferences I’ve been to in a long time.  This was largely because of its small scale, which turned the whole conference into one long ongoing conversation.  I find that this still happens fairly often at single-company analyst meetings (investor or product/industry), but very rarely at multivendor conferences such as this one.

In retrospect, the definition of “text mining” used at the conference appears to have been, in essence, “Making inferences from large corpora of text.”  This has two main subcategories:  Data mining that uses text data (“statistics about statements”), and everything else.

The original market for text mining was apparently intelligence agencies (who, not coincidentally, were also the original market for relational and text databases), but the explosive growth that started late last year has largely been fueled by CRM.  A lot of other applications have emerged too, as discussed below.

There was a general view at the conference that text mining will increasingly inform and otherwise meld with search.    However, it is an easy mistake to see text mining and search as being closer than they really are (and indeed that confusion was one of the few significant mistakes made by the conference organizers).

After CRM and intelligence, the next big application area is probably product quality.  The auto industry in particular has had some high-payoff projects analyzing warranty claims for early warning of product defects.  In general, the term “early warning” arose in a number of application domains, and a number of attendees saw it as a major future growth area.  Fraud detection is also big, just as it is in more traditional forms of analytics.

A couple of speakers (from Factiva and Temis) presented applications in which companies could text mine the Web to assess their own corporate or product reputations and images.  To some extent, this fits into the “early warning” category.  This particular market is in its very early stages; on the one hand I thought the speakers were exaggerating the importance of their products, while on the other I thought they were overlooking major opportunities for those products to create significant value.

Other uses of text mining include, in no particular order:

  • Examining research papers to discover pathways, interactions, etc. of interest.  This may be terribly important in saving lives, but the market is basically limited to large drug companies, and there aren’t a lot of those.  Based as much on a Healthcare-IT conference a few weeks ago as on the text mining summit itself, I gather this is a $10 million or so market with over 10 worthy competitors.
  • Mining resumes to see what characteristics have led to successful hires, as a guide to future hiring decisions.  It is easier to make such judgments if you’re a temporary or permanent staffing agency than if you’re the end-user of the employees.   However, EDS also spoke of a human resources application it was using.
  • Automated classification as a shortcut to data cleaning.  SAS proposed this one.  I’m not sure that it really counts as text mining, but if it works it’s surely a cool idea.
  • Early warning of third-world epidemic outbreaks.  One would think that the World Health Organization’s system for checking out reports of infectious diseases would be reliable enough that it could never be outperformed by a system that mines journalistic articles available on the Web.  According to a presentation at the conference, one would be wrong.  However, the speaker on this subject, while stating that text mining gave great assistance during the SARS outbreak, was quite shy about giving specific details of the information discovered or the way that information helped save lives.

And this is only a partial list.  (Apologies to those left out; you're encouraged to add them as comments to this post.)

Notwithstanding the AI-like nature of the subject matter, the conference was for the most part grounded in reality.  Questions from the floor were often detailed and incisive.  And while some speakers were pie-in-the-sky overoptimistic, there was a general concession that some areas of technology weren’t quite ready yet.

In particular, views on voice mining were mixed, but a rough consensus would be as follows:  Mining of transcriptions of voice in a limited application domain works pretty well today.  Non-textual aspects of a voice communication, such as intonations, aren’t mined yet.  One of the big difficulties in mining voice is of course signal noise; those transcriptions have an annoyingly high error rate.  One speaker suggested that the signal quality really matters; e.g., telephone intercepts for intelligence/law enforcement are problematic, while broadcast media are a lot easier to deal with.  Also problematic is that, like IMs, voice interactions are typically less grammatically structured than even email, let alone more formal written documents.

There was also considerable discussion of ontologies and taxonomies.  These are important things for text mining.   Off-the-shelf ones don’t work; they are surprisingly company- and application-specific.  Wholly automatically generated ones don’t work either.  Combining ontologies is basically an unsolved AI problem.   And I got a lot of agreement to an observation I made early on – wherever there are ontologies, there are consulting engagements.   Ontologies are a huge reason why text mining “solutions” are rarely just packaged plug-and-play apps.

In related news, I've already made a couple of posts on specific points derived from the summit, namely web-scraping (obviously important to web data mining) and my frustration with a much-overused, probably bogus statistic.

What People Are Saying

Did many people really

Did many people really regard that as news?  I got lots of positive feedback on my comment that "Wherever there are ontologies, there also is consulting."  It felt like everybody in the room personally agreed with that, even if they thought it was a point too often glossed over in marketing hype.

For more information about Curt Monash, see his bio.

The most upfront speakers

The most upfront speakers there were the Attensity people who talked about the knowledge acquisition "elephant". They may not have anything out yet, but were the first people to truly acknowledge that building all the NLP resources up is painstaking and time-consuming. It's surprising that many of the companies thought this to be news...

CRM alone has a lot of applications for text mining.
-HR debriefing of employee opinion documents
-call centre management at cellphone providers & internet service providers to reduce turnover