
Web-scraping solutions?

There are times you want to automatically gather information from web pages, possibly from sites whose owners would not derive great joy from you doing this. 

This is not a terribly hard technical problem, but it is also far from trivial given all the weird things that can happen on websites these days.
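To make "not terribly hard, but far from trivial" concrete, here is a minimal sketch of the easy part: pulling links out of slightly messy HTML using only Python's standard library. The class name, function name, and sample markup are all made up for illustration. Real pages add the hard parts (scripts, logins, layouts that change without notice), which this does not attempt to handle.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample with mixed case and nested/unclosed tags.
SAMPLE = '<p>See <a href="/a">First <b>item</b></a> and <A HREF="/b">second<br></a></p>'
print(extract_links(SAMPLE))
```

Fetching the page, following links, and coping with whatever the site does next are where the "weird things" come in.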

Well, I ran into a couple of guys at the Text Mining Summit who claim to solve just that problem. Their company -- the amusingly named ScrapeGoat -- seems to consist of a few programmers and a toolkit, which is probably enhanced continually as ever more webpage weirdnesses are discovered. They prefer to do the programming for you, but will sell you the toolkit for your own use if you absolutely insist.

I haven't actually checked out their work, talked with customers, or even seriously given them my usual third-degree grilling.  In other words, I haven't done proper analysis on them at all.  But with those rather comprehensive disclaimers out of the way -- my hunch is that you won't be sorry you gave them a call.

EDIT:  At first blush, it appears that QL2 is a more industrial-strength and productized version of the same thing.  I'll try to check them out and report back.

What People Are Saying

Look at

Look at SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)

SW Explorer Automation (SWEA) creates an object model (automation interface) for any Web application running in Internet Explorer. The automation interface consists of pages (scenes) and controls, and each page consists of controls. The following controls are supported: HtmlContent, HtmlAnchor, HtmlImage, HtmlInputButton, HtmlInputCheckBox, HtmlInputRadioButton, HtmlInputText, HtmlSelect, HtmlTextArea. The object model is defined visually in the SWEA designer, which also lets you record scripts (C# and VB) based on the defined application object model.

It is very easy to create a scraping solution for any Web site using SWEA.
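As a rough illustration of what a page-and-controls object model looks like, here is a small Python sketch. It is not SWEA's actual API (SWEA scripts are recorded in C# or VB, and its real classes are not shown here). The `Page` and `Control` classes and the "SearchPage" example are hypothetical; only the control type names are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Control:
    """One named element on a page (hypothetical, not a SWEA class)."""
    name: str
    kind: str        # e.g. "HtmlInputText" or "HtmlAnchor", from the list above
    value: str = ""


@dataclass
class Page:
    """A page (scene) is a named collection of controls."""
    name: str
    controls: dict = field(default_factory=dict)

    def add(self, control):
        self.controls[control.name] = control
        return control


# Describe a made-up search page, then drive it by control name
# instead of by raw HTML.
search = Page("SearchPage")
search.add(Control("query", "HtmlInputText"))
search.add(Control("submit", "HtmlInputButton"))
search.controls["query"].value = "web scraping"
```

The appeal of this kind of model is that a script refers to "the query box on the search page" rather than to markup that may shift from one day to the next.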

Dumbed Down Tools are

Dumbed Down Tools are coming.

A comment was made at the Text Mining Summit that current text-mining tools have difficulty going mainstream because they can't be "dumbed down" enough to be easy for consumers to use. Data scraping tools seem to have the same issue.

At ScrapeGoat.com we evaluate any scraping tool we find - large or small. Most of them aren't worth the time to download. Occasionally we see some pretty neat and sophisticated applications that have some merit, such as the QL2 software.

Yet, without exception, we have found the applications either to be ineffective or to have a long, steep learning curve. (Even with QL2's SQL/XPath approach we were several hours into the tutorials before we could do anything useful with the tool.)

Even when an enterprise has a suitable programmer, that programmer still has to learn the tool, learn how to extract fields from the data source, and learn how to avoid the pitfalls of data scraping, such as getting blocked by the data source, ingesting planted or tainted data, or even damaging the company's reputation. On top of all this, the company still has to pay a respectable wage to a developer with the necessary skills.
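One of those pitfalls, getting blocked, can be made a little less likely by respecting a site's robots.txt and pacing requests. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the URLs, and the "example-bot" user agent are invented for illustration, and this is only one possible approach, not a description of how any vendor mentioned here works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(rules, urls, user_agent="example-bot"):
    """Keep only the URLs this user agent is allowed to fetch."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]


rules = build_rules(ROBOTS_TXT)
urls = [
    "https://example.com/catalog",
    "https://example.com/private/report",
]
allowed = polite_urls(rules, urls)

# Seconds to wait between requests; fall back to 1 if the site names none.
delay = rules.crawl_delay("example-bot") or 1
print(allowed, delay)
```

A real crawler would sleep for `delay` seconds between fetches of the allowed URLs. Planted or tainted data is a separate problem that no amount of politeness solves.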

At ScrapeGoat we believe we offer the best data harvesting and acquisition options for business entities of all sizes. For small businesses that don't have a programmer with the right talent, our custom services provide an alternative to (not quite dumbed down enough) boxed software.

For companies that have the resources and skills to implement the pricier data scraping APIs, ScrapeGoat can often have the data delivered and formatted before the programmer or team finishes working through the API tutorials and specs. For long-term projects, ScrapeGoat has repeatedly proven more cost-effective than paying in-house coders and project managers.

Until such time as data scraping tools can be used by many more people - including those without a technical degree - those of us in the industry who offer custom services will be ahead of those offering boxed solutions.

As one ScrapeGoat client put it, "All the generic web scrapers are too complicated or just useless. The bot you created is simple yet gets the job done better than I expected!" While we don't expect that sentiment will always be the case, for now it seems to be a common theme we are hearing from our customers.

Later this year, ScrapeGoat plans to release a beta version of a "productized" tool we expect to be powerful enough for most web scraping needs, yet easy enough for people without programming aptitude to use effectively.

Aaron Willis said some kind

Aaron Willis said some kind things about QL2 Software (I know I sound like a Snapple commercial) but I wanted to clarify a misconception that QL2’s products are strictly off-the-shelf. While some of our customers’ expectations are met by QL2 Solutions, hosted applications with restricted scope that require no programming skills, many choose WebQL, our software solution. WebQL is installed locally and requires programming expertise, albeit expertise readily available in most IT shops. When the job is esoteric (or programming resources scarce), we frequently provide our customers with custom programming services. Failure to meet customer expectations is fatal in any business but especially in a business space as specialized as ours.

Personally, I’d welcome a discussion about unstructured data acquisition. I think transparency and communication would benefit both the industry and the customer and help clear the air of marketing spiel. Just don’t ask me about Gopher, Veronica, or Archie. I haven’t thought about those since I owned a Kaypro with a screen the size and color of an oscilloscope.

Hi Curt, It was a pleasure

Hi Curt,

It was a pleasure meeting you at the Text Mining Summit in Boston. We appreciate your complimentary write-up about us. We also invite you to give us your third-degree grilling, as it will clear up the misperception that ScrapeGoat is less of an industrial-strength solution.

While our hats are off to the data scraping API that QL2 offers (it is a sweet bit of software with lots of thought and work put into it), it does nothing that we aren't already doing or have done in the past. Additionally, we offer custom services that we were unable to accomplish using QL2's API, such as Telnet scraping and screen scraping of third-party Windows and Linux applications, as well as a variety of other tasks.

It is true that QL2 is more productized; we cater to businesses that prefer to outsource their data collection needs. Our position is that we can handle most companies' data scraping projects faster, more accurately, and typically at lower cost than they would incur purchasing QL2 or another boxed data scraping application and then training and paying a programmer to use it.

Our "toolkit" consists of vast libraries (over a million lines of custom code, written over the last 6 years). We do not sell our toolkit. However, if a company prefers a software application that will run on its own servers, we will use our toolkit to build a project-specific application to meet its needs.

Although ScrapeGoat is a relatively new company, our years of data scraping and unstructured data mining experience coalesced under the umbrella of our web development parent company, which still remembers Gopher, Veronica and Archie.

We would be delighted to see this blog become a discussion of the pros and cons of outsourcing, custom and boxed data scraping solutions. We especially welcome QL2 to contribute to this blog and offer insight into their, we must admit, very cool software.

Aaron Willis
aaron@ScrapeGoat.com
http://www.ScrapeGoat.com