Sharon Machlis

Is Python really supplanting R for data work?

November 26, 2013 1:28 PM EST

After years of chortling at those who could get worked up enough to launch flame wars over Windows vs. Mac vs. Linux or iOS vs. Android, I seem to be engaged in one of my own: Python vs. R.

Except this one isn't quite the same, since we're not arguing over which is better for data analysis (I've got no idea, since I know little Python beyond "Hello, world") but which is more popular. And for that claim, sorry, I think some data is required.

It started with a piece headlined Python Displacing R As The Programming Language For Data Science. Now, if you're not familiar with R, one of the most appealing things about it is its large community and thousands of add-on packages making it easier to do tasks ranging from analyzing your Twitter feed to creating interactive graphics. If R is going to be "displaced" for data work, meaning the community is likely to dwindle, that's a pretty serious charge.

So I went to read the article, expecting to see some comparative data, and instead saw an opinion piece. Boiled down, here's the premise: "Python is general purpose and comparatively easy to learn whereas R remains a somewhat complex programming environment to master. In a world increasingly dependent on data and starved for data scientists, 'easy' wins."

Prompting my somewhat inflammatory tweet:

To his credit, author Matt Asay responded, although I was less impressed with the actual answer:

Oooookay, no need to actually prove the claim in the headline within the text of the article itself, as long as you link your sources. We're no longer just readers; we're now part of the research process itself, expected to collect information from primary sources ourselves and help the writer bolster his premise. Very well, off to click I went.

The first link is another of his own articles, this one from PyCon, which also claims "Python is fast becoming the Big Data language of choice for the enterprise." The data proving this? 2,500 people attended the Python conference (no comparisons to attendance at conferences for other languages), Python job listings are growing faster than those for several other languages (that's overall job growth and not data-science-specific listings), and DARPA invested $3 million in a company helping to improve Python's data processing and visualization capabilities. (Of course, DARPA has invested millions more in other computing projects; that by itself is an anecdote, not data.)

There are several more links to articles arguing that R is hard to learn, that Python is a great language and that it's better to do all your work in one tool than to use several specialized ones. There's even a link to an IDG News Service story right here on Computerworld that points to the usefulness of Python for big data work.

Yes, from all I've read about and heard of Python, it's a popular language and very useful for data work. But that's not the same thing as displacing an alternative. iOS 7 is a great platform for mobile devices and very easy to use, but that doesn't mean it's displaced Android, does it?

In fact, even after following links in the article, I didn't find a single actual data point that tells me whether or not Python's increasing general popularity is a) specifically for data science or b) at R's expense.

In the Twitter burst that followed my initial comment, a few tried to come up with actual data points. R guru Hadley Wickham posted some on GitHub, including graphs of questions on the programming Q&A site Stack Overflow for the two languages, numbers of GitHub repositories and queries on Google (although that's a tough one to quantify, given the difficulty of searching on Google for "R"). His data show that interest in both languages is rising on both Stack Overflow and GitHub. While this can't parse out how much of Python's growth relates to data science, it does make the premise that Python's growth is at R's expense somewhat questionable.

The most relevant data presented on Twitter came from RedMonk co-founder Steve O'Grady, who pointed to the language poll on KDnuggets, a site specifically for data mining and analysis. That showed 61% of respondents using R this year vs. 53% in 2012, while 39% used Python this year vs. 36% in 2012. In addition, "people who use R are about 13% more likely to use Python than overall population," according to the survey.
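
For anyone who wants the poll comparison spelled out, the figures quoted above can be put side by side in a few lines of Python. This is only an illustrative sketch: the dictionary and function names are made up for this example, "this year" is taken to be 2013, and the numbers are just the four percentages cited from the KDnuggets poll.

```python
# Share of KDnuggets poll respondents using each language, in percent,
# as quoted above. The names below are illustrative, not from the survey.
usage = {
    "R": {2012: 53, 2013: 61},
    "Python": {2012: 36, 2013: 39},
}


def point_change(shares, start=2012, end=2013):
    """Return the change in share between two years, in percentage points."""
    return shares[end] - shares[start]


# Year-over-year change for each language.
changes = {lang: point_change(shares) for lang, shares in usage.items()}

for lang, delta in changes.items():
    print(f"{lang}: {delta:+d} percentage points")
```

On these numbers both shares rose, R's by eight points and Python's by three.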

Are KDnuggets readers who answered the poll a representative sample of the data science community as a whole? I'm not sure, but this is one of the best data points I've seen as to the relative popularity of the languages for data work. And it sure doesn't show Python displacing R.

So yes, I'll agree that many people think R is hard. (Many people also think data science is hard, but that doesn't seem to be slowing the field.) I'll also agree that Python is an elegant and popular language useful for data work. I've got nothing against Python; if I had the time, I'd be interested in learning it myself. But it's still a big leap from "R is hard and I like Python" to "Python is displacing R." And as any good data scientist knows, the burden is on the researcher who makes a claim to prove it, not on his or her readers to conduct research in order to find it false.

See more from the Data Avenger series.