The article provides a very good example in support of the argument for domain-specific languages to help you crank out domain-specific code. I like R and use it. But I also code my own methods (in Haskell these days, but that's not the main point). Before you go and build your company's code base around R, realize the following:
1) The primary users of R are academics and students comparing various methods on the various data sets accessible through R.
2) The code is not really designed with superb reliability in mind. I have debugged a contributor's Fortran code and sent in a patch that never appeared in the R code base. The bug remains. Professors often don't support code all that well. Don't blame them. Support is left as an exercise.
3) There are no assurances about the scalability of any particular routine--even if the algorithm scales in theory.
Do try R. It's good. But don't think SAS and the like will disappear. They cater to the production requirements of big companies. And don't use R as an excuse not to write your own production code.
1) With the integration of R and PostgreSQL (PL/R), this might change. For example, calculate the area of a complex spherical polygon on Earth using PL/R:
CREATE FUNCTION plr_polygon_area(
    latitude double precision[],
    longitude double precision[])
RETURNS double precision AS
$BODY$
    areaPolygon( cbind( longitude, latitude ) )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 1;
Pretty powerful.
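For anyone who wants to poke at the R half of that function outside the database: areaPolygon() isn't named in the snippet above, but it matches the function of that name in the geosphere package (an assumption on my part), and you can try it directly in an R session:

```r
# areaPolygon() is assumed to come from the geosphere package:
#   install.packages("geosphere")
library(geosphere)

# Roughly a one-degree square on the equator, as (longitude, latitude) columns,
# the same cbind(longitude, latitude) shape the PL/R body builds.
lon <- c(0, 1, 1, 0)
lat <- c(0, 0, 1, 1)

# Returns the polygon's area in square metres on the ellipsoid.
areaPolygon(cbind(lon, lat))
```

The same two-column matrix is what the PL/R function hands to areaPolygon(), so results from the database and from an interactive session should agree.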
2) The learning curve for R was 30 days for me. (Still learning, but nothing feels alien anymore.) I submitted a bug to a professor. Not only did he fix the bug within a few hours, but he offered to personally send a new build for my platform. He also suggested a performance improvement for my code (by practically rewriting it) that made it 43 times faster.
The PL/R mailing list has been nothing but helpful and expedient.
3) Scalability of R functions is not too difficult to test.
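A minimal sketch of that kind of empirical scaling test, in base R — sort() here is just a stand-in for whatever routine you're vetting:

```r
# Time a routine at increasing input sizes to see how it scales in practice.
sizes <- c(1e4, 1e5, 1e6)
timings <- sapply(sizes, function(n) {
  x <- runif(n)                       # random input of size n
  system.time(sort(x))["elapsed"]     # wall-clock seconds for the routine
})
data.frame(n = sizes, seconds = timings)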
Foreign function calls are a good thing to support. R does this well and lots of R routines already call C or Fortran libraries.
Domain specific languages are good (and fun). Interacting with open source developers can be an awesome experience. I agree.
I just don't see R as a platform that can integrate tightly with other business systems. It's a user-oriented tool. If there are use cases out there that disprove this, it would be interesting to hear about them, especially as we move into the era of "big data".
I used PHP, PostgreSQL, R, and JasperReports to analyse 273 million rows of data across 8,000 weather stations spanning The Great White (soon to be Green by the looks of it) North for the last 110 years.
The trend line, shown in orange, is calculated in R using a Generalized Additive Model. There is no way I was going to (or even could) write such a complex algorithm myself. When I started the project, I was using MySQL. I migrated the database to PostgreSQL specifically so that I could use R for the analysis. I migrated the database before learning R.
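The post doesn't say which GAM implementation was used; the sketch below assumes mgcv, the package bundled with R, and fabricated data in place of the weather-station rows — just to show the shape of such a fit:

```r
# Hedged sketch: fit a smooth GAM trend over time with mgcv (bundled with R).
# The year/temp data here is simulated, standing in for real station readings.
library(mgcv)

set.seed(1)
year <- 1900:2009
temp <- 0.01 * (year - 1900) + sin(year / 8) + rnorm(length(year), sd = 0.5)

fit <- gam(temp ~ s(year))   # s() requests a smooth term over year
trend <- predict(fit)        # the fitted trend line, ready to plot
```

The whole model specification is one formula; the spline-basis construction and smoothness selection that make this "a complex algorithm" happen entirely inside gam().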
You know, I often forget to mention formula notation as a reason I love R. It is superbly natural. R's native support for data frames (collections of multidimensional observations) and formula notation makes it easy for every disparate library to converge on a common language for high-level usage. Pretty near every function shares the same first two parameters:
out <- f(formula, data, ... other options ...)
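Two concrete instances of that shared idiom, using the built-in mtcars data set:

```r
# The same formula/data interface across unrelated modelling functions.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)        # linear regression
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)   # one-way ANOVA

coef(fit_lm)   # intercept plus one coefficient per predictor
```

Swap lm for glm, aov, or a modelling function from a contributed package, and the call shape stays the same — that's the convergence being described.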
I certainly agree that coming up the learning curve with R has greatly improved my working knowledge of statistics. However, lately I find myself becoming increasingly frustrated with the language itself as my projects grow in complexity. Of course this could be due to my lack of skill with R or programming in general... Does anyone else have this issue?
I did, quite quickly. In particular, R seems to be focused on univariate data; support for multivariate data (things like "lego plot" histograms, or any kind of histogram in more than one variable) is patchy at best.
Having said that, I must note that I am new to R, and rejecting anything so quickly makes me very nervous. I suspect that I have not even come close to plumbing the depths, so to speak, of R's capabilities.
To both of you: I’m far from a statistician, so YMMV, but after trying to work with multivariate data in both R and MATLAB, I really found using numpy/scipy to be substantially nicer. And even better, if you ever decide to do anything else in a program, like munge text or interact with internet services, a general-purpose language like Python is a big advantage.
You can check out the lattice, ggplot2, scatterplot3d and rgl libraries to see how R handles multivariate data - R is especially well suited for analyzing panel (longitudinal) data with multiple variables.
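As one small example of what those libraries offer, a conditioning plot with lattice (which ships with R) splits a scatter plot into one panel per group — here using the built-in CO2 panel data set:

```r
# One scatter panel of CO2 uptake vs. concentration per plant Type,
# using the lattice package's formula-based conditioning syntax.
library(lattice)

p <- xyplot(uptake ~ conc | Type, data = CO2,
            type = c("p", "smooth"))  # points plus a smoothed trend per panel
print(p)
```

The `| Type` term in the formula is what requests the per-group panels — the same formula notation discussed above, extended to plotting.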
Google apparently runs R on desktops (http://dataspora.com/blog/predictive-analytics-using-r/) and ports it to Python or C++ once their ideas are tested (for production code). Having said that, if your list of functions is growing, try packaging them up in a library - there are many tutorials available on the web. If your data is becoming big, try using RHIPE (Hadoop integration), bigmemory, snow, or a number of other packages:
http://cran.r-project.org/web/views/HighPerformanceComputing...
Matlab, R, and S-PLUS are all different animals. They are usually used for different things (and what they are used for can vary from person to person). For instance, I use matlab for object oriented scientific programming, R for quick and dirty (but math intensive) scripts, and S-PLUS for quick and dirty stats intensive scripts. Long story short, S-PLUS is geared towards statistics (better modules, functions, etc). If you do some searching, you'll find that R is actually an offspring of S, with S-PLUS being R's sibling. If one can program in R, picking up S-PLUS should be cake since the syntax and programming hats are similar.
I don't really like to type that much, hence my initial terse commentary--but I can certainly oblige someone who is genuinely interested :-)
S-PLUS objects reside on the hard drive whereas R's are stored entirely in memory. So R is faster for smaller computations but runs against limitations when the data sets are large, though you can use the bigmemory library or store your data in external databases - e.g., SQLite or PostgreSQL - and pull off chunks as you need them. R also has a much more extensive library, but I heard that S-PLUS (as of version 8) made their program compatible with R so that R's libraries can be used in S-PLUS. Also, R has lexical scoping; I think S-PLUS only has global and local like Matlab. I personally like lexical scoping and can't think of cases when you'd find S-PLUS's scope definition advantageous.
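The classic illustration of what lexical scoping buys you is a closure - the inner function keeps a reference to the environment it was created in, so state persists between calls without any globals:

```r
# Lexical scoping in R: the returned function closes over `count`,
# which lives on in make_counter()'s environment between calls.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # <<- assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2
```

Under global/local-only scoping, there is no enclosing environment for `count` to live in, so this pattern simply can't be expressed the same way.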