Search This Blog

Sunday, February 23, 2014

Connecting to PostgreSQL database to do R analysis

In the previous post, we saw how to install R on Ubuntu. To perform analysis on PostgreSQL data, we need to have the RPostgreSQL library installed. Installing on Ubuntu is simple, and you need to follow the scripts listed below.

Install libpq-dev library

First step is to ensure that the libpq-dev library is installed. This can be installed by entering the following command.

$ sudo apt-get install libpq-dev




Install RPostgreSQL Library


Now that we have the library installed, we need to setup PostgreSQL library for R on our machine. To do this, we need to start R as a super user.

$ sudo R




Next step is to run the command to install additional packages. Enter the following command on the R prompt.

> install.packages()



This prompts a bunch of mirrors from where new packages can be downloaded from. Choose your location from the list.




Choose the appropriate mirror for your location and continue...

Next, we need to choose the package. Choose RPostgreSQL, as shown below



Clicking OK will start the download and installation.


Now let's see if we can connect to a database, and use the R library.

To do so, we need to load the RPostgreSQL library, load the driver for PostgreSQL and make a connection to our database with an authorized user and password.

Here are the steps to do so....
> library(RPostgreSQL)
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, user = "nydot", password = "nydot", dbname = "nydot", host = "localhost", port = 5432)
This loads the library and connects to the database as shown in our terminal window.



Now that the connection is made, I can fire a query to load the data. You do so by using the command dbGetQuery as shown in the following statement.

> rs <- dbGetQuery(con, "select rcstation, start_time, direction, lane, count from traffic")

This assigns the results of the query to our object.


Just typing rs on the console will list all the data that got loaded.

> rs

Will list all contents of the object containing the results of the query. If we now wanted to check the mean count of all the data in the results we could do so as follows:

> mean(rs$count)

Here mean() is the function that we want to execute that calculates the average; rs is the object holding our results and rs$count refers to the field “count” in our resultset.



That's it.... We saw how we can calculate statistics from datasets in a PostgreSQL database.




No comments: