In the previous post, we saw how to
install R on Ubuntu. To perform analysis on PostgreSQL data, we need
to have the RPostgreSQL library installed. Installing on Ubuntu is simple, and
you need to follow the scripts listed below.
Install libpq-dev library
First step is to ensure that the
libpq-dev library is installed. This can be installed by entering the
following command.
$ sudo apt-get install libpq-dev
Install RPostgreSQL Library
Now that we have the library installed,
we need to setup PostgreSQL library for R on our machine. To do this,
we need to start R as a super user.
$ sudo R
Next step is to run the command to
install additional packages. Enter the following command on the R
prompt.
> install.packages()
This prompts a bunch of mirrors from
where new packages can be downloaded from. Choose your location from
the list.
Choose the appropriate mirror for your
location and continue...
Next, we need to choose the package.
Choose RPostgreSQL, as shown below
Clicking OK will start the download and
installation.
Now let's see if we can connect to a
database, and use the R library.
To do so, we need to load the
RPostgreSQL library, load the driver for PostgreSQL and make a
connection to our database with an authorized user and password.
Here are the steps to do so....
> library(RPostgreSQL)
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, user = "nydot", password = "nydot", dbname = "nydot", host = "localhost", port = 5432)
This loads the library and connects to
the database as shown in our terminal window.
Now that the connection is made, I can
fire a query to load the data. You do so by using the command
dbGetQuery as shown in the following statement.
> rs <- dbGetQuery(con, "select rcstation, start_time, direction, lane, count from traffic")
This assigns the results of the query
to our object.
Just typing rs on the console will list
all the data that got loaded.
> rs
Will list all contents of the object
containing the results of the query. If we now wanted to check the
mean count of all the data in the results we could do so as follows:
> mean(rs$count)
Here mean() is the function that we
want to execute that calculates the average; rs is the object holding
our results and rs$count refers to the field “count” in our
resultset.
That's it.... We saw how we can
calculate statistics from datasets in a PostgreSQL database.