Saturday, March 8, 2014

Analyzing statistical data using R

In this post, we will use R to examine the data that we accessed previously. We repeat the steps from our previous post to load the RPostgreSQL driver and fetch the data, which in this case is hourly traffic counts from New York State.

We first connect to the database, run a query to load the data, and plot the count against the lane.

> library(RPostgreSQL)

> drv <- dbDriver("PostgreSQL")

> con <- dbConnect(drv, user = "nydot", password = "nydot", dbname = "nydot", host = "localhost", port = 5432)

> rs <- dbGetQuery(con, "select rcstation, start_time, direction, lane, count from traffic")

> plot(rs$lane, rs$count)

Next, we plot the count against the start_time.

> plot(rs$start_time, rs$count)

This gets us another ridiculous chart, though it does reveal that our data in fact spans many months.
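One way to make a chart like this easier to read is to collapse the hourly rows into daily totals before plotting. The following is only a minimal sketch under my own assumptions: `rs_demo` is a made-up stand-in for `rs` with just the two columns needed, and the timestamps are assumed to be in UTC. On the real data this would also sum across stations, directions, and lanes unless you filter first.

```r
# Made-up stand-in for rs: two days of hourly counts (not real traffic data).
rs_demo <- data.frame(
  start_time = seq(as.POSIXct("2014-01-06 00:00:00", tz = "UTC"),
                   by = "hour", length.out = 48),
  count = rep(c(10, 20), each = 24)
)

# Derive the calendar day of each reading, then sum the counts per day.
rs_demo$day <- as.Date(rs_demo$start_time, tz = "UTC")
daily <- aggregate(count ~ day, data = rs_demo, FUN = sum)

# One point per day instead of one point per hourly reading.
plot(daily$day, daily$count, type = "b", xlab = "day", ylab = "total count")
```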

To understand whether the data is collected for each station across the year, I plot the following:

> plot(rs$start_time, rs$rcstation)

This shows that the data is actually spread over all twelve months of the year. To understand whether each station has data for the entire year, I applied a filter on the dataset for one rcstation (360001 in this case) and created a plot to see the date range.

In the following case, I have applied the filter and am plotting the start time.

> plot(rs[rs$rcstation==360001,]$start_time)

As we can see from the plot, the data for a single station is actually spread over just one week.
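Instead of filtering and plotting one station at a time, the same check could be done numerically for every station at once. This is a rough sketch, not something run against the real table: `rs_demo` and its station labels "A" and "B" are invented for illustration, and it assumes `start_time` comes back from the query as a date-time (POSIXct) column.

```r
# Made-up stand-in for rs: station "A" covers 7 days, station "B" covers 3 days.
rs_demo <- data.frame(
  rcstation = rep(c("A", "B"), times = c(8, 4)),
  start_time = as.POSIXct("2014-01-06", tz = "UTC") + 86400 * c(0:7, 0:3)
)

# Split the rows by station, then measure the time between each station's
# first and last reading, in days.
span_days <- sapply(split(rs_demo, rs_demo$rcstation), function(d) {
  as.numeric(difftime(max(d$start_time), min(d$start_time), units = "days"))
})

print(span_days)
```

A station with a span of about 7 was counted for roughly one week, while a span near 365 would mean year-round coverage.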

Another way of looking at the data would be the total count for each day of the week. First we would like to check the range of values for the specific days.

We can use the range command to check for this...

> range(rs[rs$rcstation==360001,]$start_time)

In our case, it gave the following response.
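For the total count per day of the week mentioned above, a minimal sketch could look like this. The data frame `rs_demo` is invented for illustration (one reading per day for a week starting Monday, 2014-01-06, plus a second Monday reading), and UTC timestamps are my assumption. I use the `%u` format code rather than `weekdays()` so that the result does not depend on the locale.

```r
# Made-up stand-in for the filtered station data (not real traffic data).
rs_demo <- data.frame(
  start_time = as.POSIXct("2014-01-06 08:00:00", tz = "UTC") +
    86400 * c(0:6, 0),
  count = c(5, 6, 7, 8, 9, 3, 2, 4)
)

# %u is the ISO weekday number as text: "1" = Monday ... "7" = Sunday.
rs_demo$dow <- format(rs_demo$start_time, "%u")

# Sum the counts within each weekday.
dow_totals <- tapply(rs_demo$count, rs_demo$dow, sum)

barplot(dow_totals, xlab = "day of week (1 = Monday)", ylab = "total count")
```

On the real data, the same two lines would be applied to `rs[rs$rcstation==360001,]` instead of `rs_demo`.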

We can plot a histogram of the counts for a single station using the following command:

> hist(rs[rs$rcstation==360001,]$count)

This displays the following histogram.
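If you want the numbers behind the bars rather than the picture, `hist` can be called with `plot = FALSE` and it returns the bin edges and the number of values per bin. A small sketch with made-up counts follows; the vector `counts_demo` and the bin width of 10 are my own choices for illustration, not values from the traffic data.

```r
# Made-up hourly counts standing in for rs[rs$rcstation==360001,]$count.
counts_demo <- c(3, 7, 12, 18, 22, 25, 31, 38, 44, 47)

# Compute the histogram without drawing it, using explicit bin edges.
h <- hist(counts_demo, breaks = seq(0, 50, by = 10), plot = FALSE)

# Show each bin's range next to how many values fell into it.
print(data.frame(from = head(h$breaks, -1),
                 to = tail(h$breaks, -1),
                 n = h$counts))
```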

That's it, we are done with our first exploration. We will pick up more advanced topics as we go along...