In this post, we will examine the data that we accessed previously
using R. We repeat the steps from our previous post to load the
RPostgreSQL driver and the data, which in this case is hourly traffic
counts from New York state.
We first connect to the database, run a query to load the data, and
plot the count against the lane.
> library(RPostgreSQL)
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, user = "nydot", password = "nydot", dbname = "nydot", host = "localhost", port = 5432)
> rs <- dbGetQuery(con, "select rcstation, start_time, direction, lane, count from traffic")
> plot(rs$lane, rs$count)
Next, we plot the count against the start_time:
> plot(rs$start_time, rs$count)
This gets us another ridiculous chart, though it does reveal that our
data is spread across many months.
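To put a number on how many months the data covers, one option is to tabulate the rows per calendar month. The sketch below is only an illustration and was not part of the original session: `rs_demo` is a made-up stand-in for `rs`, and it assumes that start_time comes back from the driver as a POSIXct date-time.

```r
# Hypothetical stand-in for the `rs` data frame returned by dbGetQuery();
# the timestamps and counts are invented for illustration.
rs_demo <- data.frame(
  start_time = as.POSIXct(c("2010-01-15 00:00:00", "2010-01-15 01:00:00",
                            "2010-04-02 00:00:00", "2010-11-20 05:00:00"),
                          tz = "UTC"),
  count = c(5, 3, 40, 12)
)

# Reduce each timestamp to a "YYYY-MM" label and count rows per label.
rows_per_month <- table(format(rs_demo$start_time, "%Y-%m"))
print(rows_per_month)
```

On the real data, the same call with `rs` in place of `rs_demo` would list one entry per month that has at least one row.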
To understand whether data is collected for each station throughout
the year, I plot the station against the start time:
> plot(rs$start_time, rs$rcstation)
This shows that the data as a whole is spread over all twelve months
of the year. To understand whether each station has data for the
entire year, I filter the dataset to a single rcstation (360001 in
this case) and plot its start times to see the date range:
> plot(rs[rs$rcstation==360001,]$start_time)
As we can see from the plot, the data for a single station is
actually spread out over just one week.
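Rather than filtering one station at a time, the same check could be run for every station at once. The following is a minimal sketch under stated assumptions: `rs_demo` is an invented stand-in for `rs`, start_time is assumed to be POSIXct, and `span_days` is a name introduced here for illustration.

```r
# Hypothetical stand-in for `rs`; the values are invented for illustration.
rs_demo <- data.frame(
  rcstation  = c(360001, 360001, 360001, 360002, 360002),
  start_time = as.POSIXct(c("2010-03-01 00:00:00", "2010-03-04 12:00:00",
                            "2010-03-07 23:00:00", "2010-06-14 00:00:00",
                            "2010-06-20 23:00:00"), tz = "UTC"),
  count      = c(12, 340, 25, 8, 31)
)

# as.numeric() on POSIXct gives seconds since the epoch, so the span
# between the first and last observation divided by 86400 is in days.
span_days <- tapply(as.numeric(rs_demo$start_time), rs_demo$rcstation,
                    function(t) (max(t) - min(t)) / 86400)
print(span_days)
```

A station whose span is close to seven days was counted for about a week, which is the pattern the single-station plot suggests.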
Another way of looking at the data would be the total count for each
day of the week. First, we would like to check which days the
station's data actually covers. We can use the range function to
check this:
> range(rs[rs$rcstation==360001,]$start_time)
In our case, it returned the earliest and latest start_time recorded
for that station.
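With the date range known, the day-of-week totals mentioned above could be computed along these lines. This is a sketch rather than the code used for this post: `rs_demo` is an invented stand-in for `rs`, start_time is assumed to be POSIXct, and the `dow` column and `daily_total` name are introduced here for illustration.

```r
# Hypothetical stand-in for `rs`; the values are invented for illustration.
# 2010-03-01 was a Monday, 2010-03-02 a Tuesday, 2010-03-07 a Sunday.
rs_demo <- data.frame(
  rcstation  = 360001,
  start_time = as.POSIXct(c("2010-03-01 08:00:00", "2010-03-01 09:00:00",
                            "2010-03-02 08:00:00", "2010-03-07 08:00:00"),
                          tz = "UTC"),
  count      = c(100, 150, 120, 40)
)

one_station <- rs_demo[rs_demo$rcstation == 360001, ]

# "%u" is the ISO weekday number: 1 = Monday ... 7 = Sunday.
# It is used instead of weekdays() because day names depend on the locale.
one_station$dow <- format(one_station$start_time, "%u")

# Sum the hourly counts within each weekday.
daily_total <- tapply(one_station$count, one_station$dow, sum)
print(daily_total)
```

Passing `daily_total` to `barplot()` would turn the totals into a chart, one bar per weekday present in the data.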
We can plot a histogram of the counts for a single count station
using the following command:
> hist(rs[rs$rcstation==360001,]$count)
This displays the following histogram.
That's it; we are done with our first exploration. We will pick up more advanced topics as we go along.