
Tuesday, September 29, 2015

Re-projecting shapefile data during load in PostGIS

I recently came across the need to load data into PostGIS from shapefiles. However, some of the data was in one projection and some in a different one, so I needed to re-project the data at load time.

The process is fairly straightforward. The syntax is

shp2pgsql -s <from_srid>:<to_srid> <shapefile> <schema.table>

Here is an example using real data.

Let's look at the data

gdalsrsinfo nybbwi.prj


As we can see, the projection is NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet, whose EPSG code is 2263. Since I want to load this data, I can use the shp2pgsql command to create the load file.

shp2pgsql -d -s 2263 -g geom -e -I nybbwi nystaging.boroughs > $scriptdir/021_boroughs.sql

The next file was in a different projection. Here I am using a different command, ogrinfo, to get the projection information.

ogrinfo -al -so fatality_yearly.shp

This reveals that the projection is lat/long (WGS 84), EPSG code 4326


As explained previously, we can use the -s option to re-project the data. Here are the commands

shp2pgsql -d -s 4326:2263 -g geom -e -I fatality_yearly nystaging.fatality > $scriptdir/023_fatality_yearly.sql

shp2pgsql -d -s 4326:2263 -g geom -e -I injury_yearly nystaging.injury > $scriptdir/024_injury_yearly.sql


That's it...

Here is the complete script...

# 04_load_spatial_nyc_data.sh

export datadir=../data/shpfiles
export scriptdir=../../scripts

export curdir=~/Work/Transportation/NYC/scripts
cd $curdir

export PGUSER=nyc
export PGPASSWORD=nyc
export PGDATABASE=nyc


echo $datadir
echo $scriptdir

cd $datadir

gdalsrsinfo nybbwi.prj
# EPSG code for NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet is 2263
shp2pgsql -d -s 2263 -g geom -e -I nybbwi nystaging.boroughs > $scriptdir/021_boroughs.sql

gdalsrsinfo PedCountLocationsMay2015.prj
ogrinfo -al -so PedCountLocationsMay2015.shp
shp2pgsql -d -s 2263 -g geom -e -I PedCountLocationsMay2015 nystaging.pedcounts > $scriptdir/022_pedcounts.sql

ogrinfo -al -so fatality_yearly.shp
shp2pgsql -d -s 4326:2263 -g geom -e -I fatality_yearly nystaging.fatality > $scriptdir/023_fatality_yearly.sql
shp2pgsql -d -s 4326:2263 -g geom -e -I injury_yearly nystaging.injury > $scriptdir/024_injury_yearly.sql

cd $scriptdir

psql -f 021_boroughs.sql
psql -f 022_pedcounts.sql
psql -f 023_fatality_yearly.sql
psql -f 024_injury_yearly.sql


Monday, September 7, 2015

Connecting to PostGIS from R

In this post, we will connect to spatial datasets loaded in PostGIS from within R.

First, I have set environment variables for connecting to PostgreSQL. These are as follows

export PGUSER=nyc
export PGPASSWORD=nyc
export PGDATABASE=nyc

Next, we will launch R and load the required library, rgdal. If you don't have rgdal installed, please refer to the initial steps of my earlier post.


Now let's get started
# 05a_connect_ped_postgis.r
# Load required library
library(rgdal)

This loads the library


Next, we read the PostgreSQL connection parameters from the shell environment using the Sys.getenv command

#Set PostgreSQL connection parameters from the environment
pguser<-Sys.getenv('PGUSER')
pgpassword<-Sys.getenv('PGPASSWORD')
pgdatabase<-Sys.getenv('PGDATABASE')


Next, we create the database connection string by concatenating all the variables using the paste function in R, and we print it to confirm we have it right.

#Make connection to PostGIS database
dsn<-paste("PG:dbname='",pgdatabase,"' user='",pguser,"' password='",pgpassword,"'",sep='')
dsn
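With the nyc values exported earlier, the assembled string comes out in the connection-string format GDAL's PG driver expects. As a quick illustration, here is a shell mirror of the paste() call above (the nyc values are the ones set in this post's environment):

```shell
# Mirror of the R paste() call: assemble the GDAL PostgreSQL DSN
PGUSER=nyc; PGPASSWORD=nyc; PGDATABASE=nyc
dsn="PG:dbname='${PGDATABASE}' user='${PGUSER}' password='${PGPASSWORD}'"
echo "$dsn"    # PG:dbname='nyc' user='nyc' password='nyc'
```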


Now we can use the ogrListLayers command to make sure we can see our previously loaded data.

ogrListLayers(dsn)


The command connects to the database and lists all layers registered with PostGIS. We can also look up the details of a specific layer of our choice using the ogrInfo function

ogrInfo(dsn, "nystaging.fatality")


Here is the complete listing..

# 05a_connect_ped_postgis.r
# Load required library
library(rgdal)

#Set PostgreSQL connection parameters from the environment
pguser<-Sys.getenv('PGUSER')
pgpassword<-Sys.getenv('PGPASSWORD')
pgdatabase<-Sys.getenv('PGDATABASE')

#Make connection to PostGIS database
dsn<-paste("PG:dbname='",pgdatabase,"' user='",pguser,"' password='",pgpassword,"'",sep='')
dsn
ogrListLayers(dsn)
ogrInfo(dsn, "nystaging.fatality")

Saturday, September 5, 2015

Installing Tiger Geocoder on PostGIS 2.1 on Ubuntu using a single shell script

In this post, I am going to walk through the steps of installing the Tiger Geocoder on Ubuntu. These are the steps that worked for me, collated from multiple sources and forum postings. My intent was to create one script that does everything from start to finish (included at the end); that took a while to perfect.

In this post, I will walk through the script so you can modify it for your environment. Let's get started.

The first step is to set some environment variables that we will reuse, so we don't have to enter them over and over again. As you can see, I plan to install into a database called tiger; geocoder seems to be another popular alternative.

export PGUSER=arthgallo
export PGPASSWORD=*****
export PGDATABASE=tiger

The next step is to enter the states you want to load data for. Enter them in the format below

#Each state needs to be quoted with single quotes. The following construct
#preserves this arrangement
export geostates="'"NY"'","'"MA"'"
echo $geostates
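The double-and-single-quote sandwich above is what preserves the literal single quotes in the variable, so that it can later be dropped unchanged into a SQL ARRAY[...] literal. A quick check of what it expands to:

```shell
# Each "'"XX"'" chunk concatenates to 'XX', single quotes included
geostates="'"NY"'","'"MA"'"
echo "$geostates"          # 'NY','MA'
echo "ARRAY[$geostates]"   # ARRAY['NY','MA'] -- the form the psql command receives
```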

Next we create the database

createdb tiger

Now let's install the appropriate extensions, including PostGIS. If PostGIS is not installed, or is not the latest version, please refer to my previous posts. Once we install the PostGIS Tiger Geocoder extension, we also check that an address can be normalized; running address normalization only validates that the extension is installed. We are still not done, as setting up the actual geocoder involves downloading a lot of data, and it takes a fair bit of time depending on your internet connection and processing speed.

psql -c "CREATE EXTENSION postgis"
psql -c "CREATE EXTENSION postgis_topology"
psql -c "CREATE EXTENSION fuzzystrmatch"
psql -c "DROP EXTENSION postgis_tiger_geocoder CASCADE"
psql -c "CREATE EXTENSION postgis_tiger_geocoder"

psql -c "SELECT na.address, na.streetname,na.streettypeabbrev, na.zip
 FROM normalize_address('1 Devonshire Place, Boston, MA 02109') AS na"

Next, we need to set up the automated process that downloads the Tiger data from the Census website and deploys the appropriate geocoders. I have created a SQL script file that does some of the work: it makes the appropriate entries in the loader table, set up according to my environment.

As you can see, I have commented out the PGPORT, PGHOST, PGUSER and PGPASSWORD entries, since I had already set the user at the beginning of the script and want to continue with the same.

Please modify it to match your environment.

-- tiger_loader_platform.sql
INSERT INTO tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep, 
loader, environ_set_command, county_process_command)
 SELECT 'trusty', 'TMPDIR="${staging_fold}/temp/"
 UNZIPTOOL=unzip                   
 WGETTOOL=/usr/bin/wget          
 export PGBIN=/usr/bin   
 # export PGPORT=5432                
 # export PGHOST=localhost           
 # export PGUSER=postgres            
 # export PGPASSWORD=yourpasswordhere
 export PGDATABASE=tiger        
 PSQL=${PGBIN}/psql                
 SHP2PGSQL=${PGBIN}/shp2pgsql      
 cd ${staging_fold}', pgbin, wget, unzip_command, psql, path_sep, 
 loader, environ_set_command, county_process_command
 FROM tiger.loader_platform
 WHERE os = 'sh';

Next we will execute the script to create appropriate entries.

psql  -f tiger_loader_platform.sql

We can test that everything is inserted correctly..

psql -c "select declare_sect from tiger.loader_platform where os='trusty'"


Next, we create a folder to stage the data downloaded from the Census FTP site. Please make sure you have enough disk space on your machine: my rule of thumb is 1-2 GB per state you want to set up.

We create a folder called gisdata at the root. This is important, since the out-of-the-box geocoder install scripts all point to this folder.

# script needs gisdata folder
sudo mkdir /gisdata

We will also need a temp folder inside the gisdata folder.

# and a temp folder inside it
sudo mkdir /gisdata/temp

Next we need to make this folder accessible to the postgres user.

# Need to make it accessible to the postgres user
sudo chmod -R 777 /gisdata

Now let's restart PostgreSQL before the fun begins.

#Restart postgres service
sudo service postgresql restart

The first step is to generate the nation load script using the loader_generate_nation_script function. I skipped this step initially and had to repeat the process multiple times (many wasted hours) before figuring it out. Since this is a SQL command, we need to make sure the output is clean enough to execute as a shell script, hence the extra formatting flags on psql. By default, psql puts the column name in the output; I aliased it to 'aaaaa' so I could strip it out later.


#Run the nation script
psql -c "SELECT loader_generate_nation_script('trusty') AS aaaaa" \
-t -x -A -F "" -q -o 02a_load_nation_tiger.sh

Now that the script is generated, we need to replace aaaaa with nothing to turn it into a proper shell script. We also need to make it executable, and then execute it...

#Replace the text
sed -i 's/aaaaa//g' 02a_load_nation_tiger.sh
chmod a+x ./02a_load_nation_tiger.sh
sudo su postgres ./02a_load_nation_tiger.sh
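The sed cleanup can be sanity-checked on a throwaway file; the content below is a hypothetical stand-in for the generated loader script, not its real contents:

```shell
# Simulate a generated file whose first line carries the dummy column alias
tmpfile=$(mktemp)
printf 'aaaaa\nTMPDIR="/gisdata/temp/"\n' > "$tmpfile"
sed -i 's/aaaaa//g' "$tmpfile"   # the alias is stripped; only a blank line remains
result=$(cat "$tmpfile")
rm -f "$tmpfile"
```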

Next comes the actual script for loading specific states. I just needed data for New York and Massachusetts; you can modify this to match your needs.


#Now generate the loader script
psql -c "SELECT loader_generate_script(ARRAY[$geostates], 'trusty') AS aaaaa" -t -x -A -F "" -q -o 02b_load_ny_tiger.sh

Again, we need to replace the dummy text with blanks using sed

#Replace dummy text with blanks
sed -i 's/aaaaa//g' 02b_load_ny_tiger.sh

We need to make the file executable and then run it. Beware: this process takes a while.

#Execute the load_ny_tiger file we generated
chmod a+x ./02b_load_ny_tiger.sh
sudo su postgres ./02b_load_ny_tiger.sh

The previous step takes a few hours to run, depending on how much data needs to be downloaded. Once it completes, we run the following command to install any missing indexes.

psql -c "SELECT install_missing_indexes()" 


Now that everything is set up, we can run individual queries. The first step is to normalize an address

psql -c "SELECT '30 Rock' As event, n.* FROM normalize_address('45 Rockefeller Plaza, New York, NY 10111') As n"

Here is the output


Next the real test... We try to geocode an address.


psql -c "SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat, (addy).address As stno, (addy).streetname As street, 
(addy).streettypeabbrev As styp, (addy).location As city, (addy).stateabbrev As st,(addy).zip FROM geocode('45 Rockefeller Plaza, New York, NY 10111') As g"

Perfect, we found some matches as shown in the screenshot below



Next an intersection geocode

psql -c "SELECT pprint_addy(addy), st_astext(geomout),rating FROM geocode_intersection( 'Fort Hamilton Pkwy','50th Street', 'NY', 'Brooklyn','11219',1)" 


Here is the result


Next, without the ZIP code

psql -c "SELECT pprint_addy(addy), st_astext(geomout),rating FROM geocode_intersection( 'Fort Hamilton Pkwy','50th Street', 'NY', 'Brooklyn')" 

Here is the result


Not bad...  Here are the complete SQL and shell scripts.

First the SQL script

-- tiger_loader_platform.sql
INSERT INTO tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep, 
loader, environ_set_command, county_process_command)
 SELECT 'trusty', 'TMPDIR="${staging_fold}/temp/"
 UNZIPTOOL=unzip                   
 WGETTOOL=/usr/bin/wget          
 export PGBIN=/usr/bin   
 # export PGPORT=5432                
 # export PGHOST=localhost           
 # export PGUSER=postgres            
 # export PGPASSWORD=yourpasswordhere
 export PGDATABASE=tiger        
 PSQL=${PGBIN}/psql                
 SHP2PGSQL=${PGBIN}/shp2pgsql      
 cd ${staging_fold}', pgbin, wget, unzip_command, psql, path_sep, 
 loader, environ_set_command, county_process_command
 FROM tiger.loader_platform
 WHERE os = 'sh';

Next the shell script


# 02_install_tiger_geocoder.sh

export PGUSER=arthgallo
export PGPASSWORD=*****
export PGDATABASE=tiger

#Each state needs to be quoted with single quotes. The following construct
#preserves this arrangement
export geostates="'"NY"'","'"MA"'"
echo $geostates

createdb tiger
 
psql -c "CREATE EXTENSION postgis"
psql -c "CREATE EXTENSION postgis_topology"
psql -c "CREATE EXTENSION fuzzystrmatch"
psql -c "DROP EXTENSION postgis_tiger_geocoder CASCADE"
psql -c "CREATE EXTENSION postgis_tiger_geocoder"

psql -c "SELECT na.address, na.streetname,na.streettypeabbrev, na.zip
 FROM normalize_address('1 Devonshire Place, Boston, MA 02109') AS na"

psql -c "delete from tiger.loader_platform where os='trusty'"

psql  -f tiger_loader_platform.sql

psql -c "select declare_sect from tiger.loader_platform where os='trusty'"

# script needs gisdata folder
sudo mkdir /gisdata

# and a temp folder inside it
sudo mkdir /gisdata/temp

# Need to make it accessible to the postgres user
sudo chmod -R 777 /gisdata

#Restart postgres service
sudo service postgresql restart

#Run the nation script
psql -c "SELECT loader_generate_nation_script('trusty') AS aaaaa" \
-t -x -A -F "" -q -o 02a_load_nation_tiger.sh

#Replace the text
sed -i 's/aaaaa//g' 02a_load_nation_tiger.sh
chmod a+x ./02a_load_nation_tiger.sh
sudo su postgres ./02a_load_nation_tiger.sh

#Now generate the loader script
psql -c "SELECT loader_generate_script(ARRAY[$geostates], 'trusty') AS aaaaa" -t -x -A -F "" -q -o 02b_load_ny_tiger.sh

#Replace dummy text with blanks
sed -i 's/aaaaa//g' 02b_load_ny_tiger.sh

#Execute the load_ny_tiger file we generated
chmod a+x ./02b_load_ny_tiger.sh
sudo su postgres ./02b_load_ny_tiger.sh

psql -c "SELECT install_missing_indexes()" 

#Test address normalization
psql -c "SELECT '30 Rock' As event, n.* FROM normalize_address('45 Rockefeller Plaza, New York, NY 10111') As n"

#Test simple geocode
psql -c "SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat, (addy).address As stno, (addy).streetname As street, 
(addy).streettypeabbrev As styp, (addy).location As city, (addy).stateabbrev As st,(addy).zip FROM geocode('45 Rockefeller Plaza, New York, NY 10111') As g"

#Test intersection geocode
psql -c "SELECT pprint_addy(addy), st_astext(geomout),rating FROM geocode_intersection( 'Fort Hamilton Pkwy','50th Street', 'NY', 'Brooklyn','11219',1)" 

#Test intersection geocode without zip
psql -c "SELECT pprint_addy(addy), st_astext(geomout),rating FROM geocode_intersection( 'Fort Hamilton Pkwy','50th Street', 'NY', 'Brooklyn')"

Saturday, August 29, 2015

Upgrading PostgreSQL and PostGIS from 9.3 to 9.4

In this post, we will walk through the steps of upgrading from PostgreSQL 9.3 to 9.4. It all happened because my PostGIS installation had become corrupted.

When I executed the command to check my PostGIS version, I got an error saying the liblwgeom library could not be loaded

su - postgres -c "psql -c 'SELECT PostGIS_full_version()'"

gave the following error

ERROR:  could not load library "/usr/lib/postgresql/9.3/lib/postgis-2.1.so": liblwgeom-2.1.2.so: cannot open shared object file: No such file or directory


After much research and browsing through forums, I realized it was better to upgrade PostgreSQL and PostGIS. I checked the PostgreSQL version with the following command to make sure I needed to do that.

su - postgres -c "psql --version"

I found a great post on StackExchange from someone who had shared their script. With minor changes, the script worked well. Here is the modified script in its entirety.

I executed the following step by step to make sure I could see what was happening...

su - postgres -c "psql -c 'SELECT PostGIS_full_version()'"
su - postgres -c "psql --version"


# Be sure to save your config files.
sudo cp /etc/postgresql/9.3/main/postgresql.conf ~
sudo cp /etc/postgresql/9.3/main/pg_hba.conf ~

# Package repo (for apt-get). I have 14.04 LTS, so I selected trusty-pgdg. Please change to match your version
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main" >> /etc/apt/sources.list.d/postgresql.list'

# Also probably optional but I like to update sources and upgrade
sudo apt-get update
sudo apt-get upgrade

# Install postgres 9.4
sudo apt-get install postgresql-9.4 postgresql-server-dev-9.4 postgresql-contrib-9.4

# Get latest postgis for 9.4 (optional)
sudo apt-get install postgresql-9.4-postgis

# dump your data
sudo su postgres
cd ./data/dumps
/usr/lib/postgresql/9.4/bin/pg_dumpall > pre_upgrade_from_9.3_to_9.4.dump

# I got this error:
# pg_dump: [archiver (db)] query failed: ERROR:  could not load library "/usr/lib/postgresql/9.3/lib/postgis-2.1.so": 
# liblwgeom-2.1.2.so: cannot open shared object file: No such file or directory

#This is the elusive fix to the problem
sudo apt-get install liblwgeom-2.1.3

# Then repeated
sudo su postgres
cd ./data/dumps
/usr/lib/postgresql/9.4/bin/pg_dumpall > pre_upgrade_from_9.3_to_9.4.dump

# Make a data dir for Postgres 9.4
sudo mkdir -p /data/postgres/9.4/main
sudo chown -R postgres:postgres /data/postgres

# Change the 9.4 conf file's data dir to point to /data/postgres/9.4/main
sudo gedit /etc/postgresql/9.4/main/postgresql.conf

# Install 9.4 cluster
sudo /etc/init.d/postgresql stop

# May need to restart machine if could not stop both postgreSQL processes
sudo pg_dropcluster 9.4 main
sudo pg_createcluster -d /data/postgres/9.4/main 9.4 main

# start 9.4 explicitly
sudo /etc/init.d/postgresql start 9.4

# Restore: Make sure to use the 9.4 version of psql
psql -d postgres -p 5433 -f ~/data/dumps/pre_upgrade_from_9.3_to_9.4.dump

# Change port back to 5432 (optional) and the confs back to what they were! (reference previously copied files)
sudo gedit /etc/postgresql/9.4/main/postgresql.conf

# Changed local   all   all   peer -> local   all   all   md5
sudo gedit /etc/postgresql/9.4/main/pg_hba.conf

sudo service postgresql restart 9.4

# Verify your data was properly imported

# Drop old cluster
sudo pg_dropcluster --stop 9.3 main


# Analyze
sudo service postgresql start
psql
\c your_database
ANALYZE;

On restarting, I still got two PostgreSQL instances.

We need to remove the old PostgreSQL version using the commands below...

#Delete previous postgres
sudo apt-get purge postgresql-9.3
sudo service postgresql restart 9.4


After the restart, it shows only one service starting..


Now we can verify that PostGIS is correctly installed

su - postgres -c "psql -c 'SELECT PostGIS_full_version()'"


This shows the correct versions, and everything should be fine. Finally, to check the versions of other extensions as well, we can use the following command

psql -c "SELECT name, default_version,installed_version FROM pg_available_extensions WHERE name LIKE 'postgis%'"


Looks like everything has been upgraded correctly.

Tuesday, August 25, 2015

Plotting R data on an Interactive Google map

In this post, we will see how to plot data on an interactive map using R. We will use Google maps as the base map. Here is the output



You can also switch to street view


Let's get started. Make sure you have installed the required libraries. The new library needed is googleVis, which can be installed by launching R in sudo mode and entering the following commands

install.packages("googleVis")
q()



The first step is to download the data. This previous post shows how to obtain the spatial data used in this example from the NYC website. Now launch R in regular mode and enter the following code to create the map.

First we need to load the required libraries
# Load the needed libraries
library(sp)  # classes for spatial data
library(maptools)
library(rgeos) # and their dependencies
library(rgdal)
library(ggplot2)
library(ggmap)
library(scales)
library(gmodels)
library(googleVis)

Next, we set the working directory where our scripts are located and declare a variable pointing to the folder where our data is located.
# Set default paths for data
setwd("~/Work/Transportation/NYC/scripts")
shpfilepath="../data/shpfiles"

Next, we read the shapefiles with the data. We have two shapefiles from NYC: the first contains only fatal accidents, and the second contains injury accidents. We will later combine these files.
# Read injury and fatality statistics
ogrInfo(shpfilepath,"fatality_yearly")
ogrInfo(shpfilepath,"injury_yearly")
ftlty<-readOGR(shpfilepath, "fatality_yearly")
injry<-readOGR(shpfilepath, "injury_yearly")
names(ftlty); dim(ftlty)
names(injry); dim(injry)

Next, we rename the columns so that they are identical across the two datasets. The following step shows how to do that. We also add a new variable that classifies each incident as an injury or a fatality.
#Classify severity and rename columns to be identical in fatalities and injury datasets
ftlty$Severity<-"Fatality"
names(ftlty)[names(ftlty)=="Fatalities"]<-"Total"
names(ftlty)[names(ftlty)=="PedFatalit"]<-"Ped"
names(ftlty)[names(ftlty)=="BikeFatali"]<-"Bike"
names(ftlty)[names(ftlty)=="MVOFatalit"]<-"MVO"
injry$Severity<-"Injury"
names(injry)[names(injry)=="Injuries"]<-"Total"
names(injry)[names(injry)=="PedInjurie"]<-"Ped"
names(injry)[names(injry)=="BikeInjuri"]<-"Bike"
names(injry)[names(injry)=="MVOInjurie"]<-"MVO"
names(injry);names(ftlty)

Now we can combine the two datasets to create one single dataset called acc.

#Combine datasets
acc<- rbind(ftlty, injry)
dim(acc);names(acc);head(acc)

We need to use the fortify command to make sure the coordinate information can be accessed by spatial functions.

#Now fortify the accident data
acc.f<-fortify(as.data.frame(acc))
dim(acc.f);names(acc.f);head(acc.f)

We only want to plot 2015 data. The following command filters only the 2015 records from the data.

#Subsetting only 2015 accident records
acc_2015<-acc.f[which(acc.f$YR==2015),] 

Now we are ready to plot this on the map. The gvisMap function expects a single field with latitude and longitude in lat:long format, separated by a colon. We don't need more than 4-digit precision, so we round the coordinates and use the paste command to concatenate everything into a single string.

#Add a single column "LatLong" that is of the format lat:long for the accident location
acc_2015$LatLong<-paste(round(acc_2015$coords.x2,4),round(acc_2015$coords.x1,4),sep=":")
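The same round-and-join step, mirrored in shell with an illustrative coordinate pair (hypothetical values, not taken from the dataset), shows the lat:long shape gvisMap wants:

```shell
# Round each coordinate to 4 decimal places and join them with a colon
lat=40.758895; lon=-73.985131    # illustrative values only
latlong=$(printf '%.4f:%.4f' "$lat" "$lon")
echo "$latlong"    # 40.7589:-73.9851
```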

The gvisMap function also expects a data field that is displayed as a map tip. We concatenate all available fields, separated by HTML break tags, into a single field that pops up when we hover over or click on an icon.

#Add a single column "Tip" for Accident descriptions separated by html breaks ("<br/>")
acc_2015$Tip<-paste("Type: ",acc_2015$Severity,"Total: ",acc_2015$Total,"Ped: ",acc_2015$Ped,"Bike: ",acc_2015$Bike,"MVO: ",acc_2015$MVO,sep="<br/>")
head(acc_2015)

Finally, we call the gvisMap function, passing the dataset, the name of the concatenated lat:long column, and the name of the map-tip column, and set several defaults for the Google map. Calling the plot function starts an embedded HTTP server and opens the default browser on your machine to display the generated HTML page.

#Plot 2015 accidents on Google
map1<-gvisMap(acc_2015, "LatLong", "Tip", 
              options=list(showTip=TRUE, showLine=F, enableScrollWheel=TRUE, 
                           mapType='roads', useMapTypeControl=TRUE, width=800,height=1024))
plot(map1)

That's it... We are done. Here is the output


Here is the complete source code
# Load the needed libraries
library(sp)  # classes for spatial data
library(maptools)
library(rgeos) # and their dependencies
library(rgdal)
library(ggplot2)
library(ggmap)
library(scales)
library(gmodels)
library(googleVis)

# Set default paths for data
setwd("~/Work/Transportation/NYC/scripts")
shpfilepath="../data/shpfiles"

# Read injury and fatality statistics
ogrInfo(shpfilepath,"fatality_yearly")
ogrInfo(shpfilepath,"injury_yearly")
ftlty<-readOGR(shpfilepath, "fatality_yearly")
injry<-readOGR(shpfilepath, "injury_yearly")
names(ftlty); dim(ftlty)
names(injry); dim(injry)

#Classify severity and rename columns to be identical in fatalities and injury datasets
ftlty$Severity<-"Fatality"
names(ftlty)[names(ftlty)=="Fatalities"]<-"Total"
names(ftlty)[names(ftlty)=="PedFatalit"]<-"Ped"
names(ftlty)[names(ftlty)=="BikeFatali"]<-"Bike"
names(ftlty)[names(ftlty)=="MVOFatalit"]<-"MVO"
injry$Severity<-"Injury"
names(injry)[names(injry)=="Injuries"]<-"Total"
names(injry)[names(injry)=="PedInjurie"]<-"Ped"
names(injry)[names(injry)=="BikeInjuri"]<-"Bike"
names(injry)[names(injry)=="MVOInjurie"]<-"MVO"
names(injry);names(ftlty)

#Combine datasets
acc<- rbind(ftlty, injry)
dim(acc);names(acc);head(acc)

#Now fortify the accident data
acc.f<-fortify(as.data.frame(acc))
dim(acc.f);names(acc.f);head(acc.f)

#Subsetting only 2015 accident records
acc_2015<-acc.f[which(acc.f$YR==2015),] 

#Add a single column "LatLong" that is of the format lat:long for the accident location
acc_2015$LatLong<-paste(round(acc_2015$coords.x2,4),round(acc_2015$coords.x1,4),sep=":")

#Add a single column "Tip" for Accident descriptions separated by html breaks ("<br/>")
acc_2015$Tip<-paste("Type: ",acc_2015$Severity,"Total: ",acc_2015$Total,"Ped: ",acc_2015$Ped,"Bike: ",acc_2015$Bike,"MVO: ",acc_2015$MVO,sep="<br/>")
head(acc_2015)

#Plot 2015 accidents on Google
map1<-gvisMap(acc_2015, "LatLong", "Tip", 
              options=list(showTip=TRUE, showLine=F, enableScrollWheel=TRUE, 
                           mapType='roads', useMapTypeControl=TRUE, width=800,height=1024))
plot(map1)

Steps for installing dismo in R

Steps for installing dismo in your R distribution are fairly straightforward.

First, launch R using sudo, so the package can be installed to the shared library location

sudo R




Next, enter the following command in R prompt

install.packages("dismo")



By default, R prompts for HTTPS mirrors. For some reason, the HTTPS mirrors gave me errors, so select the option for HTTP mirrors at the bottom of the list, as shown below.


Next, choose your preferred mirror from the list of http mirrors, as shown below


Click OK to install. R will download the packages and install them as necessary.



Once the installation has finished, load the library using the following command and we are good to go.

library(dismo)



That's it...

Saturday, August 22, 2015

Performing Spatial Overlay and CrossTab in R

In this post, we will look at spatial overlay in R using a simple example.

Here are the results we will produce



In previous posts, we had seen how to render multiple layers in R using ggplot2. This post will use the same dataset but for a different purpose.

First step is to download the data. This previous post shows how to obtain the spatial data used in this example from the NYC website.

Launch R in a terminal and load the required libraries. I have covered installing these libraries in previous posts: this post describes installing rgdal and associated packages, and this post describes installing the gmodels package.


# Load required libraries
library(rgdal)
library(ggplot2)
library(ggmap)
library(rgeos)
library(maptools)
library(scales)
library(gmodels)


Next we set the default paths for the data. Please modify this accordingly for your environment

# Set default paths for data
setwd("~/Work/Transportation/NYC/scripts")
shpfilepath="../data/shpfiles"

Now we can read the shapefiles into our workspace memory for analysis

# Read injury and fatality statistics
ogrInfo(shpfilepath,"fatality_yearly")
ogrInfo(shpfilepath,"injury_yearly")
ftlty_yr<-readOGR(shpfilepath, "fatality_yearly")
injry_yr<-readOGR(shpfilepath, "injury_yearly")
names(ftlty_yr); dim(ftlty_yr)
names(injry_yr); dim(injry_yr)

Successful reading will show the following results.


As we can see from the names output, the data does not yet have the spatial coordinates attached. We use the fortify command to attach the spatial coordinates to the data we have.

# Add the spatial data to the data frame
ftlty.f <-fortify(as.data.frame(ftlty_yr))
names(ftlty.f);dim(ftlty.f)

Using the fortify command attaches the spatial data to our previous structure as can be seen in the console output.



Next we look at the New York City Borough boundaries that we need to use.

#Load boroughs data
ogrInfo(shpfilepath,"nybbwi")
bbnd <- readOGR(shpfilepath,"nybbwi")


As we can see, the borough layer uses a Lambert Conformal Conic coordinate reference system, which means the fatality data and the boundaries will not coincide. We will use the spTransform function to transform the boundaries from Lambert Conformal to lat/long.

#Transform the coordinates from Lambert Conformal to Lat Long
CRS.latlong<-CRS("+proj=longlat +datum=WGS84 +no_defs")
bbnd<-spTransform(bbnd, CRS.latlong)
summary(bbnd)


As we can see, the data has been transformed. We also need to fortify the boundary data to attach the spatial information.

# Add the spatial data back to the boundary data.frame
bbnd.f <-fortify(bbnd, region="BoroCode")
names(bbnd.f);head(bbnd.f);dim(bbnd.f)



Now that we have the geometric information attached, the number of rows increases to include each part of each geometric record. We need to re-attach the attribute information for each geometric part using the merge command.

# Combine the geometry and data back together using common keys
bbnd.f<-merge(bbnd.f,bbnd@data, by.x="id", by.y="BoroCode")
names(bbnd.f);head(bbnd.f);dim(bbnd.f)



As we can see, the information has been attached, and we now have 7 + 3 = 10 columns of data for the polygon dataset. Now for the overlay: we will attach to each fatality record the borough it falls in. The over() function allows that. We will take this information and add it to two new variables in the ftlty.f data frame; R will take care of altering the table's structure to accommodate the new data points.

#Code each record in fatality with the borough it falls under and add to new variable
ftlty.f$BoroCode <-as.data.frame(over(ftlty_yr, bbnd))$BoroCode
ftlty.f$BoroName <-as.data.frame(over(ftlty_yr, bbnd))$BoroName
names(ftlty.f);head(ftlty.f);dim(ftlty.f)



As we can see, the new information has been attached: each accident record has been coded with the borough it falls in. We can in fact use the following code to plot it on a map.

#Plot accidents coded by the borough they fall in
fx<-ftlty.f$coords.x1
fy<-ftlty.f$coords.x2
Map <- ggplot(data=ftlty.f, aes(fx,fy,colour=factor(BoroCode)))
Map <- Map + geom_point()
Map

The following map is generated


Next, we will use the CrossTable function to look at the number of incidents that involved one or more fatalities. We can instead also build a table to summarize the number of fatalities by borough and year.

#Build a cross table for Borough name and Year with count of fatal incidents
ftlctab<-CrossTable(ftlty.f$BoroName,ftlty.f$YR,expected=FALSE)
ftlctab$t

We can also build a cross table of the number of fatalities using the xtabs function.

#Build a cross table for Borough name and Year with fatalities
ftlctab2<-xtabs(ftlty.f$Fatalities ~  ftlty.f$BoroName+ftlty.f$YR,ftlty.f)
ftlctab2

We can actually compare the results in the output above to see the differences.


We can also plot the above as barplots using the barplot command.

#Plot this on a barplot to see the trends for Incidents with fatalities
barplot(ftlctab$t 
 ,col=c("red","orange","yellow","green","blue", "darkblue","grey")
 ,names.arg=levels(ftlty.f$YR)
 ,legend=rownames(ftlctab$t)
)


We could also plot the sum of fatalities, rather than fatal incidents, by using the ftlctab2 variable.

#Plot this on a barplot to see the trends for Fatalities
barplot(ftlctab2
 ,col=c("red","orange","yellow","green","blue", "darkblue","grey")
 ,names.arg=levels(ftlty.f$YR)
 ,legend=levels(ftlty.f$BoroName)
)



That's it. Here is the complete source code

# Load required libraries
library(rgdal)
library(ggplot2)
library(ggmap)
library(rgeos)
library(maptools)
library(scales)
library(gmodels)

# Set default paths for data
setwd("~/Work/Transportation/NYC/scripts")
shpfilepath="../data/shpfiles"

# Read injury and fatality statistics
ogrInfo(shpfilepath,"fatality_yearly")
ogrInfo(shpfilepath,"injury_yearly")
ftlty_yr<-readOGR(shpfilepath, "fatality_yearly")
injry_yr<-readOGR(shpfilepath, "injury_yearly")
names(ftlty_yr); dim(ftlty_yr)
names(injry_yr); dim(injry_yr)

# Add the spatial data to the data frame
ftlty.f <-fortify(as.data.frame(ftlty_yr))
names(ftlty.f);dim(ftlty.f)

#Load boroughs data
ogrInfo(shpfilepath,"nybbwi")
bbnd <- readOGR(shpfilepath,"nybbwi")

#Transform the coordinates from Lambert Conformal to Lat Long
CRS.latlong<-CRS("+proj=longlat +datum=WGS84 +no_defs")
bbnd<-spTransform(bbnd, CRS.latlong)

# Add the spatial data back to the boundary data.frame
bbnd.f <-fortify(bbnd, region="BoroCode")
names(bbnd.f);head(bbnd.f);dim(bbnd.f)

# Combine the geometry and data back together using common keys
bbnd.f<-merge(bbnd.f,bbnd@data, by.x="id", by.y="BoroCode")
names(bbnd.f);head(bbnd.f);dim(bbnd.f)

#Code each record in fatality with the borough it falls under and add to new variable
ftlty.f$BoroCode <-as.data.frame(over(ftlty_yr, bbnd))$BoroCode
ftlty.f$BoroName <-as.data.frame(over(ftlty_yr, bbnd))$BoroName
names(ftlty.f);head(ftlty.f);dim(ftlty.f)

#Plot accidents coded by the borough they fall in
fx<-ftlty.f$coords.x1
fy<-ftlty.f$coords.x2
Map <- ggplot(data=ftlty.f, aes(fx,fy,colour=factor(BoroCode)))
Map <- Map + geom_point()
Map

#Build a cross table for Borough name and Year with count of fatal incidents
ftlctab<-CrossTable(ftlty.f$BoroName,ftlty.f$YR,expected=FALSE)
ftlctab$t

#Build a cross table for Borough name and Year with fatalities
ftlctab2<-xtabs(ftlty.f$Fatalities ~  ftlty.f$BoroName+ftlty.f$YR,ftlty.f)
ftlctab2

#Plot this on a barplot to see the trends for Incidents with fatalities
barplot(ftlctab$t 
 ,col=c("red","orange","yellow","green","blue", "darkblue","grey")
 ,names.arg=levels(ftlty.f$YR)
 ,legend=rownames(ftlctab$t)
)

#Plot this on a barplot to see the trends for Fatalities
barplot(ftlctab2
 ,col=c("red","orange","yellow","green","blue", "darkblue","grey")
 ,names.arg=levels(ftlty.f$YR)
 ,legend=levels(ftlty.f$BoroName)
)

Installing gmodels library in R

Steps for installing gmodels in your R distribution are fairly straightforward.

First, launch R using sudo, so the package can be installed to the shared library location

sudo R




Next, enter the following command in R prompt

install.packages("gmodels")



By default, R prompts for HTTPS mirrors. For some reason, the HTTPS mirrors gave me errors, so select the option for HTTP mirrors at the bottom of the list, as shown below.


Next, choose your preferred mirror from the list of http mirrors, as shown below


Click OK to install. R will download the packages and install them as necessary.



Once the installation has finished, load the library using the following command and we are good to go.

library(gmodels)



That's it...