X

Track changes made to this page

If you find this page useful and would like to be notified of changes made to this page, start by inputting your email below.



Privacy policy
Close this window

powered by ChangeDetection

Difference between revisions of "Alberta Technical Information"

Line 31: Line 31:
  
 
(17000 rows inserted)
 
(17000 rows inserted)
 
  
  
Line 144: Line 143:
 
==ab_operator==
 
==ab_operator==
  
Excel file was obtained from Business Associate Codes (ST104A) on http://www.aer.ca/data-and-publications/statistical-reports/st104.  Originally, the address was given in a single column and contained the street number, city, province, and postal code.  These individual components were extracted into separate fields.  Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared.
+
The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104.  Originally, the address was given in a single column and contained the street number, city, province, and postal code.  These individual components were extracted into separate fields.  Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared.
 +
 
 +
 
 +
Excel formulas extracted the address components.  The columns in the Excel sheet were as follows:
 +
 
 +
 
 +
A: BA Code
 +
 
 +
B: Company Name
 +
 
 +
C: Phone
  
The following formulas extracted the address components:
+
D: Address Long
  
 +
E: Address (i.e. house/box number)
  
province: =IF(ISERROR(MID(D2,I2+2,2)),"",MID(D2,I2+2,2)), where D2 was the full address and I2 was the position of the last comma.
+
F: Province
  
 +
G: Postal Code
  
postal code: =IF(ISERROR(MID(D2,I2+5,LEN(D2)-I2-4)),"",MID(D2,I2+5,LEN(D2)-I2-4))
+
H: Country
  
 +
I: Position of Last Comma in D
  
country: =IF(NOT(ISERROR(VLOOKUP($F2, [provinces.xlsx]Sheet1!$B$2:$B$14,1, FALSE))), "Canada", IF(NOT(ISERROR(VLOOKUP($F2, [states.xlsx]Sheet1!$B$2:$B$51,1, FALSE))), "United States", "" ))
+
J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities)
  
where province abbreviations were held in B2:B14 of provinces.xlsx and state abbreviations were held in B2:B51 of states.xlsx.
+
K: City Name Long (only filled for multi-word cities)
  
 +
L: City
  
address: =IF(ISERROR(TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))),"",TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,"")))
 
  
where L2 contained the city/town name. 
+
The formulas were:
  
 +
* Last Comma (I): =IF(ISERROR(SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",",""))))),"",SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",","")))))
  
The last word of the city name was extracted into J2 by:
+
* Province (F): =IF(ISERROR(MID(D2,I2+2,2)),"",MID(D2,I2+2,2))
  
 +
* Postal Code (G): =IF(ISERROR(MID(D2,I2+5,LEN(D2)-I2-4)),"",MID(D2,I2+5,LEN(D2)-I2-4))
  
=IF(ISERROR(RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))),"",RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ",""))))))
+
* Country (H): =IF(NOT(ISERROR(VLOOKUP($F2, [provinces.xlsx]Sheet1!$B$2:$B$14,1, FALSE))), "Canada", IF(NOT(ISERROR(VLOOKUP($F2, [states.xlsx]Sheet1!$B$2:$B$51,1, FALSE))), "United States", "" ))
  
 +
* Last Word Before Province (J): =IF(ISERROR(RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))),"",RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ",""))))))
  
From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2.  The true city name was inserted into L2 by =IF(K2="",J2,K2).
+
* From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2.  The true city name was inserted into L2 by =IF(K2="",J2,K2).
 +
 
 +
* Address (E): =IF(ISERROR(TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))),"",TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))) - removal of all components from the long address to produce the short address
  
  
 
The final Excel file was saved as a CSV.
 
The final Excel file was saved as a CSV.
 +
 +
 +
ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future.
  
  
Line 204: Line 224:
  
 
ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country
 
ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country
 
  
  
Line 221: Line 240:
  
 
Some blank strings where still present, so the following were used as well:
 
Some blank strings where still present, so the following were used as well:
 
  
  
Line 235: Line 253:
  
  
For consistency of data:
+
''For consistency of data:''
  
  

Revision as of 06:57, 14 July 2015

General

Tables are backed up using mysqldump to ~/database_backups/ before starting.

CSVs are copied to the server. Their group and ownership are changed to 'mysql' using the chgrp and chown commands. They are then moved to /var/lib/mysql/wellsdb_dev/ such that they can be read by mysql. Ensure that CSVs are free of special characters (e.g. \n and \) and extra quotation marks ".

ab_pool_codes

The Reserves Sequence Code column was deleted from Excel document because it is not included in our database table. 524 duplicate rows were then deleted and the file saved as a Windows Comma Separated (.csv).


Importing:


CREATE TABLE ab_pool_codes_tmp AS (SELECT * FROM ab_pool_codes where 0=1)


LOAD DATA INFILE 'CommingledPoolList.csv'

INTO TABLE ab_pool_codes_tmp FIELDS TERMINATED BY ','


ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(field_name, production_pool, field_code, production_pool_code, @dummy, geological_pool_code, geological_pool_name, confidential)


(17000 rows inserted)


INSERT INTO ab_pool_codes

SELECT * FROM ab_pool_codes_tmp t2

WHERE (t2.field_name, t2.production_pool, t2.field_code, t2.production_pool_code, t2.geological_pool_code, t2.geological_pool_name, t2.confidential)

NOT IN (SELECT field_name, production_pool, field_code, production_pool_code, geological_pool_code, geological_pool_name, confidential FROM ab_pool_codes)


(17000 rows added - there were no duplicates between the old and new data)


DROP TABLE ab_pool_codes_tmp


Setting blank fields to NULL:


UPDATE ab_pool_codes

SET geological_pool_name=NULL

WHERE geological_pool_name=


UPDATE ab_pool_codes

SET geological_pool_name=NULL

WHERE LENGTH(geological_pool_name)=1


UPDATE ab_pool_codes

SET confidential=NULL

WHERE LENGTH(confidential)=1

ab_facility

Text files were retrieved from http://www.aer.ca/data-and-publications/statistical-reports/st102, opened in Excel, and saved as CSV.


Importing:


CREATE TABLE ab_facility_tmp AS (SELECT * FROM ab_facility where 0=1)


LOAD DATA INFILE 'ActiveFacility.csv'

INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)


LOAD DATA INFILE 'InactiveFacility.csv'

INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)


(39314 + 66767 = 106081 rows inserted)


INSERT INTO ab_facility

SELECT * FROM ab_facility_tmp t2

WHERE (t2.FACILITY_ID, t2.FACILITY_NAME, t2.OPERATOR_CODE, t2.OPERATOR_NAME, t2.SUB_TYPE_CODE, t2.SUB_TYPE, t2.LE, t2.LSD, t2.SEC, t2.TWP, t2.RNG, t2.MER, t2.LICENSE_NUMBER, t2.EDCT_CODE, t2.EDCT_DESCRIPTION, t2.LICENSE_CODE, t2.OPERATIONAL_STATUS)

NOT IN (SELECT FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS FROM ab_facility)


(87923 rows added)


DROP TABLE ab_facility_tmp


Setting blank fields to NULL:


The following query was run to replace blanks with NULLs in all columns except well_id:

UPDATE ab_facility

SET FACILITY_ID=NULL

WHERE LENGTH(FACILITY_ID)=0


Similar operations where performed to replace blank fields with a length of 1 (taking care not to erase any real single digit values). Columns affected were LSD, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, and LICENSE_CODE.

ab_operator

The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared.


Excel formulas extracted the address components. The columns in the Excel sheet were as follows:


A: BA Code

B: Company Name

C: Phone

D: Address Long

E: Address (i.e. house/box number)

F: Province

G: Postal Code

H: Country

I: Position of Last Comma in D

J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities)

K: City Name Long (only filled for multi-word cities)

L: City


The formulas were:

  • Last Comma (I): =IF(ISERROR(SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",",""))))),"",SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",","")))))
  • Province (F): =IF(ISERROR(MID(D2,I2+2,2)),"",MID(D2,I2+2,2))
  • Postal Code (G): =IF(ISERROR(MID(D2,I2+5,LEN(D2)-I2-4)),"",MID(D2,I2+5,LEN(D2)-I2-4))
  • Country (H): =IF(NOT(ISERROR(VLOOKUP($F2, [provinces.xlsx]Sheet1!$B$2:$B$14,1, FALSE))), "Canada", IF(NOT(ISERROR(VLOOKUP($F2, [states.xlsx]Sheet1!$B$2:$B$51,1, FALSE))), "United States", "" ))
  • Last Word Before Province (J): =IF(ISERROR(RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))),"",RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ",""))))))
  • From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2. The true city name was inserted into L2 by =IF(K2="",J2,K2).
  • Address (E): =IF(ISERROR(TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))),"",TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))) - removal of all components from the long address to produce the short address


The final Excel file was saved as a CSV.


ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future.


Importing:


ab_operator_test is an intermediate table used to avoid duplicate key error #1062.


create table ab_operator_test select * from ab_operator where 1=0


LOAD DATA INFILE 'BusinessAssociateCodes_proper_cols.csv'

INTO TABLE ab_operator_test FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(ba_code, company, phone, @dummy, address, province, postal_code, country, @dummy, @dummy, @dummy, city)


INSERT INTO ab_operator (ba_code, company, address, city, province, postal_code, phone, country)

SELECT ba_code, company, address, city, province, postal_code, phone, country FROM ab_operator_test t2

ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country


DROP TABLE ab_operator_test


Setting blank fields to NULL:


UPDATE ab_operator SET city=NULL WHERE city=

The same operation was performed on province, postal_code, phone, and country.


Some blank strings where still present, so the following were used as well:


UPDATE ab_operator

SET city=NULL

WHERE length(city)<2


UPDATE ab_operator SET phone = REPLACE(REPLACE(phone, '\r', ), '\n', ) (followed by a repeat of the initial update statement)


For consistency of data:


update ab_operator

set country='United States'

where country='US'