If you find this page useful and would like to be notified of changes made to this page, start by inputting your email below.
powered by ChangeDetection
| Line 31: | Line 31: | ||
(17000 rows inserted) | (17000 rows inserted) | ||
| − | |||
| Line 144: | Line 143: | ||
==ab_operator== | ==ab_operator== | ||
| − | Excel file was obtained from Business Associate Codes (ST104A) on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared. | + | The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared. |
| + | |||
| + | |||
| + | Excel formulas extracted the address components. The columns in the Excel sheet were as follows: | ||
| + | |||
| + | |||
| + | A: BA Code | ||
| + | |||
| + | B: Company Name | ||
| + | |||
| + | C: Phone | ||
| − | + | D: Address Long | |
| + | E: Address (i.e. house/box number) | ||
| − | + | F: Province | |
| + | G: Postal Code | ||
| − | + | H: Country | |
| + | I: Position of Last Comma in D | ||
| − | + | J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities) | |
| − | + | K: City Name Long (only filled for multi-word cities) | |
| + | L: City | ||
| − | |||
| − | + | The formulas were: | |
| + | * Last Comma (I): =IF(ISERROR(SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",",""))))),"",SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",",""))))) | ||
| − | + | * Province (F): =IF(ISERROR(MID(D2,I2+2,2)),"",MID(D2,I2+2,2)) | |
| + | * Postal Code (G): =IF(ISERROR(MID(D2,I2+5,LEN(D2)-I2-4)),"",MID(D2,I2+5,LEN(D2)-I2-4)) | ||
| − | =IF(ISERROR( | + | * Country (H): =IF(NOT(ISERROR(VLOOKUP($F2, [provinces.xlsx]Sheet1!$B$2:$B$14,1, FALSE))), "Canada", IF(NOT(ISERROR(VLOOKUP($F2, [states.xlsx]Sheet1!$B$2:$B$51,1, FALSE))), "United States", "" )) |
| + | * Last Word Before Province (J): =IF(ISERROR(RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))),"",RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))) | ||
| − | From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2. The true city name was inserted into L2 by =IF(K2="",J2,K2). | + | * From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2. The true city name was inserted into L2 by =IF(K2="",J2,K2). |
| + | |||
| + | * Address (E): =IF(ISERROR(TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))),"",TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))) - removal of all components from the long address to produce the short address | ||
The final Excel file was saved as a CSV. | The final Excel file was saved as a CSV. | ||
| + | |||
| + | |||
| + | ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future. | ||
| Line 204: | Line 224: | ||
ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country | ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country | ||
| − | |||
| Line 221: | Line 240: | ||
Some blank strings where still present, so the following were used as well: | Some blank strings where still present, so the following were used as well: | ||
| − | |||
| Line 235: | Line 253: | ||
| − | For consistency of data: | + | ''For consistency of data:'' |
Tables are backed up using mysqldump to ~/database_backups/ before starting.
CSVs are copied to the server. Their group and ownership are changed to 'mysql' using the chgrp and chown commands. They are then moved to /var/lib/mysql/wellsdb_dev/ such that they can be read by mysql. Ensure that CSVs are free of special characters (e.g. \n and \) and extra quotation marks ".
The Reserves Sequence Code column was deleted from Excel document because it is not included in our database table. 524 duplicate rows were then deleted and the file saved as a Windows Comma Separated (.csv).
Importing:
CREATE TABLE ab_pool_codes_tmp AS (SELECT * FROM ab_pool_codes where 0=1)
LOAD DATA INFILE 'CommingledPoolList.csv'
INTO TABLE ab_pool_codes_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(field_name, production_pool, field_code, production_pool_code, @dummy, geological_pool_code, geological_pool_name, confidential)
(17000 rows inserted)
INSERT INTO ab_pool_codes
SELECT * FROM ab_pool_codes_tmp t2
WHERE (t2.field_name, t2.production_pool, t2.field_code, t2.production_pool_code, t2.geological_pool_code, t2.geological_pool_name, t2.confidential)
NOT IN (SELECT field_name, production_pool, field_code, production_pool_code, geological_pool_code, geological_pool_name, confidential FROM ab_pool_codes)
(17000 rows added - there were no duplicates between the old and new data)
DROP TABLE ab_pool_codes_tmp
Setting blank fields to NULL:
UPDATE ab_pool_codes
SET geological_pool_name=NULL
WHERE geological_pool_name=
UPDATE ab_pool_codes
SET geological_pool_name=NULL
WHERE LENGTH(geological_pool_name)=1
UPDATE ab_pool_codes
SET confidential=NULL
WHERE LENGTH(confidential)=1
Text files were retrieved from http://www.aer.ca/data-and-publications/statistical-reports/st102, opened in Excel, and saved as CSV.
Importing:
CREATE TABLE ab_facility_tmp AS (SELECT * FROM ab_facility where 0=1)
LOAD DATA INFILE 'ActiveFacility.csv'
INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)
LOAD DATA INFILE 'InactiveFacility.csv'
INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)
(39314 + 66767 = 106081 rows inserted)
INSERT INTO ab_facility
SELECT * FROM ab_facility_tmp t2
WHERE (t2.FACILITY_ID, t2.FACILITY_NAME, t2.OPERATOR_CODE, t2.OPERATOR_NAME, t2.SUB_TYPE_CODE, t2.SUB_TYPE, t2.LE, t2.LSD, t2.SEC, t2.TWP, t2.RNG, t2.MER, t2.LICENSE_NUMBER, t2.EDCT_CODE, t2.EDCT_DESCRIPTION, t2.LICENSE_CODE, t2.OPERATIONAL_STATUS)
NOT IN (SELECT FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS FROM ab_facility)
(87923 rows added)
DROP TABLE ab_facility_tmp
Setting blank fields to NULL:
The following query was run to replace blanks with NULLs in all columns except well_id:
UPDATE ab_facility
SET FACILITY_ID=NULL
WHERE LENGTH(FACILITY_ID)=0
Similar operations where performed to replace blank fields with a length of 1 (taking care not to erase any real single digit values). Columns affected were LSD, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, and LICENSE_CODE.
The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared.
Excel formulas extracted the address components. The columns in the Excel sheet were as follows:
A: BA Code
B: Company Name
C: Phone
D: Address Long
E: Address (i.e. house/box number)
F: Province
G: Postal Code
H: Country
I: Position of Last Comma in D
J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities)
K: City Name Long (only filled for multi-word cities)
L: City
The formulas were:
The final Excel file was saved as a CSV.
ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future.
Importing:
ab_operator_test is an intermediate table used to avoid duplicate key error #1062.
create table ab_operator_test select * from ab_operator where 1=0
LOAD DATA INFILE 'BusinessAssociateCodes_proper_cols.csv'
INTO TABLE ab_operator_test FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(ba_code, company, phone, @dummy, address, province, postal_code, country, @dummy, @dummy, @dummy, city)
INSERT INTO ab_operator (ba_code, company, address, city, province, postal_code, phone, country)
SELECT ba_code, company, address, city, province, postal_code, phone, country FROM ab_operator_test t2
ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country
DROP TABLE ab_operator_test
Setting blank fields to NULL:
UPDATE ab_operator
SET city=NULL
WHERE city=
The same operation was performed on province, postal_code, phone, and country.
Some blank strings where still present, so the following were used as well:
UPDATE ab_operator
SET city=NULL
WHERE length(city)<2
UPDATE ab_operator SET phone = REPLACE(REPLACE(phone, '\r', ), '\n', )
(followed by a repeat of the initial update statement)
For consistency of data:
update ab_operator
set country='United States'
where country='US'