Difference between revisions of "Alberta Technical Information"

Revision as of 15:32, 14 July 2015

General

Tables are backed up using mysqldump to ~/database_backups/ before starting.

CSVs are copied to the server. Their group and ownership are changed to 'mysql' using the chgrp and chown commands. They are then moved to /var/lib/mysql/wellsdb_dev/ such that they can be read by mysql. Ensure that CSVs are free of special characters (e.g. \n and \) and extra quotation marks ".

ab_pool_codes

The Reserves Sequence Code column was deleted from Excel document because it is not included in our database table. 524 duplicate rows were then deleted and the file saved as a Windows Comma Separated (.csv).

Importing:

CREATE TABLE ab_pool_codes_tmp AS (SELECT * FROM ab_pool_codes where 0=1)

LOAD DATA INFILE 'CommingledPoolList.csv'

INTO TABLE ab_pool_codes_tmp FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(field_name, production_pool, field_code, production_pool_code, @dummy, geological_pool_code, geological_pool_name, confidential)

(17000 rows inserted)

INSERT INTO ab_pool_codes

SELECT * FROM ab_pool_codes_tmp t2

WHERE (t2.field_name, t2.production_pool, t2.field_code, t2.production_pool_code, t2.geological_pool_code, t2.geological_pool_name, t2.confidential)

NOT IN (SELECT field_name, production_pool, field_code, production_pool_code, geological_pool_code, geological_pool_name, confidential FROM ab_pool_codes)

(17000 rows added - there were no duplicates between the old and new data)

DROP TABLE ab_pool_codes_tmp

Setting blank fields to NULL:

UPDATE ab_pool_codes

SET geological_pool_name=NULL

WHERE geological_pool_name=

UPDATE ab_pool_codes

SET geological_pool_name=NULL

WHERE LENGTH(geological_pool_name)=1

UPDATE ab_pool_codes

SET confidential=NULL

WHERE LENGTH(confidential)=1

ab_facility

Text files were retrieved from http://www.aer.ca/data-and-publications/statistical-reports/st102, opened in Excel, and saved as CSV.

Importing:

CREATE TABLE ab_facility_tmp AS (SELECT * FROM ab_facility where 0=1)

LOAD DATA INFILE 'ActiveFacility.csv'

INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)

LOAD DATA INFILE 'InactiveFacility.csv'

INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)

(39314 + 66767 = 106081 rows inserted)

INSERT INTO ab_facility

SELECT * FROM ab_facility_tmp t2

WHERE (t2.FACILITY_ID, t2.FACILITY_NAME, t2.OPERATOR_CODE, t2.OPERATOR_NAME, t2.SUB_TYPE_CODE, t2.SUB_TYPE, t2.LE, t2.LSD, t2.SEC, t2.TWP, t2.RNG, t2.MER, t2.LICENSE_NUMBER, t2.EDCT_CODE, t2.EDCT_DESCRIPTION, t2.LICENSE_CODE, t2.OPERATIONAL_STATUS)

NOT IN (SELECT FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS FROM ab_facility)

(87923 rows added)

DROP TABLE ab_facility_tmp

Setting blank fields to NULL:

The following query was run to replace blanks with NULLs in all columns except well_id:

UPDATE ab_facility

SET FACILITY_ID=NULL

WHERE LENGTH(FACILITY_ID)=0

Similar operations where performed to replace blank fields with a length of 1 (taking care not to erase any real single digit values). Columns affected were LSD, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, and LICENSE_CODE.

ab_operator

The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ".", "-", "1", "na", "n/a", "none", "unknown", "not found", and "no address found" were cleared, as well as one case of city and postal codes being "unknown".

Excel formulas extracted the address components. The columns in the Excel sheet were as follows:

A: BA Code

B: Company Name

C: Phone

D: Address Long

E: Address (i.e. house/box number)

F: Province

G: Postal Code

H: Country

I: Position of Last Comma in D

J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities)

K: City Name Long (only filled for multi-word cities)

L: City

The formulas were:

Last Comma (I): =IF(ISERROR(SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",",""))))),"",SEARCH("^^",SUBSTITUTE(D2,",","^^",LEN(D2)-LEN(SUBSTITUTE(D2,",","")))))

Province (F): =IF(ISERROR(MID(D2,I2+2,2)),"",MID(D2,I2+2,2))

Postal Code (G): =IF(ISERROR(MID(D2,I2+5,LEN(D2)-I2-4)),"",MID(D2,I2+5,LEN(D2)-I2-4))

Country (H): =IF(NOT(ISERROR(VLOOKUP($F2, [provinces.xlsx]Sheet1!$B$2:$B$14,1, FALSE))), "Canada", IF(NOT(ISERROR(VLOOKUP($F2, [states.xlsx]Sheet1!$B$2:$B$51,1, FALSE))), "United States", "" ))

Last Word Before Province (J): =IF(ISERROR(RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ","")))))),"",RIGHT(LEFT(D2,I2-1),LEN(LEFT(D2,I2-1))-SEARCH("^^",SUBSTITUTE(LEFT(D2,I2-1)," ","^^",LEN(LEFT(D2,I2-1))-LEN(SUBSTITUTE(LEFT(D2,I2-1)," ",""))))))

From here, data was sorted based on province/state and a manual search located multi-word cities, whose proper names were then recorded in K2. The true city name was inserted into L2 by =IF(K2="",J2,K2).

Address (E): =IF(ISERROR(TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))),"",TRIM(SUBSTITUTE(LEFT(D2,I2-1),L2,""))) - removal of all components from the long address to produce the short address

The final Excel file was saved as a CSV.

ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future.

Importing:

ab_operator_test is an intermediate table used to avoid duplicate key error #1062.

create table ab_operator_test select * from ab_operator where 1=0

LOAD DATA INFILE 'BusinessAssociateCodes_proper_cols.csv'

INTO TABLE ab_operator_test FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(ba_code, company, phone, @dummy, address, province, postal_code, country, @dummy, @dummy, @dummy, city)

INSERT INTO ab_operator (ba_code, company, address, city, province, postal_code, phone, country)

SELECT ba_code, company, address, city, province, postal_code, phone, country FROM ab_operator_test t2

ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country

DROP TABLE ab_operator_test

Setting blank fields to NULL:

UPDATE ab_operator SET city=NULL WHERE city=

The same operation was performed on province, postal_code, phone, and country.

Some blank strings where still present, so the following were used as well:

UPDATE ab_operator

SET city=NULL

WHERE length(city)<2

UPDATE ab_operator SET phone = REPLACE(REPLACE(phone, '\r', ), '\n', ) (followed by a repeat of the initial update statement)

For consistency of data:

update ab_operator

set country='United States'

where country='US'

ab_violations

The June 2014 data was presented as tables in a PDF document, and extracted by pasting the tables first into Word, then into Excel. The AER Group / Compliance Category column was split into two separate columns to insert into AER_GROUP and COMPLIANCE_CATEGORY in the database. ENFORCEMENT_ACTION_CATEGORY and RISK_CATEGORY were filled out based on the PDF document table headings. The ID # and AER Actions and Licensee Response/Follow-up columns were not used. The Excel document was saved as a CSV.

Importing June 2014:

LOAD DATA INFILE 'June2014_Enforcement_Action_Summary.csv'

INTO TABLE ab_violations FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(@dummy, LICENSEE, @dummy, NONCOMPLIANCE_DESC, ENFORCEMENT_DATE, LOCATION, ENFORCEMENT_ACTION_CATEGORY, RISK_CATEGORY, AER_GROUP, COMPLIANCE_CATEGORY)

The July 2014-June 2015 data was copied from http://www1.aer.ca/compliancedashboard/enforcement.html into Excel and saved in CSV format. The enforcement data here was less extensive than the earlier records.

Importing July 2014-June 2015:

LOAD DATA INFILE 'compliance_june_2015.csv'

INTO TABLE ab_violations FIELDS TERMINATED BY ','

ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE 1 ROWS

(@dummy, LICENSEE, NONCOMPLIANCE_DESC, ENFORCEMENT_DATE)

Revision as of 09:57, 14 July 2015 (view source) Hailey (talk \| contribs) ← Older edit		Revision as of 15:32, 14 July 2015 (view source) Hailey (talk \| contribs) (→‎ab_operator) Newer edit →
Line 143:		Line 143:
	==ab_operator==		==ab_operator==

−	The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared.	+	The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ".", "-", "1", "na", "n/a", "none", "unknown", "not found", and "no address found" were cleared, as well as one case of city and postal codes being "unknown".

Track changes made to this page