If you find this page useful and would like to be notified of changes made to this page, start by inputting your email below.
powered by ChangeDetection
(→ab_operator) |
|||
| Line 143: | Line 143: | ||
==ab_operator== | ==ab_operator== | ||
| − | The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ., -, 1, na, n/a, none, unknown, not found, and no address found were cleared. | + | The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ".", "-", "1", "na", "n/a", "none", "unknown", "not found", and "no address found" were cleared, as well as one case of city and postal codes being "unknown". |
Tables are backed up using mysqldump to ~/database_backups/ before starting.
CSVs are copied to the server. Their group and ownership are changed to 'mysql' using the chgrp and chown commands. They are then moved to /var/lib/mysql/wellsdb_dev/ such that they can be read by mysql. Ensure that CSVs are free of special characters (e.g. \n and \) and extra quotation marks ".
The Reserves Sequence Code column was deleted from Excel document because it is not included in our database table. 524 duplicate rows were then deleted and the file saved as a Windows Comma Separated (.csv).
Importing:
CREATE TABLE ab_pool_codes_tmp AS (SELECT * FROM ab_pool_codes where 0=1)
LOAD DATA INFILE 'CommingledPoolList.csv'
INTO TABLE ab_pool_codes_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(field_name, production_pool, field_code, production_pool_code, @dummy, geological_pool_code, geological_pool_name, confidential)
(17000 rows inserted)
INSERT INTO ab_pool_codes
SELECT * FROM ab_pool_codes_tmp t2
WHERE (t2.field_name, t2.production_pool, t2.field_code, t2.production_pool_code, t2.geological_pool_code, t2.geological_pool_name, t2.confidential)
NOT IN (SELECT field_name, production_pool, field_code, production_pool_code, geological_pool_code, geological_pool_name, confidential FROM ab_pool_codes)
(17000 rows added - there were no duplicates between the old and new data)
DROP TABLE ab_pool_codes_tmp
Setting blank fields to NULL:
UPDATE ab_pool_codes
SET geological_pool_name=NULL
WHERE geological_pool_name=
UPDATE ab_pool_codes
SET geological_pool_name=NULL
WHERE LENGTH(geological_pool_name)=1
UPDATE ab_pool_codes
SET confidential=NULL
WHERE LENGTH(confidential)=1
Text files were retrieved from http://www.aer.ca/data-and-publications/statistical-reports/st102, opened in Excel, and saved as CSV.
Importing:
CREATE TABLE ab_facility_tmp AS (SELECT * FROM ab_facility where 0=1)
LOAD DATA INFILE 'ActiveFacility.csv'
INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)
LOAD DATA INFILE 'InactiveFacility.csv'
INTO TABLE ab_facility_tmp FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS)
(39314 + 66767 = 106081 rows inserted)
INSERT INTO ab_facility
SELECT * FROM ab_facility_tmp t2
WHERE (t2.FACILITY_ID, t2.FACILITY_NAME, t2.OPERATOR_CODE, t2.OPERATOR_NAME, t2.SUB_TYPE_CODE, t2.SUB_TYPE, t2.LE, t2.LSD, t2.SEC, t2.TWP, t2.RNG, t2.MER, t2.LICENSE_NUMBER, t2.EDCT_CODE, t2.EDCT_DESCRIPTION, t2.LICENSE_CODE, t2.OPERATIONAL_STATUS)
NOT IN (SELECT FACILITY_ID, FACILITY_NAME, OPERATOR_CODE, OPERATOR_NAME, SUB_TYPE_CODE, SUB_TYPE, LE, LSD, SEC, TWP, RNG, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, LICENSE_CODE, OPERATIONAL_STATUS FROM ab_facility)
(87923 rows added)
DROP TABLE ab_facility_tmp
Setting blank fields to NULL:
The following query was run to replace blanks with NULLs in all columns except well_id:
UPDATE ab_facility
SET FACILITY_ID=NULL
WHERE LENGTH(FACILITY_ID)=0
Similar operations where performed to replace blank fields with a length of 1 (taking care not to erase any real single digit values). Columns affected were LSD, MER, LICENSE_NUMBER, EDCT_CODE, EDCT_DESCRIPTION, and LICENSE_CODE.
The Excel file was obtained from the Business Associate Codes (ST104A) section on http://www.aer.ca/data-and-publications/statistical-reports/st104. Originally, the address was given in a single column and contained the street number, city, province, and postal code. These individual components were extracted into separate fields. Leading and trailing whitespace was removed from cells and the address cells containing ".", "-", "1", "na", "n/a", "none", "unknown", "not found", and "no address found" were cleared, as well as one case of city and postal codes being "unknown".
Excel formulas extracted the address components. The columns in the Excel sheet were as follows:
A: BA Code
B: Company Name
C: Phone
D: Address Long
E: Address (i.e. house/box number)
F: Province
G: Postal Code
H: Country
I: Position of Last Comma in D
J: Last Word Before Province (i.e. last word of city name for multi-word cities, or full city name for single-word cities)
K: City Name Long (only filled for multi-word cities)
L: City
The formulas were:
The final Excel file was saved as a CSV.
ab_operator_city_names.xlsx has been added to WellWiki Data on Google Drive to assist with the process of assigning city names in the future.
Importing:
ab_operator_test is an intermediate table used to avoid duplicate key error #1062.
create table ab_operator_test select * from ab_operator where 1=0
LOAD DATA INFILE 'BusinessAssociateCodes_proper_cols.csv'
INTO TABLE ab_operator_test FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(ba_code, company, phone, @dummy, address, province, postal_code, country, @dummy, @dummy, @dummy, city)
INSERT INTO ab_operator (ba_code, company, address, city, province, postal_code, phone, country)
SELECT ba_code, company, address, city, province, postal_code, phone, country FROM ab_operator_test t2
ON DUPLICATE KEY UPDATE ba_code=t2.ba_code, company=t2.company, address=t2.address, city=t2.city, province=t2.province, postal_code=t2.postal_code, phone=t2.phone, country=t2.country
DROP TABLE ab_operator_test
Setting blank fields to NULL:
UPDATE ab_operator
SET city=NULL
WHERE city=
The same operation was performed on province, postal_code, phone, and country.
Some blank strings where still present, so the following were used as well:
UPDATE ab_operator
SET city=NULL
WHERE length(city)<2
UPDATE ab_operator SET phone = REPLACE(REPLACE(phone, '\r', ), '\n', )
(followed by a repeat of the initial update statement)
For consistency of data:
update ab_operator
set country='United States'
where country='US'
The June 2014 data was presented as tables in a PDF document, and extracted by pasting the tables first into Word, then into Excel. The AER Group / Compliance Category column was split into two separate columns to insert into AER_GROUP and COMPLIANCE_CATEGORY in the database. ENFORCEMENT_ACTION_CATEGORY and RISK_CATEGORY were filled out based on the PDF document table headings. The ID # and AER Actions and Licensee Response/Follow-up columns were not used. The Excel document was saved as a CSV.
Importing June 2014:
LOAD DATA INFILE 'June2014_Enforcement_Action_Summary.csv'
INTO TABLE ab_violations FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(@dummy, LICENSEE, @dummy, NONCOMPLIANCE_DESC, ENFORCEMENT_DATE, LOCATION, ENFORCEMENT_ACTION_CATEGORY, RISK_CATEGORY, AER_GROUP, COMPLIANCE_CATEGORY)
The July 2014-June 2015 data was copied from http://www1.aer.ca/compliancedashboard/enforcement.html into Excel and saved in CSV format. The enforcement data here was less extensive than the earlier records.
Importing July 2014-June 2015:
LOAD DATA INFILE 'compliance_june_2015.csv'
INTO TABLE ab_violations FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(@dummy, LICENSEE, NONCOMPLIANCE_DESC, ENFORCEMENT_DATE)