|
|
|||||||||
|
|||||||||
| |||||||||
|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
|
|
Stop making mediocre tutorials.The best tutorials are video! Camtasia Studio makes it easy to create engaging, buzz-building screen videos at any size, in any popular format. Download the free trial!
|
|
#1
|
|||
|
|||
|
Normalizing a non-normalized data source
Here's the deal:
1) Data source is a CSV 2) Data source is NOT normalized 3) Multiple fields for similar data 4) Data entry was free form not restricted (see below) 5) New data will be downloaded weekly (if not more frequently) 6) There is NO chance future data sources will be normalized, nor data entry to be fixed (i.e. we have no control over that aspect) Re #4: e.g. one field is school district and may contain multiple values for the same district (one user may enter "Town Name School District" another "Town Name SD" and yet another "Town Name Schools" Project specs call for most of these fields to be searchable and therefore need to be normalized. Importing must be done automatically, i.e. a human cannot "fix" the unnormalized data at import time. But later must be able to "fix" the non-normalized data. I've thought about using 3 tables: 1) the normalized table. In the SD exmample: school_district_ID - sequence school_district_name - name of school district 2) an alternate table school_district_ID - from table one school_district_alt - alternate name of school district 3) unnormalized school_district_ID - sequence school_district_name - name of school district When a new record is imported the name provided would be compared against each table in succession and if not found would be added to the unnormalized table. When "human intervention" takes place a list from the unnormalized table would be provided and the user would need to assign that to it's proper normalized name (which would add the name to the alternate table for future matching) or add a new entry in the normalized table (assuming that the unmatched item didn't exist previously). In both cases the entry would be removed from the table containing the non normalized data and the records referencing the ID would be updated. To do this the sequences from tables 1 and 3 would have to be guaranteed to be unique across both tables. They could use the same sequence or use seperate sequences with a much higher starting value for one (with a corresponding max for the other). The benefit to the later is that I could programmatically know which table the reference is for, without doing a lookup. Anybody have any other ideas on how to deal with this?
__________________
FSBO (For Sale By Owner) Realty |
|
#2
|
||||
|
||||
|
Can full text index on "school district" help solving the search problem?
|
|
#3
|
|||
|
|||
|
No, not really. The school district is just an example. There are dozens of fields some would be M:1 some M:M relations. If that were the only field, then that might be an option. But with as many as we are talking about it's not viable. Plus the fact that some of the abbreviations used in some fields wouldn't be really conducive to a full text search.
|
![]() |
| Viewing: Dev Shed Forums > Databases > Database Management > Normalizing a non-normalized data source |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|
|