Database Management
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsDatabasesDatabase Management

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
Stop making mediocre tutorials.The best tutorials are video! Camtasia Studio makes it easy to create engaging, buzz-building screen videos at any size, in any popular format. Download the free trial!
  #1  
Old June 19th, 2003, 10:16 AM
rod k rod k is offline
Apprentice Deity
Dev Shed Loyal (3000 - 3499 posts)
 
Join Date: Jul 1999
Location: Niagara Falls (On the wrong side of the gorge)
Posts: 3,237 rod k User rank is Private First Class (20 - 50 Reputation Level)rod k User rank is Private First Class (20 - 50 Reputation Level) 
Time spent in forums: 4 m 8 sec
Reputation Power: 12
Send a message via AIM to rod k
Normalizing a non-normalized data source

Here's the deal:
1) Data source is a CSV
2) Data source is NOT normalized
3) Multiple fields for similar data
4) Data entry was free form not restricted (see below)
5) New data will be downloaded weekly (if not more frequently)
6) There is NO chance future data sources will be normalized, nor data entry to be fixed (i.e. we have no control over that aspect)


Re #4: e.g. one field is school district and may contain multiple values for the same district (one user may enter "Town Name School District" another "Town Name SD" and yet another "Town Name Schools"

Project specs call for most of these fields to be searchable and therefore need to be normalized.

Importing must be done automatically, i.e. a human cannot "fix" the unnormalized data at import time. But later must be able to "fix" the non-normalized data.

I've thought about using 3 tables:
1) the normalized table. In the SD exmample:
school_district_ID - sequence
school_district_name - name of school district

2) an alternate table
school_district_ID - from table one
school_district_alt - alternate name of school district

3) unnormalized
school_district_ID - sequence
school_district_name - name of school district

When a new record is imported the name provided would be compared against each table in succession and if not found would be added to the unnormalized table.

When "human intervention" takes place a list from the unnormalized table would be provided and the user would need to assign that to it's proper normalized name (which would add the name to the alternate table for future matching) or add a new entry in the normalized table (assuming that the unmatched item didn't exist previously). In both cases the entry would be removed from the table containing the non normalized data and the records referencing the ID would be updated.

To do this the sequences from tables 1 and 3 would have to be guaranteed to be unique across both tables. They could use the same sequence or use seperate sequences with a much higher starting value for one (with a corresponding max for the other). The benefit to the later is that I could programmatically know which table the reference is for, without doing a lookup.

Anybody have any other ideas on how to deal with this?
__________________
FSBO (For Sale By Owner) Realty

Reply With Quote
  #2  
Old June 19th, 2003, 10:41 AM
pabloj's Avatar
pabloj pabloj is offline
Modding: Oracle MsSQL Firebird
Dev Shed God 5th Plane (7000 - 7499 posts)
 
Join Date: Jun 2001
Location: Outside US
Posts: 7,409 pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level)pabloj User rank is Captain (20000 - 30000 Reputation Level) 
Time spent in forums: 2 Months 2 Weeks 9 h 28 m 23 sec
Reputation Power: 255
Can full text index on "school district" help solving the search problem?

Reply With Quote
  #3  
Old June 19th, 2003, 10:50 AM
rod k rod k is offline
Apprentice Deity
Dev Shed Loyal (3000 - 3499 posts)
 
Join Date: Jul 1999
Location: Niagara Falls (On the wrong side of the gorge)
Posts: 3,237 rod k User rank is Private First Class (20 - 50 Reputation Level)rod k User rank is Private First Class (20 - 50 Reputation Level) 
Time spent in forums: 4 m 8 sec
Reputation Power: 12
Send a message via AIM to rod k
No, not really. The school district is just an example. There are dozens of fields some would be M:1 some M:M relations. If that were the only field, then that might be an option. But with as many as we are talking about it's not viable. Plus the fact that some of the abbreviations used in some fields wouldn't be really conducive to a full text search.

Reply With Quote
Reply

Viewing: Dev Shed ForumsDatabasesDatabase Management > Normalizing a non-normalized data source


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

 Free IT White Papers!
 
Accelerating Trading Partner Performance
One in five. That's how many partner transactions have at least one error. That is an amazing statistic, particularly given the extraordinary leaps in innovation across the global supply chain during the past two decades. Download this white paper to learn more.

 
Competing on Analytics
This Tech Analysis is designed to help identify characteristics shared by analytics competitors, and includes information about 32 organizations that have made a commitment to quantitative, fact-based analysis.

 
Cost Effective Scaling with Virtualization and Coyote Point Systems
An overview of the industry trend toward virtualization, how server consolidation has increased the importance of application uptime and the steps being taken to integrate load balancing technology with virtualized servers.

 
Five Checkpoints to Implementing IP Telephony
Implementation planning for IP PBX software and IP telephony has become vital as businesses replace discontinued legacy PBX phone systems. This informative whitepaper outlines five "checkpoints" for any implementation plan that will help make IP communications a successful proposition.

 
Hosted Email Security: Staying Ahead of New Threats
In the last two years, email has become a fierce battleground between the nefarious forces of spam and malware, and the heroes of messaging protection. The spam volumes increased alarmingly every month, bringing clever new forms of phishing and virus propagation attacks.

 

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 6 hosted by Hostway