July 19th, 2013, 09:55 AM
Haha That made me laugh. I will spend the next day designing a schema. I won't implement it just yet. I will come back to you with an image of how the table's will look and how they will relate. I will attempt to add tuple's to each of the tables. However, I don't think I will know exactly everything I will need. An example of this is the metadata table.
I will get back to you tomorrow
July 19th, 2013, 10:22 PM
There's really no better way to express something like a schema than just writing it out in some data definition language. The DDL of SQL could be better, but it isn't particularly bad for this. Don't get in the habit of diagramming or trying to whip up visuals beyond pencil sketches; its self defeating and a dirty mental habit best left to committee and sales types who need them to excite sponsors and customers (and themselves) with tales of the misunderstood.
Originally Posted by JoeyG1717
If you just draw a sketch right now, on your desk, of boxes or circle or smiley faces, each representing an entity (files, users, client devices etc.) and then try to point them to one another the way you think they should naturally relate you'll realize two things: by the time you are halfway done drawing the arrows you already are thinking how to make them better (your hand moves slower than your mind), and that the initial act of forcing yourself to segregate the ideas into boxes on the page set some concrete rules (a good first channelization of your effort).
And then you should try typing out a schema in DDL. Sure, you'll keep the sketch because it makes you feel like you won't forget everything, but pretty soon the sketch will be forgotten because the DDL will force more attention to detail on you, detail which is embodied in the DDL but not the sketch. That's where you go from having a nebulous idea to expressing something concrete that is testably workable or not (either outcome is useful, sketches aren't).
See, the DDL can express everything about the schema, because it is the schema. A sketch, visual, GUI builder tool, whatever, can only show you what your eyes can take in at once, so a lot gets glossed over. All that glossed over stuff tends to get actually forgotten in implementation later (because it becomes hard) and the world is chock full of crap, inconsistent data everywhere.
There's a whole industry built on data model visualization. Don't get sucked in. There is a huge difference between data model visualization and data visualization.
So show me a really simple SQL schema with foreign keys. Learn to think in those terms.
Last edited by zxq9; July 19th, 2013 at 10:24 PM.
July 20th, 2013, 05:46 AM
Originally Posted by zxq9
It's amazing when I speak to programmers and other people in the computing industry that they tell me not to do things when my University tell you the complete opposite. My database module (mysql) is all about creating VISUAL ERD's with stupid lines etc etc. Coding a database from scratch with the prompt was a much better way of understanding the syntax. I don't understand why my course seems to be so out of date when it comes to industry needed skills. Getting a Job next year will be fun.
Anyway, thanks for the added guidance with the SQL schema. I am Working over the next 3 days (I just never stop). I hope I can reply with a schema today at some point.
July 20th, 2013, 09:37 PM
This is part of the difference between the Ivory Tower of Academia and the Smelly, Filthy Real World.
When the focus is communication and not development visuals and diagramming techniques hold a seductive appeal. Its hard to get students to pay attention to a class full of code (unless they are math and engineering students). So its nice to have visuals. The problem is when you start realizing that the visuals don't really express the detail you want, and its easy to start thinking "man, it'd be great if there was a standard way -- a visual language -- in which to express relational/OO/FP/data/stack/flow/etc. concepts". And so someone somewhere starts working on this, with little branches and arrows of different forms that mean different things. At first its no more complex than what you'd do with a pencil prior to writing a schema.
Until the guy starts thinking "That worked so well for quickly expressing entity connections... but what if the entities live in an object system? Can I still diagram that?". Now he's opened a real can of worms, one that requires tens of millions in taxpayer money to fund huge standards committee efforts that run for over a decade and produce something like UML -- a standard so huge, baroque and hard to use that nobody does.
This is, of course, totally self defeating when the purpose is really just to give the easily distracted something to look at before shoving a page of code in their face and telling them to concentrate on it. It took me a long time to realize that this was the situation with modeling standards.
So my basic rule to visual modeling is that if you can't do it on a blackboard or piece of paper using only one color in less than 5 minutes then you are dealing with something that is either too detailed* to model visually or something you don't understand well enough to be teaching others yet. A related rule is that visual models are all destined for the trash bin anyway, so treating them as anything other than the trash they are is effort better spent working out the details in a coded version** of the model.
[* "Detail" here is really all about the choice and scale of the abstractions you're working in. This is another thing that took me quite a while to get a feel for.
** And even data languages aren't all created equally. Some of SQL's deficiencies have come to bother me so much that I'm working on the design of a different data language and implementing it as an extension to the Postgres engine. SQL is neither relational calculus nor relational algebra, a decision that makes for really bizarre corner cases. One consequence is that advanced models require more than just the DDL part of the language to express.]
Last edited by zxq9; July 20th, 2013 at 09:42 PM.
October 7th, 2013, 07:16 AM
I lost access to my JoeyG1717 account and the dev shed forums password reset is not functioning correctly.
My 3 days of work I mentioned in my previous post turned into 3 months. I hate web developing. I am now back to tackle my original project. Re-reading this whole thread is probably the best thing I have done to date so far. I was completely reliant on the web-based area of my system when in reality its just a form that calls a function to my client system.
Now getting back to the point, I was developing my postgresql database schema and I was hoping you could point me in the right direction for it.
I haven't implemented it yet but this is what I have got so far.
file_ID - PK int
file_name - varchar
file_data - text??? < not sure
file_size - long long or ??
date_added - date
user_ID - FK -int
user_ID PK -int
name - varchar
surname - varchar
username - varchar
email - varchar
password - varchar
This was as far as I got and I know its simple. I haven't thought about anything to do with encryption.
What do you think so far?
anything else I need to consider?
I really hope you are still around for advice
hope to here from you.
October 7th, 2013, 02:53 PM
Hope you're ready for a long, meandering reply...
Welcome to the Learning Curve. When you don't know anything at all everything is hard. Like moving mountains hard. But over time everything comes much, much clearer, so don't give up. 3 months is just about right for someone to get over the slimy part of larval stage.
Originally Posted by JoeyG1818
Welcome to the architectural mistake that has defined the last 15 years or so.
I told you to disregard the web for now (er, for then... whatever). But this is something you have to understand your own way before the whole "forget about the web for now" idea really makes sense. This is just one of the conceptual steps on your path up the learning curve.
That last statement should be encouraging, not make you feel bad. Look back now at how much stuff you've learned and how different your perspective is from when you started. That's a special kind of progress which is, after all, the most fundamental reason for attending school in the first place (another point often overlooked today...).
I'm in the mood today, so I'll go ahead and implement a version of that for you so you can see what I mean about using a DDL to model in:
CREATE TABLE USER
(username VARCHAR(20) PRIMARY KEY,
realname VARCHAR(40) NOT NULL,
password VARCHAR(100) NOT NULL);
CREATE TABLE file
(path VARCHAR(512) PRIMARY KEY,
name VARCHAR(50) DEFAULT '' NOT NULL,
SIZE INTEGER NOT NULL,
last_update timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL,
USER VARCHAR(20) REFERENCES USER NOT NULL);
Your version looks like a typical web-guy schema. Lots of meaningless data and a few things that you'll probably never actually use. Example: First/Last name on a proof of concept system? Nah, don't need it. Just a name will do, and maybe not even that -- consider this website, for example. Minimization is the name of the game. You can add stuff later -- for now just don't get bogged down in anything that doesn't demonstrate your point.
You'll notice I took some liberty with the merciless axing of several columns and left only one as an integer value (yep, all those columns and only one integer...). You may also notice that there are no nulls, and anything that might be left legitimately blank has a default value. You may also notice that "file" has a "path" and a "name". The name is for whatever you call the file yourself, which might be blank if you don't have a pet name for it -- the file name itself is part of the path already, so there's no point in repeating it.
Let's think about uniqueness and meaning for a moment... actually, while we're at it, let's remember what a data base is actually for as well (well, an RDBMS anyway).
[Begin long discussion of RDBMSs]
Your original schema had integer primary keys for data that already has far more natural candidate keys. In the case of users you have a username field which should be unique, an email field which should also be unique (well, usually), and then this arbitrary number column called "id" that everybody seems to be so in love with. In this sort of case you should just pick one of the meaningful unique columns and use that as your PK, because a number isn't going to mean anything to you when you're cruising the db trying to work out how a query works or invent a new query to do something slick later on down the road. Those numeric IDs suck. Using the username itself has this magical benefit of letting you query just the file table and knowing exactly who, by name, owns a file.
You'll hear some (totally unsubstantiated) blather by people whose only language is PHP or Java and whose only DB experience is older versions of MySQL (which couldn't define actual FKs and therefore couldn't enforce constraints by using an index...) that "well, an integer key is the right choice because its faster to look up than text...blahblahblah". This is silly. Text is a number if you ask the computer -- actually, everything is a number to the computer, which means everything is hashable, which is precisely what happens in the index that gets created whenever you say something is a PK. So even if you use a numeric key, it gets hashed into some new number and that new number is placed in sorted order within the index so that the DB can do a really fast binary search instead of scanning to find that number (on the subject of binary search, you may find [this] amusing/interesting/a-complete-waste-of-time). The exact same thing happens with indexes over text as well (the numeric value of the text gets hashed into a new number), because there is no difference to the computer between an integer that represents itself and an integer that we decode as a representation of a text string (or a song, or some pr0nz, or whatever -- its all just numbers).
So integer IDs -- they suck and should be avoided in nearly every conceivable case. They don't mean anything, they always mandate that you join to at least one other table to make any query meaningful, and they don't contribute to uniqueness in the row. There are cases where you want numeric IDs, and those cases are usually obvious because they completely lack uniqueness and providing a numeric ID is the only way to make sure a row is unique (these cases are actually quite rare, though). Multi-column IDs rule, by the way -- and the more experience you gain with normalized schemata the more often you'll find that quite a few tables are nothing but a primary key, and those tables contain quite detailed data on their own and usually represent a complete answer to a question on their own (this will blow your mind the first time you stumble on it yourself).
And that brings me to what RDBMSs are actually for. And this matters. Its not there to be a "persistence" layer -- that's what a filesystem is for. Its also not to be purely temporary state within a process transaction that is occurring somewhere external to the DB itself (you'd be surprised how much web framework code actually iteratively updates DB tables in the middle of a loop that's got nothing to do with the final production of an answer without anyone realizing it).
RDBMSs are supposed to provide meaningful answers to questions.
So if you find yourself:
- Storing lots of data in the database that doesn't contribute to production of answers meaningful within the problem domain (or)
- Pulling lots of basic, raw facts from the database only to process them elsewhere so they can be sent along to a presentation layer somewhere else (or)
- Using it mostly/purely as a filesystem -- atop a filesystem... (or)
- Using the DB as a state storage mechanism in the middle of a loop function somewhere (or)
- Processing presentation details within the DB before passing it along to a presentation layer
Then you are missing the point. With these ideas in mind it is much easier to come up with solid, readable, meaningful schemas that not only provide a place to store permanent facts about your system, but actually represent the rules of your system by the definition of the data structure itself.
Read the last part of that last system again. A solid schema actually embodies your "business rules" (but not necessarily business logic) without a lot of procedural code required to define special cases or do a bunch of checks or case filtration. SQL is incomplete in this regard, so occasionally you need some external help, but generally speaking this is true, and it has a magically positive impact on projects where you take the time to design your data properly.
No worries about encryption so far, and that was a pretty good first stab at a schema. You might find things you want to add on to the schema later. That's not a problem, and its a bridge you can cross when you come to it.
Something I forgot to mention about the schema -- you may have noticed that the first thing in the file table is "path", and its a text value. That's because it is the path within the local filesystem to where the file is actually stored. Its usually best to make this a relative path, not an absolute path, and let the system configuration guide wherever the path root is. The db shouldn't have to worry about it unless its doing the file saving/sending itself, in which case the "system configuration" would be a new table of its own, probably with a single row or with two columns and a key-value list of settings and values per row. I've saved files directly in the DB with Postgres (and actually, I've never experienced a performance degradation because of this) but generally speaking I find it best to let file systems be file systems, and DB table be digests/indexes of meaningful data about those files. That lets me use any sort of client program or environment to retrieve files instead of being limited to ones that know how to talk to Postgres.
As far as "anything to add"... well, I gave it to you with both barrels above in the discussion about relational databases and what they are really for. On that note, you may find [this post I wrote in response to a simple schema question] and subsequent discussion interesting, I took the time to flesh out a schema and explain why it was done that way. Note the reactions as well. Some people think 5 tables (costing all of about 30 lines of code) is "pretty complicated" but for some reason don't bat an eye at the hundreds of lines of procedural code required to un-funk the raw, essentially unguaranteed (as in, possibly wrong) data and massage it into a meaningful answer outside the DB -- and this is totally ignoring the other lines of code required to sanitize data on the way in and process it for compliance with "business rules" which wind up getting defined in an endless forest of conditionals instead of just being simply laid into the original 30 lines that define those simple, non-null, interrelated tables. The false economy this represents always baffles me.
Its funny that you posted when you did -- I just came back to devshed yesterday for the first time in several months to change my registration email address (apparently some mailers refuse to route mail from this domain; which might have something to do with you losing your old account...).
Last edited by zxq9; October 7th, 2013 at 03:02 PM.
October 7th, 2013, 03:33 PM
I couldn't be more ready! I feel like a completely different programmer than before. I had to do a web project for a relative and it has made me realise how terrible web developing is as a career path. No thanks.
Originally Posted by zxq9
It's pretty crazy when you think about how the next generation of computer specialist are going to know almost nothing about Languages, unless they relate to the internet.
I do feel like I have learn't something already which is what the whole point of doing a degree is in the first place. However, I don't have a lot of time to get this system working. I am almost at the 'pulling my hair out' category of despair.
Thank you for restructuring my db Schema. It answered a few questions I had about what was 'needed'. I always wondered why lecturers were so 'in love' with these number 'id_pks'. I literally has asked myself the same thing you have mentioned. Maybe its just a way to teach people easily without confusing an entire class.
Your information that you posted regarding RDBMSs was really helpful. So Thank you.
I believe I am now heading in the right direction at least.
Now I have a schema which is implemented (only locally for now I might add). What is my next move? back to python for some coding I assume?
also, this system that you have been helping me develop is going to look impressive (I hope) to examiners. There is an element in my dissertation write up where I have to 'test' the system.
So effectively I am going to need to ethically 'hack' it. Now from what you have been telling me so far. This system should be pretty secure. I will probably have to break it a few times to allow me to 'prove my point'.
How would a hacker go about attacking a system like this, considering the client side is such an integral part of it?
oh really, Well that's a rather nice coincidence for me. I have tried talking to lecturers at my University about the concerns I have. none of them have a clue what they are on about..
I am still thinking at the back of my mind that I might have to simplify the system. I like python but I find myself spending hours researching syntax. Frustrating.
October 7th, 2013, 10:56 PM
Its good/terrible depending no what your goals are. For some the real goal is to employ an army of programmers to implement relatively trivial functionality because it gives the appearance that "something's being done", and the vast volumes at which activity is being had makes a small margin % over a huge base figure accordingly huge itself. Ultimately counter-productive, but much that we do today is. You'll find that the human component (the good and the bad) plays the dominant role in all things IT, which still strikes me as weird because nearly everything that we do breaks down to a stylized notation for abstract math...
Originally Posted by JoeyG1818
That would be fine if what you really meant wasn't "the web" instead of "the internet". I can't imagine a successful language in the next decade that doesn't naturally support networking and parallelism -- but I also have a difficult time understanding how document markup standards have been mistaken for data interchange formats or languages themselves.
Anyway, let's make a deal. I won't pull my hair out over the crap people impose on themselves if you don't pull your hair out over getting this project working.
Now that you have a schema that can do more than just save a simple string, see if you can get it to log someone in so that that user can save strings.
But logging in is something we haven't discussed. You have two options: create a "session" table to manage your concept of sessions, or use the Postgres session mechanism directly.
If you do your own session system then the session table references the user table (and a column that says when the user logged in, when the session expires, and when the user logged out -- a session is dead when "logged_out" is not null OR current_timestamp > session_expiry). Logging in can just check the saved password text against the provided password text.
If you use the Postgres session directly then you have to (indirectly) reference Postgres users. In this case its easiest to either create a database per user or a schema per user that is owned by them, and store that user's stuff in there. In the case of your application I think this is the easiest route and the most secure.
Your Postgres server is already able to talk to the network -- so the concept of "running locally" doesn't really apply. If your Python script (client code) can connect to Postgres over a network port, then the script can be run on a remote computer or the same computer -- there's no difference.
Write a Python script that can log in as a user, save some file data, read some file data back (or pull a list of file data rows and print them to the screen), and logout (close the session).
In other words, do the same thing you did before, but with more columns and tables being touched. Its just slightly more involved syntax than your original script experiment -- but this is where you start opening doors to more interesting functionality.
The primary way to attack the system will be to compromise the server where the files are stored (you'll only get encrypted data, but the directory structure will at least tell you what users are holding what files and how big they are). The other way to attack the system would be to compromise a client system and use it to pull data directly. The client-side attack is the most powerful, but it only gives access to that one user's data. If done correctly there will be no way to compromise the entire system at one time.
Then write more code in Python. This is a problem in every language, but it is a lot less problematic in Python than just about any other language I can think of (scheme aside). But consider whether its really syntax or a lack of knowledge of the built-in functions, the way class/function grammar works, or a lack of awareness about what modules are built into the core distribution.
If you're writing code instead of dreaming about how awesome your system is going to be* then syntax will really only be a problem for a day or so. After that you'll realize that you missed something here or there about how certain built-in functions work. After that you'll realize you need to read the definition of a list comprehension VS a generator (or whatever) more carefully, or review the grammar behind class or function defintions or something. But this is OK. If you can write and read even a little code every day then these issues will be generally behind you after a week. Need for library reference is eternal (or maybe I'm just a forgetful person) but the knowledge of how to navigate the core Python documentation makes this much easier. A lot of programming is learning how to combine a good, workable idea with "library science" skills knowing how to look up what you need to get the intended job done.
[* An easy trap to fall into... remember doing something si always better than doing nothing, so go write your idea, even if it barely runs, right when you have it -- you can then do the fun thinking part on top of an executable foundation, which is a huge leg up.]
Anyway... pull your simple "put a string into Postgres, get a string out" Python script you wrote before. Add to it so that it can add to the tables you created. You'll stumble first on a syntax issue, figure that out in a few minutes, then stumble on the awkward question of "how should my program use the schema, anyway?" Which will lead directly to the next things you will need to do to get your system running.
Paste your script here. I wanna see it.
Last edited by zxq9; October 7th, 2013 at 11:02 PM.
October 11th, 2013, 11:27 AM
Okay here goes,
I have been working on the script you asked for. I used some documentation etc.
The script is not connected to anything other than the database. It just runs in the python shell.
I can connect to the database okay. It can add a 'file' with the query at the bottom which did work fine.
However the little login system I created seems to fail when entering the password. It just prints 'failed login' even if the details input are correct?? I have no idea why. I have thought of everything possible without getting angry about it.
Now I tried to get into the sessions stuff you mentioned but I will admit to getting a bit confused. You talked about Postgres having its own session mechanism? I did a small amount of research into this but stopped at implementing to the script I have now.
It's nothing fancy but make of this what you will...
(I have put some asterisks on private data as this is on the internet.)
## login script for client side system
conn = psycopg2.connect(host="127.0.0.1",
print ("Database connected successfully")
print ("Please Login to continue")
cur = conn.cursor()
loop = 'true'
while(loop == 'true'):
username = input("username: ")
password = input("password: ")
if(cur.execute("SELECT * FROM users WHERE username='" + username + "' AND password='" + password + "'")):
print ("Failed to login")
print ("Sending File")
cur = conn.cursor()
cur.execute("INSERT INTO file (path,name,size,last_update,username) \
VALUES ('C:***/*****/Desktop/Goodbye.txt', 'Goodbye.txt', 10, '2013-10-11 14:00:00 +02:00', 'JoeyG' )");
print ("File Sent successfully");