#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Posts
    33
    Rep Power
    12

    program effeciency and data corruption


    first off, this is written in C. what it does is grab huge text files, like 1.5 million lines of text that is log files for when a user on unix uses a piece of software. if your familiar it is the log file that 'flexlm' produces. in my program i have a structure with components for username, timestamp, date, machine ect. then i create arrays of this structure for IN and OUT logs. these arrays are of size around 60,000. whats happening is when i parse the information into this structure everything is fine, but when i pass these arrays to other functions, to say match up like usernames and such, the data ends up getting corrupted. this doesnt happen when there is only like 200 entries in the array, only when they are very large. what causes this? is there a way to stop that from happening? would this be a use for pointers, instead of having the arrays global, pass pointers to the arrays? also would using pointers make the program run faster? when i have functions that have to keep scrolling through these huge arrays to match things up it takes a very long time. sorry for all the questions. this is my first C program that has to deal with such large amounts of data where time and other things become an issue.
    thanks
    Last edited by samlab; September 23rd, 2003 at 09:59 AM.
  2. #2
  3. I'm Baaaaaaack!
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Jul 2003
    Location
    Maryland
    Posts
    5,538
    Rep Power
    244
    We will have to see some code to diagnose your problem(s). Please enclose the code in "code" tags (see http://forums.devshed.com/misc.php?action=bbcode&s=). Please strip out any code that is not relevant to the problem and be sure it compiles and runs. Also provide a small subset of your input data. Off hand, I would speculate you have a memory overwrite somewhere and that is why your data is being corrupted.

    Pointers keep you from copying data, but since arrays are already pointers, you do not copy arrays when you pass them as arguments. Globals are sometimes required but usually a convenience to minimize the amount of function arguments. Just keep in mind that when you pass an object by pointer, it can be changed by the called function. To avoid that, make the pointer const, that way no changes are allowed (by the compiler, you can still violate this if you want to).

    Program speed and efficiency depend to a much larger extent on design and algorithm choice and to a much lesser extent on the choice of such low-level details (though I would use pointers in optimized code). Back with having a dual 256 MhZ computer with 256 MB of RAM was a hot thing (and back when the Alpha chip was something to crow about) I wrote a program to match up 15 GB of files against 27 million SSNs. I read in all the SSNs into RAM(conveniently sorted for me), indexed them (had to use some compression as well), then read each record from the various files (keep in mind that there was not enough disk space to store the input and output, I had to compress the input files!). It ran in 3 hours and was I/O bound (meaning it would never run any faster with the disks I was using to store the data). Compare that to an estimated 15 days for the same processing on an IBM using SAS and you can see that design can have a big impact on performance.

    My blog, The Fount of Useless Information http://sol-biotech.com/wordpress/
    Free code: http://sol-biotech.com/code/.
    Secure Programming: http://sol-biotech.com/code/SecProgFAQ.html.
    Performance Programming: http://sol-biotech.com/code/PerformanceProgramming.html.
    LinkedIn Profile: http://www.linkedin.com/in/keithoxenrider

    It is not that old programmers are any smarter or code better, it is just that they have made the same stupid mistake so many times that it is second nature to fix it.
    --Me, I just made it up

    The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.
    --George Bernard Shaw
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Posts
    33
    Rep Power
    12
    Its too much code to post. Ive gone through it and deleted everything that was not relevent to the problem. Ive also attached a sample of the file it reads in. If you put a watch on the first few IN and OUT index you will see what happens. As soon as in my main program it calls the GetAppUsage() function the data in the IN and OUT arrays gets corrupted. :confused: Pay special attention to the OUT. In oreder to attach the c file I had to copy it into a text file called test.txt.
    Thanks
    Sam
    Attached Files
    Last edited by samlab; September 24th, 2003 at 08:45 AM.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Posts
    33
    Rep Power
    12
    here is the file it reads in
    Attached Files
  8. #5
  9. I'm Baaaaaaack!
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Jul 2003
    Location
    Maryland
    Posts
    5,538
    Rep Power
    244
    At this point all I can say without a doubt is that you have an out-of-bounds memory write somewhere. I will continue to look at the code, but when I first compiled it, I got a crash when line ~147 executed ( fprintf(outfile, "****************** NUMBER OF USAGES BY APPLICATION *******************\n");
    ) because of a bad file pointer (outfile). When I moved the declaration of outfile from global to local to main(), the problem went away and the program completed. That does NOT mean the problem is fixed, only that the memory overwrite is now occuring somewhere else and is not causing the program to crash. More than likely you have an off-by-one error somewhere, C/C++ programmers do it ALL the time. You may also have a bad pointer somewhere that you are writing to when you shouldn't. I will tinker with it a bit, but these sorts of problems can be very difficult to find without helper software (such as Purify). One thing you can try, since you have a vested interest in getting it fixed in a timely manner, is put 'guard' values at the head and tail of each of your arrays (i.e., expand the arrays by two, then do your indexing from cnt=1 to cnt < max-1). Then, write a routine to dump the guard values (they can be any value, zero, all bits set to 1, whatever) at strategic parts of the program. When the guard values have been overwritten, grab a microscope and go over each line proceeding the overwrite. You can play this trick to bracket the actual line that is causing the problem, though it still may not be obvious what you did to cause it (indeed, the actual overwrite is likely to be occuring in a completely different array).

    Good luck!

    My blog, The Fount of Useless Information http://sol-biotech.com/wordpress/
    Free code: http://sol-biotech.com/code/.
    Secure Programming: http://sol-biotech.com/code/SecProgFAQ.html.
    Performance Programming: http://sol-biotech.com/code/PerformanceProgramming.html.
    LinkedIn Profile: http://www.linkedin.com/in/keithoxenrider

    It is not that old programmers are any smarter or code better, it is just that they have made the same stupid mistake so many times that it is second nature to fix it.
    --Me, I just made it up

    The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.
    --George Bernard Shaw
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Posts
    33
    Rep Power
    12
    thank you very much for looking at it. i am very frustrated with it at this point. i will try that trick you said about using array guards. i always seem to have stupid errors like this that are near imposible to find. where can i get that Purify program from?
    I never had a crash from the output file pointer, thats a little odd. just incase you may need to look at it ive attached the full program. its not that much longer than what i already posted, but since you said your getting this other error i was worried i may have edited somehting i shouldnt have. thanks again for looking at this.
    Attached Files
  12. #7
  13. I'm Baaaaaaack!
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Jul 2003
    Location
    Maryland
    Posts
    5,538
    Rep Power
    244
    These sorts of error manifestations are compiler (and compiler flag) and OS dependant. Change the layout of your code or size of your data structures and the type of error can change, disappear, blow apart, etc. It happens to everyone who does any serious C programming and it can be a nightmare to debug. Since it (apparently to yoru debugging efforts) randomly overwrites memory locations with random data, you spend all your time chasing down errors that are actually incendental to the real problem.

    The guard values are what most of these memory checkers do, they just do it without your having to change the code. I think Purify is by Sun and I am pretty sure it is not free. I can't think of the generic name for what Purify does, but if you google on it you may get on the track to some freeware (let me know if you find some). Another thing you can do is put bounds checkers everywhere you touch your arrays and throw an error if you are out of bounds. Just putting in the bounds checkers and/or the guard values can reveal where you are going over the bounds. Just remember: arrays start at 0 and run to maxvals - 1 (it bears repeating!).

    I probably won't get a chance to look at it today, I am still recovering from a trip to the Philippines and think I will be back in bed before long, so good luck with it.

    My blog, The Fount of Useless Information http://sol-biotech.com/wordpress/
    Free code: http://sol-biotech.com/code/.
    Secure Programming: http://sol-biotech.com/code/SecProgFAQ.html.
    Performance Programming: http://sol-biotech.com/code/PerformanceProgramming.html.
    LinkedIn Profile: http://www.linkedin.com/in/keithoxenrider

    It is not that old programmers are any smarter or code better, it is just that they have made the same stupid mistake so many times that it is second nature to fix it.
    --Me, I just made it up

    The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.
    --George Bernard Shaw
  14. #8
  15. I'm Baaaaaaack!
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Jul 2003
    Location
    Maryland
    Posts
    5,538
    Rep Power
    244
    Rather than attempt to diagnose your problem, I decided to give you some example code you can study that (presuming I didn't do anything boneheaded) works fine. Personally, I would do this in C++ and use maps and vectors. It would run slower and be less efficient, but if it ran twice as slow it would still take less than a second. I just did bits of things for you and showed you various different techniques, please don't take ANY of this a either good programming practice, efficient coding technique, or even a reasonable method for achieving this end. I am very familiar with bouncing around in blobs of dynamically allocated memory and this is very simple for me. You may find it is un-necessarily complicated.

    Enjoy!

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    
    #define TYPELOCATOR 20
    
    /*this will hold whatever data you want to pass around*/
    /*from routine to routine.  add or modify as you like*/
    struct DATA{
        int numRecords;
        int currentRecord;
        int numOUT;
        int numIN;
        int numDENIED;
        int numIGNORED;
        int recordLengthMax;
        char *myData;
        FILE *fin;
        FILE *fout;
    };
    
    /*by defining a routine as you declare it,
      as long as it is before you use it,
      means you don't need to do both*/
    
    /*this blasts through the input file and checks for the
      number of lines and the maximum length of a line.
      this is not an ideal way of doing this, you
      have to read through the file twice.  I leave
      it to you to come up with a better way*/
    void checkInputFile(struct DATA *myData){
        int currentRecordLength = 0, maxRecordLength = 0;
        int numRecords = 0;
        int inChar;
    
        while ((inChar = fgetc(myData->fin)) != EOF){
            if ((char)inChar == '\n'){
                numRecords++;
                if (currentRecordLength > maxRecordLength)
                    maxRecordLength = currentRecordLength;
                currentRecordLength = 0;
            }else{
                currentRecordLength++;
            }
        }
        /*this is because the final record may lack a linefeed*/
        if (currentRecordLength > maxRecordLength){
            maxRecordLength = currentRecordLength;
            maxRecordLength++;
        }
    
        myData->numRecords = numRecords;
        myData->recordLengthMax = maxRecordLength;
    
        return;
    }
    
    /*this could just as easily be done in-line with a program of this size */
    void processInput(struct DATA *myData){
        /*since this is not allocated as a 2d array, 
          the indexing into it requires some pointer math */
        int recordPointer = myData->currentRecord * myData->recordLengthMax;
    
        /*read in each line from the input file*/
        fgets(&myData->myData[recordPointer], myData->recordLengthMax, myData->fin);
    
        /*check for its type and count.
          NOTE:  this depends on the structure of relevant lines being 
          in this format:  <8 char time> <space> (ansyslmd) <space> <type>
          and may not really be valid 
          (example: "14:36:03 (ansyslmd) OUT: "ansys" chhwp@emd-sun74").
                                         ^^^^
          this is shown just for educational purposes 
          there are many other ways to do this */
        if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "OUT:", 4))
            myData->numOUT++;
        else if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "IN:", 3))
            myData->numIN++;
        else if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "DENIED:", 7))
            myData->numDENIED++;
        else
            myData->numIGNORED++;
    
        myData->currentRecord++;
        return;
    }
    
    
    /*this is actually a rather dumb way to do this, it entails blowing
      through the data sets once for each output.  I leave it as an
      exersize to come up with a better method*/
    void processOutput(struct DATA *myData){
        int i, recordPointer;
    
        fprintf(myData->fout, "****************** OUT: records ******************\n");
        for (i=0; i<myData->numRecords; i++){
            recordPointer = i * myData->recordLengthMax;
            if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "OUT:", 4))
                fprintf(myData->fout, "%s", &myData->myData[recordPointer]);
        }
    
        fprintf(myData->fout, "****************** IN: records ******************\n");
        for (i=0; i<myData->numRecords; i++){
            recordPointer = i * myData->recordLengthMax;
            if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "IN:", 3))
                fprintf(myData->fout, "%s", &myData->myData[recordPointer]);
        }
    
        fprintf(myData->fout, "****************** DENIED: records ******************\n");
        for (i=0; i<myData->numRecords; i++){
            recordPointer = i * myData->recordLengthMax;
            if (!strncmp(&myData->myData[recordPointer + TYPELOCATOR], "DENIED:", 7))
                fprintf(myData->fout, "%s", &myData->myData[recordPointer]);
        }
    
        return;
    }
    
    
    /*if the main() routine is placed last, you don't need separate
      declarations and definitions for your routines.  it is just
      a programming style to have main() at either the top or the
      bottom of the file.  please don't put it in a random location!*/
    int main(){
        struct DATA myData;
        int i;
        FILE *tmp;
        time_t time_start, time_stop;
    
        time(&time_start);
    
        /*always a REALLY good idea to zero out values
          objects created on the stack start with essentially
          random values.  if you created this as a global variable
          the compiler is supposed to ensure it is zeroed, but that
          is not a good habit to get into.*/
        myData.numRecords = 0;
        myData.currentRecord = 0;
        myData.numOUT = 0;
        myData.numIN = 0;
        myData.numDENIED = 0;
        myData.numIGNORED = 0;
        myData.recordLengthMax = 0;
        myData.myData = NULL;
        myData.fin = NULL;
        myData.fout = NULL;
    
        if ((myData.fin = fopen("input.txt", "r")) == NULL){
            fprintf(stderr, "Can't open input.txt!\n");
            exit(1);
        }
        if ((myData.fout = fopen("output.txt", "w")) == NULL){
            fprintf(stderr, "Can't open output.txt!\n");
            fclose(myData.fin);/*since you are exiting here, not really necessary*/
            exit(1);
        }
    
        /*get the number of records and maximum record length*/
        checkInputFile(&myData);/*pass as pointer*/
    
        /*allocate memory for data buffer (recordLengthMax+1 is for NULL character)*/
        myData.myData = (char *) malloc(sizeof(char) * myData.numRecords * (myData.recordLengthMax+1));
        if (myData.myData == NULL){
            fprintf(stderr, "Error attempting to allocate %d bytes of memory!  Exiting!",
                                    sizeof(char) * myData.numRecords * (myData.recordLengthMax+1));
            fclose(myData.fin);
            fclose(myData.fout);
            exit(1);
        }
        fprintf(stdout, "Allocated %d bytes of memory for input buffer.\n", sizeof(char) * myData.numRecords * myData.recordLengthMax);
    
        /*reset file pointer back to begining of file*/
        rewind(myData.fin);
    
        /*read in each line and check it for type*/
        for (i=0; i<myData.numRecords; i++){
            processInput(&myData);
        }
    
        /*dump out each line based on its type*/
        processOutput(&myData);
    
        /*a little trick to output to multiple file streams
          without having to duplicate code */
        time(&time_stop);
        for (i=0; i<2; i++){
            if (i == 0) tmp = stderr;
            else tmp = myData.fout;
    
            fprintf(tmp, "Statistics of the input data:\n");
            fprintf(tmp, "\tNumber of records:\t%d\n", myData.numRecords);
            fprintf(tmp, "\tMax length of records:\t%d\n", myData.recordLengthMax);
            fprintf(tmp, "\tNumber 'IN:':\t%d\n", myData.numIN);
            fprintf(tmp, "\tNumber 'OUT:':\t%d\n", myData.numOUT);
            fprintf(tmp, "\tNumber 'DENIED:':\t%d\n", myData.numDENIED);
            fprintf(tmp, "\tNumber 'IGNORED:':\t%d\n", myData.numIGNORED);
            fprintf(tmp, "\n");
            fprintf(tmp, "Seconds to process:\t%d\n", time_stop - time_start);
        }
    
        /* always a great idea to code your 'free's and closes
           the same time you do the allocations, opens*/
        free(myData.myData);
        fclose(myData.fin);
        fclose(myData.fout);
        return 0;
    }

    My blog, The Fount of Useless Information http://sol-biotech.com/wordpress/
    Free code: http://sol-biotech.com/code/.
    Secure Programming: http://sol-biotech.com/code/SecProgFAQ.html.
    Performance Programming: http://sol-biotech.com/code/PerformanceProgramming.html.
    LinkedIn Profile: http://www.linkedin.com/in/keithoxenrider

    It is not that old programmers are any smarter or code better, it is just that they have made the same stupid mistake so many times that it is second nature to fix it.
    --Me, I just made it up

    The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.
    --George Bernard Shaw
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Posts
    33
    Rep Power
    12
    thanks, i think i may have found the problem.

IMN logo majestic logo threadwatch logo seochat tools logo