#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    48
    Rep Power
    1

    Multi-thread file reading


    I expect this will be a hard question, so you don't have to reply today.
    I'm creating a program that uses multiple threads to read from a file, manipulate the data, and put it into another file in the same order in which it was read.
    I'm not certain how to do this at all. You see, one thread reading from the file would not be able to keep up with several worker threads (several meaning an unspecified number, limited only by the number of CPUs/cores).
    Now, if I had my threads read from the file themselves, they would need to either:

    1: Lock it.
    or
    2: Read and get the position in the file at the same time.

    (By the way, I am doing low-level file descriptor reading and writing.)
    Now, if I locked the file for reading, I would also have to lock the file for writing at the same time, which might not work.
    Even if it did, it would eventually degenerate into a "lock the files" war, like running a half-duplex network at over 2/3 load.
    The second option therefore seems better, but how do you tell the compiler to execute the tell/seek functions at the same time as the read?
    I've brainstormed as best I can but can't seem to figure out a good method of doing this.
    I'm running GCC 4.7.x
    Don't bother with cross-platform stuff; I'm not interested in it, at least currently.
    :confused:
    As always, thanks for your insight.
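On the second option ("read and get the position at the same time"): POSIX provides pread(), which takes the file offset as an argument and does not move the descriptor's shared file position, so no separate seek/tell step is needed. The following is a minimal sketch, assuming a POSIX system; the names read_chunk_at and demo_read_second_chunk are hypothetical and not from the thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes starting at absolute offset `off`.
 * pread() leaves the descriptor's file position untouched, so several
 * threads can call this on the same fd without a lock.
 * Returns the number of bytes read (short only at end of file), or -1. */
ssize_t read_chunk_at(int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;      /* errno was set by pread() */
        }
        if (n == 0)
            break;          /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Usage sketch: write ten bytes to a temporary file, then fetch the
 * 4-byte chunk that starts at offset 4. Returns 0 on success. */
int demo_read_second_chunk(void)
{
    FILE *tmp = tmpfile();
    char buf[4];
    int ok;

    if (tmp == NULL)
        return -1;
    ok = write(fileno(tmp), "abcdefghij", 10) == 10
      && read_chunk_at(fileno(tmp), buf, sizeof buf, 4) == 4
      && memcmp(buf, "efgh", 4) == 0;
    fclose(tmp);
    return ok ? 0 : -1;
}
```

With a fixed chunk size, each thread could compute its own offset as chunk number times chunk size, and the chunk number then doubles as the ordering key for the output. pwrite() is the matching call for writing at a given offset.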
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,709
    Rep Power
    480
    Split the file, spread the parts among multiple disks, process them with multiple processes, and cat the results into a single output. Let the operating system deal with the multiple processes.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    May 2004
    Posts
    3,417
    Rep Power
    886
    Your compiler is almost irrelevant for this topic. What file system(s) are we talking about here? Some file systems support record or range locking, some don't. Some operating systems provide a level of transaction support.

    One design approach might be to implement an in-memory read-ahead cache that one thread feeds while the other threads process the data. You'd need some coordination between the processing threads and the reader so that the reader knows which source data it can discard.
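One way such a read-ahead cache might look is a small ring of chunk slots guarded by a mutex and two condition variables. This is only a sketch under assumptions: a single reader thread, POSIX threads (compile with -pthread), and hypothetical names and sizes (struct cache, cache_init, reader_main, cache_take, CHUNK_SIZE, SLOTS). A slot counts as discardable as soon as a worker has copied it out, which is the coordination mentioned above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define CHUNK_SIZE 4096   /* hypothetical chunk size */
#define SLOTS 8           /* hypothetical read-ahead depth */

struct slot { size_t index; size_t len; char data[CHUNK_SIZE]; };

struct cache {
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
    struct slot ring[SLOTS];
    size_t head, count, next_index;
    int eof, fd;
};

void cache_init(struct cache *q, int fd)
{
    assert(fd >= 0);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
    q->head = q->count = q->next_index = 0;
    q->eof = 0;
    q->fd = fd;
}

/* Reader thread: waits for a free slot, fills it, publishes it.
 * A read error is treated like end of input in this sketch. */
void *reader_main(void *arg)
{
    struct cache *q = arg;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == SLOTS)
            pthread_cond_wait(&q->not_full, &q->lock);
        /* The tail slot is free; workers only touch published slots. */
        struct slot *s = &q->ring[(q->head + q->count) % SLOTS];
        pthread_mutex_unlock(&q->lock);

        ssize_t n = read(q->fd, s->data, CHUNK_SIZE);

        pthread_mutex_lock(&q->lock);
        if (n <= 0) {
            q->eof = 1;
            pthread_cond_broadcast(&q->not_empty);
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        s->len = (size_t)n;
        s->index = q->next_index++;   /* sequence number for ordering */
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }
}

/* Worker side: copy the oldest cached chunk out and free its slot.
 * Returns 1 if a chunk was taken, 0 once the input is exhausted. */
int cache_take(struct cache *q, struct slot *out)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->eof)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int got = q->count > 0;
    if (got) {
        *out = q->ring[q->head];
        q->head = (q->head + 1) % SLOTS;
        q->count--;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}
```

Each worker would loop on cache_take() and use the returned index to put its result back in the original order. The reader blocks when all slots are full, so memory use stays bounded regardless of file size.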
  6. #4
  7. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,595
    Rep Power
    4207
    1. Declare two mutexes

    2. Write a read_chunk() function that locks the first mutex, then reads a chunk from the input file and increments a read chunk counter. This function should also record the file offset that it read the data from (either using lseek(fd, 0, SEEK_CUR), the file-descriptor equivalent of tell(), or a static variable) and return that along with the chunk. Before returning the data, it should unlock the first mutex. Call this function from all the threads to acquire chunks of data. That way you have a thread-safe function that can be used to allocate data to various threads.

    3. Write a fill_output_chunk() function. This should lock the second mutex and copy the chunk as well as its output offset to an output buffer. It should also increment a write chunk counter. This function should unlock the second mutex when done. This function should be called by the various threads when they are done processing the chunk they were given.

    4. When read_chunk_counter == write_chunk_counter and our threads are all done, our output buffer is ready to be written. So we just go through the chunks in the output buffer and write them out in order of the offsets.
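The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: it uses POSIX threads (compile with -pthread), it uses the read counter as the ordering key instead of the byte offset, and the names process_file, worker, CHUNK_SIZE, MAX_CHUNKS and MAX_THREADS, as well as the uppercase conversion standing in for the real processing, are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE 4096      /* hypothetical chunk size */
#define MAX_CHUNKS 65536     /* hypothetical cap; larger inputs are rejected */
#define MAX_THREADS 64       /* hypothetical upper bound on workers */

struct chunk { size_t len; char data[CHUNK_SIZE]; };

static pthread_mutex_t read_mutex = PTHREAD_MUTEX_INITIALIZER;   /* mutex 1 */
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;  /* mutex 2 */
static int in_fd, io_error;
static size_t read_chunk_counter, write_chunk_counter;
static struct chunk *output[MAX_CHUNKS];   /* output[i] holds the i-th chunk read */

/* Step 2: read the next chunk and take its sequence number under mutex 1. */
static struct chunk *read_chunk(size_t *index)
{
    struct chunk *c = malloc(sizeof *c);
    ssize_t n = -1;

    pthread_mutex_lock(&read_mutex);
    if (c != NULL && !io_error)
        n = read(in_fd, c->data, CHUNK_SIZE);
    if (n > 0 && read_chunk_counter == MAX_CHUNKS)
        n = -1;                      /* more input than the table can hold */
    if (n > 0)
        *index = read_chunk_counter++;
    else if (n < 0)
        io_error = 1;
    pthread_mutex_unlock(&read_mutex);
    if (n <= 0) {
        free(c);
        return NULL;                 /* end of input or error */
    }
    c->len = (size_t)n;
    return c;
}

/* Step 3: file the processed chunk under its sequence number (mutex 2). */
static void fill_output_chunk(struct chunk *c, size_t index)
{
    pthread_mutex_lock(&write_mutex);
    output[index] = c;
    write_chunk_counter++;
    pthread_mutex_unlock(&write_mutex);
}

/* Each worker pulls chunks until the input runs out. */
static void *worker(void *arg)
{
    size_t index;
    struct chunk *c;

    (void)arg;
    while ((c = read_chunk(&index)) != NULL) {
        for (size_t i = 0; i < c->len; i++)   /* placeholder processing */
            c->data[i] = (char)toupper((unsigned char)c->data[i]);
        fill_output_chunk(c, index);
    }
    return NULL;
}

/* Step 4: once all threads are done and the counters agree, write the
 * chunks out in the order they were read. Returns 0 on success. */
int process_file(int in, int out, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    in_fd = in;
    io_error = 0;
    read_chunk_counter = write_chunk_counter = 0;
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, worker, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    int rc = (started > 0 && !io_error &&
              read_chunk_counter == write_chunk_counter) ? 0 : -1;
    for (size_t i = 0; i < read_chunk_counter; i++) {
        struct chunk *c = output[i];
        for (size_t done = 0; rc == 0 && done < c->len; ) {
            ssize_t n = write(out, c->data + done, c->len - done);
            if (n < 0)
                rc = -1;
            else
                done += (size_t)n;
        }
        free(c);
    }
    return rc;
}
```

As in step 4, the whole processed file sits in memory until the threads finish. A caller would open the two descriptors and call something like process_file(in, out, 4).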
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    48
    Rep Power
    1



    Originally Posted by Scorpions4ever
    1. Declare two mutexes

    2. Write a read_chunk() function that locks the first mutex, then reads a chunk from the input file and increments a read chunk counter. This function should also record the file offset that it read the data from (either using tell() or a static variable) and return that along with the chunk. Before returning the data, it should unlock the first mutex. Call this function from all the threads to acquire chunks of data. That way you have a thread safe function that can be used to allocate data to various threads.

    3. Write a fill_output_chunk() function. This should lock the second mutex and copy the chunk as well as its output offset to an output buffer. It should also increment a write chunk counter. This function should unlock the second mutex when done. This function should be called by the various threads when they are done processing the chunk they were given.

    4. When read_chunk_counter == write_chunk_counter and our threads are all done, our output buffer is ready to be written. So we just go through the chunks in the output buffer and write them out in order of the offsets.
    OGH!!!! That's Clever!! :D
    No I/O fights, because there's a pause in between disk accesses. A little more memory use, though (still, I expected to have to create a buffer of some sort). Now to invent a chunk method....

    split the file, spread it among multiple disks, process the parts with multiple processes, cat the results to a single output. Let the operating system deal with the multiple processes.
    That would work, but it would make a copy of the file (if you kept the original, of course). For small files that might not be a problem, but for large files it would use a lot of access time and I/O. I intend to make my program work with any size of file.

    I had hoped to keep it file system independent.
    I think that pretty much answers my questions.
