April 26th, 2013, 12:47 PM
Multi-thread file reading
I expect this will be a hard question, so you don't have to reply today.
I'm creating a program that uses multiple threads to read from a file, manipulate the data, and put it into another file in the same order in which it was read.
I'm not certain how to do this at all. You see, a single thread reading from the file would not be able to keep up with several processing threads ("several" meaning an unspecified number, limited only by the number of CPUs/cores).
Now, if I had my threads read from the file they would need to either:
1: Lock it.
2: Read and get the position in the file at the same time.
(by the way I am doing low level file descriptor reading and writing)
Now, if I locked the file for reading, I would also have to lock it for writing at the same time, which might not work.
Even if it did, it would eventually degenerate into a "lock the file war," like a half-duplex network running at over 2/3 load.
The second option therefore seems better, but how do you tell the compiler to execute the tell/seek and the read as a single atomic operation?
I've brainstormed as best I am able but can't seem to figure out a good method of doing this.
I'm running GCC 4.7.x
Don't bother with cross-platform stuff; I'm not interested in that, at least currently.
As always, thanks for your insight.
April 26th, 2013, 01:47 PM
Split the file, spread it among multiple disks, process the parts with multiple processes, and cat the results to a single output. Let the operating system deal with the multiple processes.
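That pipeline can be sketched in shell; here `tr` stands in for whatever transformation the real program applies to each chunk, and the input file is made up:

```shell
#!/bin/sh
# make a sample input (stand-in for the real file)
printf 'aaa\nbbb\nccc\nddd\n' > input.txt

# split into line-based chunks (2 lines each here), one per worker
split -l 2 input.txt part_

# process each chunk in its own background process;
# 'tr a-z A-Z' is a placeholder for the real data manipulation
for f in part_*; do
    tr a-z A-Z < "$f" > "$f.out" &
done
wait

# chunk names sort lexically, so cat reassembles them in original order
cat part_*.out > output.txt
rm -f part_* input.txt
cat output.txt    # prints AAA BBB CCC DDD, one per line
```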
April 26th, 2013, 01:57 PM
Your compiler is almost irrelevant for this topic. What file system(s) are we talking about here? Some file systems support record or range locking, some don't. Some operating systems provide a level of transaction support.
One design approach might be to implement an in-memory read-ahead cache that one thread fills while the other threads process the data. You'd need some coordination between the processing threads and the reader so that it knows which source data it can discard.
April 26th, 2013, 02:04 PM
1. Declare two mutexes
2. Write a read_chunk() function that locks the first mutex, then reads a chunk from the input file and increments a read chunk counter. This function should also record the file offset that it read the data from (either using tell() or a static variable) and return that along with the chunk. Before returning the data, it should unlock the first mutex. Call this function from all the threads to acquire chunks of data. That way you have a thread safe function that can be used to allocate data to various threads.
3. Write a fill_output_chunk() function. This should lock the second mutex and copy the chunk, as well as its output offset, to an output buffer. It should also increment a write chunk counter. This function should unlock the second mutex when done. This function should be called by the various threads when they are done processing the chunk they were given.
4. When read_chunk_counter == write_chunk_counter and our threads are all done, our output buffer is ready to be written. So we just go through the chunks in the output buffer and write them out in order of the offsets.
April 26th, 2013, 02:26 PM
OGH!!!! That's Clever!!
Originally Posted by Scorpions4ever
No I/O fights, because there's a pause between disk accesses. A little more memory use, though (still, I expected to have to create a buffer of some sort). Now to invent a chunk method....
That would work, but it would make a copy of the file (if you kept the original, of course). For small files that might not be a problem, but for large files it would use a lot of access time and I/O. I intend to make my program work with files of any size.
I had hoped to keep it file system independent.
I think that's my question pretty much answered.