Parallel loop possibly starting duplicate thread?

visual-studio
(pthompson) #1

I moved to build 10.0.0.2391 recently and I have a piece of problem code that was deployed last Friday. About 4 runs later (twice daily batch processing), this issue seems to have arisen. We’ve run this processing load twice daily for the last 3 years and not had this problem reported before - and the users would be very aware had it happened.

What I seem to be seeing is a parallel loop executing twice for a given item in a collection :-

	for parallel lLoopA: Integer := 0 to cSiteMaster.Items.Count - 1 do
	begin
		var lSiteMasterItem: KKDSiteMasterItem;

		lSiteMasterItem := cSiteMaster.Items[lLoopA];
		PutLog(String.Format('Starting : {0}', [lSiteMasterItem.Description]));

The PutLog method is a straightforward text logger to the console and a text file :-

class method Program.PutLog(const aValue:String);
begin
	Console.WriteLine(aValue);
	cLogFile.WriteLine(String.Format('{0} : {1}',[DateTime.Now.ToString('dd MMM yyyy HH:mm:ss'),aValue]));
end;

cSiteMaster is a class level object that has a list of items within it. Once loaded from a database (which happens earlier in the process), that list is not sorted, or accessed anywhere else, other than to iterate through the list and process each “site”.

For some reason the second run today logged two “Starting :” entries for the same entity. The “Bristol” site is also the first in the list.

15:42:30 Starting : Bristol
15:42:30 Starting : Bristol
15:42:50 Progress 11.87% ( 1,496 of 12,600 for Process Bristol)
15:43:12 Progress 51.78% ( 6,524 of 12,600 for Process Bristol)
15:43:32 Progress 87.10% (10,974 of 12,600 for Process Bristol)
15:43:52 Progress 95.92% (12,086 of 12,600 for Process Bristol)
15:43:52 Finished : Bristol

The subsequent logging shows no signs however of the duplicate running through the processing. In theory that could be my code not picking up an exception, but it should do - and has done in the past when issues have arisen inside parallel code.

All other sites in the list process normally and have reported their correct outputs. The “Bristol” site however produced a mixed set of results that I cannot explain other than data corruption in the processing i.e. data is potentially being lost / overwritten due to the dual processing of the site.

A subsequent run of the process did not show the problem, and neither did the 3 runs prior to the problem run, all of which used the .2391 build. It does seem like a big coincidence that this has not shown up in the 3 past years though with the code running twice daily, every day.

Any thoughts on whether this may be related to the .2391 build and use of “parallel” ?

0 Likes

(marc hoffman) #2

I don’t believe anything has changed here recently. What build did you use before?

Can you reproduce this in a smaller testcase, or only in the context of your application? Can you try logging the actual counter, as well? Also, I assume you are positive that the actual collection does not change while the loop is running (or just contains two similar/identical entries)? Just like a regular non-parallel for loop, the parallel for loop evaluates the Count once, before starting the first loop run.

I assume the Starting, Progress and Finished messages come from within a single loop iteration, correct? Yet I understand you only see the first message duplicated, but you don’t actually see a “full run” for that duplicate loop (ie no second “Finished : Bristol”), correct?

Also, is cLogFile.WriteLine itself thread safe (ie does it use locking, to avoid two threads writing to the file at the same time)?

0 Likes

(pthompson) #3

Thanks for the reply Marc. If you’re not aware of any obvious issues with that release, I won’t pursue it at the moment. It does seem like too much of a coincidence that I’ve just started deploying code with that build though… Prior to .2391, I would have been using 8.1.83.1751 for a very long time.

It does seem suspicious that I’m not seeing any other logging for that phantom thread, which I was expecting given the two “Starting” messages. However, I get just the messages you’ve seen, which look to be from a single instance of the process for “Bristol”, i.e. the counters all head towards completion with no second phantom thread jumping backwards or interrupting the sequence. I do get all messages for the other 15 or so items in the container correctly. Only one “Starting” for each, and they all process through with no sign of an issue. The program successfully runs to completion.

I’m assuming the cLogFile.WriteLine is thread safe - I’ve not had any issues with it. It is just a class level System.IO.StreamWriter, with AutoFlush set to true.

And to confirm, nothing touches or otherwise accesses that cSiteMaster list once it has been loaded - which has well and truly completed before we get to the processing part. That list does get loaded with “async”, along with a number of other pieces of data, but completion must happen before progressing to processing. The loaded record counts are reported on completion of each object being populated. The cSiteMaster one, along with others, must also complete before some subsequent ones are loaded, i.e. there are dependencies that mean it must be loaded first. So I’m as sure as I can be that the data load is completed and there are no other accessors of the data once it is loaded.

As for being able to reproduce it, I really doubt it. If I could, I’d be trying that before posting here. It’s a fairly complex process and does have to do thread safe access to a few things (with locking). I can readily make it non parallel, but then it will take about 45 minutes to run instead of the 4 it currently does (I really do like the parallel processing) and I do tend to do that for debugging other areas when necessary.

As I’ve jumped from build .1731 to .2391, I suppose it in theory could be something anywhere in that sequence and not specifically in .2391.

If there’s nothing jumping out from your side Marc, I’m happy to leave it there for now and see if it repeats in the coming days / weeks. I’ll use this thread as a marker for the future if necessary.

0 Likes

(marc hoffman) #4

Note though that when I say “recent changes” I don’t necessarily mean going back three years ;). A lot has changed in all areas of the compiler since 8.1. Just FTR.

Assuming and thread safety should never go into the same paragraph ;). I suggest putting a `locked on typeOf(self)` on `Program.PutLog` and seeing if that makes any difference. Then I’d start looking for any other evidence that the code actually runs twice (ie what “processing” does this actually do; does the actual output look like anything happened twice?).

also maybe add something like

var i := 0;
for parallel lLoopA: Integer := 0 to cSiteMaster.Items.Count - 1 do begin
  Interlocked.Increment(var i);

to the code. After the loop, is i the expected count, or is it higher?
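A slightly fuller sketch of the same check (the variable name, the elided body and the failure log line are illustrative, not the real code):

var lCounter := 0;
for parallel lLoopA: Integer := 0 to cSiteMaster.Items.Count - 1 do begin
  Interlocked.Increment(var lCounter);
  // ... existing per-site processing ...
end;
// after the loop, the counter should match the item count exactly
if lCounter <> cSiteMaster.Items.Count then
  PutLog(String.Format('Loop ran {0} times for {1} items', [lCounter, cSiteMaster.Items.Count]));

If the counter ever comes out higher than the item count, you have hard evidence the loop body genuinely ran twice for some item.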

0 Likes

(Patrick Lanz) #5

Hello,
I don’t know what type of application you have or how the application is started, but I had something like this in a web site, where a background thread is started once each week and it is tested when the application starts.
The culprit was IIS, which sometimes recycles the application just after it has started, because of memory usage.
I suggest that you add an entry in the log with an indication that the web service has started.

0 Likes

(pthompson) #6

Thanks Marc.

Nothing unusual to report from the overnight run, all worked as expected. I’m guessing this will be an infrequent to very infrequent issue.

I’ll implement your suggestions, particularly the Interlocked.Increment. That should give me a fairly robust indicator that the loop is misbehaving. I’ll set the exit code from the application to indicate failure if the counter is not equal to the collection count after the processing. The thread safety of the PutLog - yes, I’ll do that too, though I’m a little more sceptical about that one. If we assumed it was misreporting due to being non thread safe, fair enough, but that would largely just mean that the message reported was a duplicate / echo from another thread. My simple mind says the actual processing going on shouldn’t be impacted by that. Given that the reported “Bristol” instance was also the one with problematic outputs, it again seems a bit of a stretch.

On the wider point about evidence that it’s done the processing twice. Well, impossible to tell now, as the code has run in production twice since and the data has been refreshed (successfully). The processing that goes on is not rocket science (it’s logistics / stock planning), but trying to work out what had actually happened would be like trying to reverse engineer your omelette to find out what the chicken was thinking when it laid the egg. There were thousands of data points affected (largely missing, it seems), and I don’t think I could stand back far enough to get a perspective on what the commonality / route through the code would have been to get to that state. If the problem readily repeated then maybe, but even then, the parallel nature of this makes it different every run.

Anyway, I’ll implement the modifications and see if that picks up anything in the coming days / weeks.

Thanks again.

0 Likes

(pthompson) #7

Hi Patrick. Thanks for the reply, but this is not IIS related. The application is a straightforward batch data processing console application, launched on a schedule from SQL Server. The application has been running twice a day for about 3 years now and has had a few issues (some related to threading and elements of my code not being thread safe), but I’ve never seen this one before. It does just seem somewhat of a coincidence that this has arisen 4 runs into delivering code built with a big jump in the Elements compiler.

0 Likes

(pthompson) #8

Marc,

I think this boils down to a case of mea culpa. One more piece of information came to light this morning: specifically, one more site reported some anomalies. When I looked at some of the data left behind by the process, it did look like records for two “sites” were being lost. This triggered a memory from around a year back that we had seen this before. Talking to a colleague at the client, he also remembered it. At the time I think we understood the issue and filed it in the “lucky to have it happen again in the next X years / more likely to win the lottery” category. Well, I think we won the lottery with the run yesterday.

Specifically (and ironically), it probably boils down to non thread safe access of a global resource. At the end of each site thread’s processing, the results from that thread are added into a global list. That global list is a “List” and the .Add method is used to add in the thread’s list of produced data. I think this is a simple case of two threads finishing within a few milliseconds of each other and clashing when dumping their results into the global list (the global list is then ultimately BCP’d onto the database for speed).

locking cLocationPickList do
	for each lLocationPickItemA: KKDLocationPickItem in lLocationPickList do
		cLocationPickList.Add(lLocationPickItemA);

So cLocationPickList is the global list and lLocationPickList is the thread-specific local variable into which a site’s data is loaded. In the above I’ve now added the “locking”, which I really should have done 6-9 months back when we saw what we now believe is the same issue. Hence the mea culpa…
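As an aside, assuming cLocationPickList is a plain generic List of KKDLocationPickItem, the per-item loop under the lock could equally be written as a single AddRange call, which copies the same items while holding the same lock:

locking cLocationPickList do
	cLocationPickList.AddRange(lLocationPickList);

Either form is fine; the important part is that the lock covers the whole copy so two finishing threads can no longer interleave their Adds.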

I’ve added in

Interlocked.Increment(var lSiteCheckCounter);

as a safety check for comparison against the site item list count, but can you clarify the “locked on” for the “PutLog” ?

class method Program.PutLog(const aValue:String); locked on typeOf(Self);

I had thought I understood the definition needed to be like the above, obviously not, as that generates a compiler error. As we did get the two “Starting” messages for one Site yesterday, there is obviously still something amiss there and I should make sure that is thread safe as you suggested.

0 Likes

(marc hoffman) #9

Cool. :crossed_fingers: that that’s it!

Ouch, yeah. That’s the kind of stuff you definitely want locked.

That should be fine, but you’ll need to put the locked... directive on the declaration, not on what looks like the implementation.

It might be you need to drop the typeOf() around self, for a static/class method (my bad).

0 Likes

(pthompson) #10

I’ve switched the “locked on” to be on the declaration and not the implementation and it now compiles and runs. I should satisfy myself that it is locking and produce a little test program to give the “locked on” a good stress test (a dozen threads all constantly writing to it), but that’s one to add to the to do list for now.

I didn’t need to remove the typeof :-

type
  Program = class
  public
      class method PutLog(const aValue:String); locked on typeOf(Self); 

All looks OK so far, so we’ll see how it performs from here on in.

Thanks again Marc.

0 Likes

(marc hoffman) #11

I’m not sure, TBH. It could be that on a static method typeOf() goes one level too high (ie not the type of the instance, but the type of the metadata type, which would then lock on System.Type). It’s still safe, but this would lock globally with other (static) places that also use locked on typeOf(self) :wink:

An alternative for performance (if 100% linearity isn’t a concern) might be to mark PutLog as async, and lock inside it. That way, PutLog won’t block the actual execution.

But I’d only go there if the lock does turn out to become a performance issue.
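Roughly what I mean, as a sketch (using the existing cLogFile from earlier in the thread as the lock object; the exact shape is up to you, and note the timestamp would then be taken when the queued call actually runs, not when PutLog was called):

class method Program.PutLog(const aValue: String); async;
begin
	// the caller returns immediately; the body runs in the background
	locking cLogFile do begin
		Console.WriteLine(aValue);
		cLogFile.WriteLine(String.Format('{0} : {1}', [DateTime.Now.ToString('dd MMM yyyy HH:mm:ss'), aValue]));
	end;
end;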

0 Likes

(pthompson) #12

I’ll try and remember the distinction, thanks.

I see what you’re saying, and for my purposes I doubt it will ever be an issue, as the logging is inconsequential in comparison to the processing; even if a thread (or threads) has to wait a millisecond for another thread to release the lock, it won’t make any difference. It would be interesting to see the effect of using async instead, though. I would have thought the effect would still be linear output in time order? The program execution would not be halted, just a potential queue of PutLog calls awaiting execution whilst the processing continues? Would that “queue” of async PutLog executions actually get executed in order of submission from the threads, or would they be executed “randomly”? In the former case, a linear time output would still be maintained even as the program execution continued without waiting for the PutLogs to return. In the latter, I’d expect to see a mixing of the sequence - some earlier times appearing in the log after later times, etc.?

All this is bringing back vague student memories of writing a multi threaded disc access simulation, implemented with queues in Ada.

0 Likes