sasomao

Phd.
Paris, France

MaplePrimes Activity


These are replies submitted by sasomao

Hi guys,

I reran 5 jobs on the same machine, and once again they all crashed at the same time. This time 3 output files ended with

Execution Stopped: Unhandled signal caught (UNKNOWN: 1)

while the other two output files were not written at that moment; instead, two core* files were created at the same time:

212976740 -rw-------  1 me invites     21968 août  27 11:59 out14.out
212976738 -rw-------  1 me invites     21968 août  27 11:59 out13.out
212976736 -rw-------  1 me invites     21968 août  27 11:59 out12.out
212976744 -rw-------  1 me invites  93229056 août  27 11:59 core.11163
212976743 -rw-------  1 me invites 109871104 août  27 11:59 core.11084

 

So the total is still 5. It really seems to be an issue with multiple jobs running at the same time.

Also, files like this one:

212976737 -rw-------  1 me invites 73911 août  27 10:42 .nfs00000000cb1c461000000e7

 

appear while the jobs are running 

Hi all,

 

I may have some new evidence.
Yesterday I ran the same file 5 times on the same computer (I usually use different computers), and all the jobs crashed with a core dump and the unknown error 1.

Here are the logs:

tail -n 1 psiPi4epsilonPi2M2.8eta0.25*/log.txt
==> psiPi4epsilonPi2M2.8eta0.253/log.txt <==
Beginning h21 and v4

==> psiPi4epsilonPi2M2.8eta0.25-4/log.txt <==
Beginning h21 and v4

==> psiPi4epsilonPi2M2.8eta0.25-5/log.txt <==
Beginning h22

==> psiPi4epsilonPi2M2.8eta0.25-6/log.txt <==
Beginning h31

==> psiPi4epsilonPi2M2.8eta0.25-first/log.txt <==
Beginning h31

==> psiPi4epsilonPi2M2.8eta0.25/log.txt <==
Beginning h22

So there does not actually seem to be one particular function that breaks the code. The cause must be something else.

What is interesting is that all these jobs failed in the same minute:

212976735 -rw------- 1 me invites 21963 aoû 27 01:39 out5.out
212976225 -rw------- 1 me invites 21963 aoû 27 01:39 out4.out
212976224 -rw------- 1 me invites 21963 aoû 27 01:39 out3.out
212976223 -rw------- 1 me invites 21962 aoû 27 01:39 out2.out
212976221 -rw------- 1 me invites 21961 aoû 27 01:39 out1.out

So it seems that the problem is not with any single job, but with the very fact of running multiple jobs at the same time. I will run tests today to check whether the failure is systematic when more than one job is launched on the same machine.

Is there anything that forbids launching (say) five jobs on the same PC and that could explain the crashes (let me remind you that the programs crashed after 5 hours)?
Should I set a particular option in my code for these multiple runs?

Thanks

Salvo 

Hi Duncan,

All these jobs are run on a Fedora 13 x86_64 machine. Also, what does "compiled" mean in this context?
The system administrator of our lab installed Maple from the DVD (I guess it was a bash installer), so no real compilation was performed.

I didn't build any library of my own. All the procedures I use are defined in the file itself.

And if the problem were the architecture, wouldn't it show up at EVERY run, rather than randomly?

Salvo

Hi Robert, all

I found your suggestion useful, and seeded my mpl file with log entries. Last night I left three jobs running, and two of them stopped with the unknown error 1.

Basically my program has two parts: a first part where I define a lot of procedures, and a second part where two nested for loops run the procedures defined above and append the output to some txt files.

Both crashed jobs were running the same procedure:

Beginning the for[27,2]   #### 27 and 2 are the values of the FORs indexes
Beginning p_network 
Letting p_network 
Beginning h1_network
 Letting p_network 
Beginning snr2 
Letting snr2 
Beginning low_fisher
 Letting low_fisher
 Beginning up_fisher
 Letting up_fisher 
Beginning crlb's calculations 
Letting crlbs 
Beginning h2 network 
Letting h2 network
 Beginning h3 network 
Letting h3 network 
Beginning h31
Letting h31 
Beginning h22 

for one, and 

Beginning the for[0,5]
Beginning p_network 
Letting p_network 
Beginning h1_network
 Letting p_network 
Beginning snr2 
Letting snr2 
Beginning low_fisher
 Letting low_fisher 
Beginning up_fisher 
Letting up_fisher
 Beginning crlb's calculations 
Letting crlbs 
Beginning h2 network 
Letting h2 network
 Beginning h3 network
 Letting h3 network
 Beginning h31
Letting h31 
Beginning h22  

for the other. So h22 seems to be the problem. I was very happy with that, but only for a few seconds.

I tried to reproduce the error in the graphical interface, xmaple (on a different PC), by copying the code and starting the for loops at the guilty values, 27 and 2, from the first run. But it simply works, without any problem. h22 is calculated and all is fine... The problem seems to happen randomly. How am I supposed to fix it if I cannot make it happen?
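
For reference, log entries of the kind shown above can be produced by a small wrapper instead of editing every procedure by hand. This is only a sketch: the names logmsg and logged are made up for illustration, the stand-in h22 is a dummy, and the file name log.txt is assumed from the earlier logs. Each entry is written and the file closed immediately, so the last line is on disk even if the kernel dies right afterwards.

```maple
# Hypothetical helper: append one line to the log and close the file at once,
# so the entry is flushed before the next (possibly fatal) step starts.
logmsg := proc(msg::string)
    local fd;
    fd := fopen("log.txt", APPEND, TEXT);
    fprintf(fd, "%s\n", msg);
    fclose(fd);
    NULL;
end proc:

# Hypothetical wrapper: returns a procedure that logs before and after calling f.
logged := proc(label::string, f::procedure)
    proc()
        local r;
        logmsg(cat("Beginning ", label));
        r := f(args);                      # forward all arguments unchanged
        logmsg(cat("Letting ", label));
        r;
    end proc;
end proc:

# Dummy stand-in for one of the real procedures.
h22 := proc(x) x^2 end proc:

# eval() passes the procedure itself rather than the name h22,
# so the wrapper does not end up calling itself.
h22 := logged("h22", eval(h22)):
h22(3);
```

With this, a log that ends with a "Beginning" line and no matching "Letting" line points at the procedure that was running when the job stopped, which is the same information as in the logs above.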

Hi Acer,

The code is about 2000 lines long now (although I guess I could remove all the comments and some indexing functions, bringing it below 1500 lines).

Would it be worth posting it?

Salvo

Hi all,

The problem is still there: today two jobs stopped after 7-8 hours, with this mysterious unknown signal and a core file.

I tried to have a look at a core file. Most of it is binary garbage, but some lines are in English. Among them was this strange warning:

 

machine is big endian but maple was not compiled so

 

buried among a lot of binary symbols. I don't know whether this is related to the problem.

 

Btw, I'm trying to run the program with printlevel=10, but this messes up all my output files, which are created in the mpl file using commands like:

appendto(cat(new_dir_name,"/crlb-seq-theta.txt")):
printf("crlb_theta=[ ");

Unfortunately, with printlevel set, the created file is not what it should be: it contains a lot of printlevel lines. Is there a way to avoid that?
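
One possible way around this, given here only as a sketch: appendto redirects the whole default output stream, so whatever printlevel prints ends up in the same file. Writing through an explicit file descriptor with fopen and fprintf sends only the chosen text to the file. The value of new_dir_name and the number 0.5 below are placeholders for illustration, not values from the real program.

```maple
# Placeholder directory; the real script sets new_dir_name itself.
new_dir_name := ".":

# Tracing output goes to the default output stream, not to the data file.
printlevel := 10:

# Open the data file explicitly instead of redirecting all output with appendto.
fd := fopen(cat(new_dir_name, "/crlb-seq-theta.txt"), APPEND, TEXT):

# Only what is passed to fprintf(fd, ...) is written to the file.
fprintf(fd, "crlb_theta=[ "):
fprintf(fd, "%a ]\n", 0.5):    # 0.5 is a dummy value

fclose(fd):
```

The printlevel trace can then be sent to a separate file (for instance by redirecting the output of the batch job), so the data files stay clean.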

 

Thanks

Salvo

Hi,

OK about the delay, I understand. I'm a Linux user, and I browse with Chrome.

I cleared the cache, but I still don't see my answer in my older thread. I did not understand what you said about replies not going to the top of the stack. I went to my older thread, pressed "reply" on the last message there, and replied. If the thread doesn't go to the top of the latest questions, how can people reply to it?

Thanks

Salvo

Hi All,

I'm digging up this old thread of mine because I still have the same problem with this UNKNOWN 1 error.

 

I admit that I have not understood what assertlevel does. I tried setting it to 2, but I don't see any difference in a small test program I wrote for the purpose.

The problem with "trace" is that I don't know which function is crashing; the error message doesn't say. I cannot trace all the functions, there are a lot of them (the program is 1800 lines long).
The same goes for infolevel: I would have to modify all the procedures.

The problem with printlevel is that it prints a huge amount of output, and my garbage file becomes too big and unreadable.

Is there a way to find out in which procedure the error begins, before starting the real debugging?

 

Thanks

Salvo

Hi Axel, hi all

How should I modify the integration routine to get an even more precise answer?

 

Thanks

Salvo

Thanks!