Note: This discussion is about an older version of the COMSOL Multiphysics® software. The information provided may be out of date.


COMSOL Cluster


Hello,

I've been trying to get COMSOL to run on our Linux cluster to solve some larger models (ideally, 6-12 MDoF). I'm at a point, however, where I could use some advice.

Any model I can solve on one computer I can also currently solve using more than one node (so far I've tried up to 20 of the 30 available nodes). The problems come when I try the larger models I'm really shooting for.

I'll list the things that have me puzzled in order of how straightforward they are (I think). I'm using COMSOL 4.1 on a Linux cluster. The machines have identical hardware (specifically, 16 GB of RAM each).

1. In my choice of solvers, is PARDISO usable only in a shared-memory setting? Reading other posts here, it comes up in the cluster context, but I'm not sure whether that's OK. As far as I knew, COMSOL says to use only MUMPS and SPOOLES for cluster (distributed-memory) runs.

2. I once solved a problem with 1.7 MDoF on one machine. It took about 48 hours, but it completed nonetheless (I'm fine with waiting that long). I tried a similar model with 3.7 MDoF on 20 nodes, but to no avail. The solvers (I tried both MUMPS and SPOOLES) get close to finishing, around 90% completion, and then the log file starts printing some confusing MPI errors (exit status of rank 18: return code 13....rank 17 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks).

That ece005.ece.cmu.edu is the node I launch the server on. It always seems to be more active than any of the other compute nodes, and it seems to be the troublemaker among the group: it is the one the log lists as what I assume to be the root cause of the job quitting.

Also, monitoring it with the top command in Linux, it's the only compute node whose CPU usage stays above 100% the entire time (the others only jump up to more than 1000% during matrix assembly and other portions of the solution process where more cores can be utilized).

I'm just having a hard time swallowing the thought that, while one computer can solve 1.7 MDoF, 20 nodes could not complete a problem with only slightly less than double the DoFs. I know the relation between memory requirements and DoFs might not be linear, but still.

3. In general, with, say, 8 GB available to me per node (since the cluster is a shared resource, more often than not I won't have the full 16 GB to play with) and 30 nodes running (for a grand total of 240 GB of usable RAM), what is a reasonable estimate of how large a model I could solve (in DoFs, anyway)?

If I'm going to keep debugging, I'd like to have a feel for who the actual culprit is here. If someone can tell me "...at 240 GB shared memory, you'll never get past ...#dofs...", that'd be a big help too. Then I'd know whether horsepower is or is not an inherent limitation and could look elsewhere (COMSOL pointed me to stack size, which I'm quite lost on, and there may be firewall issues I haven't sorted out yet).
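For what it's worth, below is a minimal sketch (my own, in Python) for checking the per-process stack limit on a node, since that is one of the things COMSOL support pointed me at; the usual shell-side change people suggest is ulimit -s unlimited in the shell that launches the COMSOL processes, though I haven't confirmed that this fixes the failures above.

    # check_stack_limit.py: minimal sketch that prints the soft/hard stack-size
    # limits on the node it runs on, so you can compare nodes and spot a
    # restrictive default. Whether raising the limit actually cures the
    # MPI/solver aborts is an open question here, not a verified fix.
    import resource

    def fmt(limit):
        if limit == resource.RLIM_INFINITY:
            return "unlimited"
        return str(limit // 1024) + " KiB"

    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack soft limit:", fmt(soft))
    print("stack hard limit:", fmt(hard))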


Ok, in any event, any input, suggestions, etc that anyone can give me would be a HUGE help! Thanks a million!!

--Matt

4 Replies Last Post Mar 11, 2012, 10:52 p.m. EDT
Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago May 9, 2011, 10:56 p.m. EDT
I have been running some large problems on a RHEL Linux cluster here without problems, so perhaps I can help you a little.

1. MUMPS is the only one I have used. If you run out of memory while using MUMPS in a parallel job, the entire job will fail, so you may want to monitor your memory usage during the solve. This is a bug: MUMPS should start paging to disk (using virtual memory) when it runs out of RAM, but instead it fails. I understand this is to be fixed in v4.2.

2. See the paper I wrote for the conference for the parallel processing performance I obtained on an 8-node cluster. I now have a 12-node cluster with 128 compute cores, each node with 64 GB of RAM. The direct solver uses a lot of memory, and adding nodes will reduce the per-node memory usage, but it is still a lot. A 4.4 MDoF problem uses nearly all of the 64 GB of RAM on each of 3 compute nodes.

3. Make sure your MPI setup is stable (see the sketch after this list for one way to check basic node-to-node connectivity). Which distribution of Linux are you using? We have only used RHEL here, and it is designed for cluster use. I think that is all COMSOL is guaranteed to work with, since that is what they have.

4. I heard today about v4.2: it will have a much-improved distributed parallel processing capability, including parallel assembly and iterative solvers. Good news for people with clusters! I don't know much more than that, or even whether PARDISO will be distributed or not.
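Regarding the MPI check in point 3, here is a minimal sketch (my own, not a COMSOL utility) that assumes the usual passwordless-SSH setup between nodes and a plain hostfile with one hostname per line; it just confirms that each node is reachable and reports the hostname you expect before you start a distributed job. The file name hosts.txt is a placeholder.

    # check_nodes.py: minimal sketch that connects to each host listed in a
    # hostfile via ssh and runs `hostname`, to catch unreachable nodes, password
    # prompts, or name mismatches before launching a distributed COMSOL job.
    import subprocess
    import sys

    hostfile = sys.argv[1] if len(sys.argv) > 1 else "hosts.txt"  # placeholder name

    with open(hostfile) as f:
        hosts = [line.strip() for line in f if line.strip() and not line.startswith("#")]

    for host in hosts:
        try:
            result = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "hostname"],
                capture_output=True, text=True, timeout=15,
            )
            status = result.stdout.strip() if result.returncode == 0 else "FAILED: " + result.stderr.strip()
        except subprocess.TimeoutExpired:
            status = "FAILED: timeout"
        print(host, "->", status)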

I think you are running out of memory. Monitor your memory usage as the job runs; it can build up with each iteration if part or all of your solution is being kept in memory. Do you have Ganglia installed? If so, look at it and monitor the memory on all the compute nodes. I bet one of the nodes is getting near its 16 GB while the job is running, and that is causing the MUMPS failure. We make sure nothing else is running alongside COMSOL while it runs in parallel (we have the nodes all to ourselves and lock them out while running); the cluster is shared between applications only at the granularity of entire nodes.
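If Ganglia is not available, a rough substitute (a minimal sketch of my own, not a COMSOL tool) is to log memory use from /proc/meminfo on each compute node during the solve and watch whether any node approaches its 16 GB; the log file name and the 10-second interval below are arbitrary choices.

    # log_mem.py: minimal sketch that appends a used-memory sample for this node
    # to a per-host log file every 10 seconds; run one copy on each compute node
    # while the COMSOL job is solving, then compare the logs against each node's RAM.
    import socket
    import time

    def mem_used_gb():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])  # values are reported in kB
        # "used" here means total minus free, buffers, and page cache (a rough figure)
        used_kb = info["MemTotal"] - info["MemFree"] - info.get("Buffers", 0) - info.get("Cached", 0)
        return used_kb / (1024 * 1024)

    host = socket.gethostname()
    with open("mem_" + host + ".log", "a") as log:
        while True:
            log.write(time.strftime("%H:%M:%S") + " " + format(mem_used_gb(), ".2f") + " GB used\n")
            log.flush()
            time.sleep(10)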


Posted: 1 decade ago Mar 10, 2012, 5:48 p.m. EST
James,

In another post (I don't remember the link) you wrote that when using the direct MUMPS solver, the approximate RAM used is about 20 GB per million DOF.

However, in this post you wrote that for 4.4 MDOF you need 3*64 GB of RAM!

Which of these two numbers is a better estimate?

Thanks,
Sirisha

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago Mar 11, 2012, 5:43 p.m. EDT
Dear Sirisha,

I don't recall from memory what you are asking about, but of course I believe what you are saying. Let me repeat for clarity:

1st case: 1 MDoF requires 20 GB of RAM

2nd case: 4.4 MDoF requires 3*64 = 192 GB of RAM

I am guessing that your deeper question might be why 4.4/1 .ne. 192/20, i.e., why the memory did not scale linearly with the DOF count.

I will attempt to answer.

The first thing to keep in mind is that every problem is different and requires a different amount of memory, depending on several factors, including: 1) the physics involved (heat transfer, structural mechanics, CFD, etc.); 2) how dense your mesh is (coarse, normal, fine, etc.); 3) the finite element basis (linear, quadratic, etc.); 4) the complexity of the physics coupling (union or assembly of the geometry, identity and contact pairs, time dependence, coupling variables, etc.); and 5) the solver method (direct, iterative, MUMPS, etc.).

So, let's assume the two examples you cite above involve exactly the same physics and only the mesh density is different. Case #1 requires only 20 GB of RAM, so it fits within a single compute node of a cluster and does not require any additional compute nodes; therefore, case #1 does not require distributed parallel processing (DPP). Case #2, on the other hand, is so large that it will not fit on a single compute node (our cluster is limited to 64 GB per compute node). Once you decide to go to DPP, you introduce a certain amount of overhead that you must contend with. In the example above, approximately 100% overhead is implied if the entire memory of all 3 machines is required: if there were no overhead at all in a DPP solution, it should take at least two 64 GB nodes to solve a 4.4/1*20 = 88 GB problem. Perhaps 3 compute nodes were chosen to solve the problem faster, or the additional overhead involved in DPP genuinely requires 3 compute nodes.
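To make the arithmetic above explicit (treating the roughly 20 GB per MDoF figure from the earlier post as a rule of thumb rather than an exact model), here is a short back-of-the-envelope check:

    # Back-of-the-envelope check of the scaling discussed above.
    import math

    gb_per_mdof = 20.0     # rough figure quoted earlier: about 20 GB per 1 MDoF
    problem_mdof = 4.4     # problem size in millions of DOFs
    node_ram_gb = 64.0     # RAM per compute node on the cluster described above

    ideal_gb = problem_mdof * gb_per_mdof        # 88 GB if memory scaled linearly
    actual_gb = 3 * node_ram_gb                  # about 192 GB actually consumed
    overhead = actual_gb / ideal_gb - 1.0        # about 1.2, i.e. roughly 100% overhead
    nodes_if_no_overhead = math.ceil(ideal_gb / node_ram_gb)   # 2 nodes in the ideal case

    print("ideal memory:", ideal_gb, "GB; actual:", actual_gb,
          "GB; overhead:", round(100 * overhead), "%; ideal node count:", nodes_if_no_overhead)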

If you read the papers where I have published my DPP performance with COMSOL carefully, you will find the specific performance conditions and the exact problem. Further, I believe the conference CDs also include the specific model files I used, with enough information that the results can be repeated on the same hardware and COMSOL version.



Posted: 1 decade ago Mar 11, 2012, 10:52 p.m. EDT
Dear James,

Thanks for the email. I have not read your papers. Could you provide a link to them or a soft copy? I would like to read them.

Have you ever come across a situation where COMSOL uses virtual memory even when physical memory is still available? I am running into such a situation, and I think it is slowing down my solution process.

Please email me the papers where you discuss DPP.

I thought that the mesh density, the kind of elements chosen, the choice of direct solver, and so on all boil down to the number of DOFs, and that it is the DOF count that drives the RAM requirement. Am I wrong in thinking so?

Thanks,
Sirisha
