Igniters,
I am working on the stability of our TC test runs.

Some of our execution timeouts (hangs, unexpected stops) happen because of issues in source code: the test itself, test runners, configurations, bugs, the Linux OOM killer and so on. We can fix those by changing code.

But almost all of the recent timeouts happened because many tests ran disk-intensive operations on one machine. Examples:

https://ci.ignite.apache.org/viewLog.html?buildId=1543562&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_ZooKeeperDiscovery2

https://ci.ignite.apache.org/viewLog.html?buildId=1543518&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_Basic1

and so on.

To fix this problem I propose to extract the persistence tests from "Run Basic" and "Run Cache" into new dedicated TC configurations.

Also, I would add a check that prevents new persistence tests from being added to other TC configurations in the future.

This would allow us to run almost all TC configurations on any agent, while the configurations with persistence would have agent rules so that they do not hit a timeout.

Thoughts?
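To make the proposed check concrete, here is a minimal sketch of what such a guard could look like. It is only an illustration under stated assumptions: the class name `PersistenceGuard`, the method `check`, and the property name `example.persistence.in.tests.allowed` are hypothetical and do not come from the Ignite code base. The idea is that a dedicated persistence configuration sets the property, and any other configuration fails fast when a test enables persistence.

```java
import java.util.Map;

/** Hypothetical guard for illustration only; not an actual Ignite class. */
public final class PersistenceGuard {
    /** Assumed property name, invented for this sketch. */
    public static final String ALLOW_PROP = "example.persistence.in.tests.allowed";

    private PersistenceGuard() {
        // Utility class, no instances.
    }

    /**
     * Fails fast when a test enables persistence in a suite that does not allow it.
     *
     * @param persistenceEnabled whether the test configuration turns persistence on
     * @param suiteProps properties of the running suite (for example, set in the TC configuration)
     * @throws IllegalStateException if persistence is requested but the suite does not allow it
     */
    public static void check(boolean persistenceEnabled, Map<String, String> suiteProps) {
        // A missing property is treated as "not allowed".
        boolean allowed = Boolean.parseBoolean(suiteProps.getOrDefault(ALLOW_PROP, "false"));

        if (persistenceEnabled && !allowed) {
            throw new IllegalStateException(
                "Persistence tests must run in a dedicated suite; set " + ALLOW_PROP + "=true there.");
        }
    }

    public static void main(String[] args) {
        check(false, Map.of());                    // in-memory test: accepted in any suite
        check(true, Map.of(ALLOW_PROP, "true"));   // dedicated persistence suite: accepted

        try {
            check(true, Map.of());                 // persistence in a regular suite: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a guard like this called from common test setup, a persistence test added to the wrong configuration would fail immediately instead of slowing down the whole agent.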
Ed,
We already discussed this some time ago. AFAIK SSDs do not have this problem, so all we need is to replace the HDDs with SSDs.

On Fri, Jul 27, 2018 at 3:26 PM Eduard Shangareev <[hidden email]> wrote:
> [...]
Vladimir,
I am talking only about Run Cache and Basic. I don't see any reason why we couldn't do so. Even with extra SSDs, it could be worth splitting the configurations to keep control over their impact on the disk system.

On Fri, Jul 27, 2018 at 3:28 PM, Vladimir Ozerov <[hidden email]> wrote:
> [...]

--
Best regards,
Eduard.
Hi all,
I'm reviving this thread because it seems to me that it might be better to go back to combined in-memory and PDS suites now. As usual, it is a long one, so feel free to skip to the TLDR.

The decision to split was mostly driven by the desire for shorter suite runs. It seems that the decision (and the related change https://issues.apache.org/jira/browse/IGNITE-9100) was quite reasonable at the time. But as Vladimir said, the time required by the tests with persistence was impacted by HDDs, and we do have SSDs now.

Having to split the tests is a burden on all developers. Persistence is core functionality, and core tests would mostly want to check it. Being forced to clone a test just to manage the execution time of a suite seems like overkill.

I see a potential benefit in having a test suite that doesn't require a fast disk, but for now it is just that - potential, while the additional work to split the tests is quite real.

Finally, it is hard to make sense of how exactly to split the tests, which suites to put them in, etc. I believe it makes it more difficult for new community members to join, which is the last thing we want.

TLDR

I propose to allow adding persistence tests to any suite, and to remove the PERSISTENCE_IN_TESTS_IS_ALLOWED_PROPERTY property and the related functionality added by https://issues.apache.org/jira/browse/IGNITE-9100. Instead, to manage the execution time of the suites, we can make a habit of splitting suites that take more than an hour, as is being done now in https://issues.apache.org/jira/browse/IGNITE-8849.

WDYT?

Thanks,
Stan

On Fri, Jul 27, 2018 at 3:46 PM Eduard Shangareev <[hidden email]> wrote:
> [...]