Partykit minsize parameter shrinks branches larger than minsize

I am using the lmtree() function from partykit to partition data using linear regressions. The regressions use weights, and I want each branch to have a minimum total weight, which I specify with the minsize option. For example, in the following code the tree has only two branches instead of three, because x1 == "C" is too light to form its own branch.

library("partykit")
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1="A", weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="B", weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="C", weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
X$x1 <- factor(X$x1)
# 'weights' is the full argument name; 'weight' only worked through partial matching
tr <- lmtree(y ~ TT | x1, data=X, weights=weight, minsize=150)

Fitted party:
[1] root
|   [2] x1 in A: n = 200
|       (Intercept)          TT 
|         0.7724903   0.2002023 
|   [3] x1 in B, C: n = 300
|       (Intercept)          TT 
|         0.5759213   0.4659592 
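
The total weight per level of x1 can be checked directly. The following is a minimal sketch in base R that rebuilds only the columns relevant for the weights (the response y is omitted here because it does not affect the totals); the name total_weight is introduced purely for illustration.

```r
# Rebuild the weight-related columns of the toy data from the question.
n <- 100
X <- rbind(
  data.frame(TT = 1:n, x1 = "A", weight = 2),
  data.frame(TT = 1:n, x1 = "B", weight = 2),
  data.frame(TT = 1:n, x1 = "C", weight = 1)
)

# Sum of the weights within each level of x1.
total_weight <- tapply(X$weight, X$x1, sum)
print(total_weight)
#   A   B   C 
# 200 200 100 

# Which levels could form a node of their own with minsize = 150?
print(total_weight >= 150)
```

Only A and B reach a total weight of 150 on their own, which matches the fitted tree: C (total weight 100) ends up merged with B.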

      

I also have some real data, which is unfortunately confidential, that leads to behavior I don't understand. When I do not specify minsize, lmtree() builds a tree with 30 branches, where the total weight n of each branch is a large number. However, when I specify a minsize that is well below the total weight of every branch in that first tree, the result is a new tree with far fewer branches. I would not expect the tree to change at all, because it seems that minsize is not binding. Are there any explanations for this result?

UPDATE

Providing an example

n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
tr <- lmtree(y ~ TT | x1, data=X, weights = weight)

Fitted party:
[1] root
|   [2] x1 <= 0.29787: n = 200
|       (Intercept)          TT 
|         0.8431985   0.1994021 
|   [3] x1 > 0.29787
|   |   [4] x1 <= 0.69515: n = 200
|   |       (Intercept)          TT 
|   |         0.6346980   0.3995678 
|   |   [5] x1 > 0.69515: n = 100
|   |       (Intercept)          TT 
|   |         0.4792462   0.5987472 

      

Now set minsize=150. The tree no longer splits at all, although a split into x1 <= 0.3 and x1 > 0.3 would be admissible.

tr <- lmtree(y ~ TT | x1, data=X, weights = weight, minsize=150)

Fitted party:
[1] root: n = 500
    (Intercept)          TT 
      0.6870078   0.3593374

      



1 answer


Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from a more explicit discussion:

  • If mob() selects a splitting variable at some stage that then does not lead to any admissible split (in terms of the minimum node size), then splitting stops at that point. This is in contrast to ctree(), which always performs a split if a significant test was found, even if the second best variable was not significant. It would probably be nice to offer more fine-grained control over this, and we have that on our wishlist for an upcoming revision of the package.

  • By default, weights are interpreted as case weights, i.e., mob() assumes that there are w independent observations identical to the given one. Thus, the number of observations is the sum of the weights. But note that this also affects the significance tests, for which the sample size increases!
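
The case-weight interpretation can be illustrated without partykit. The following is a small sketch in base R with made-up data (the data frame d and its columns x, y, w are hypothetical and not taken from the question): a weighted lm() fit yields the same coefficients as an unweighted fit on data where each row is repeated w times, and the number of rows of the expanded data equals the sum of the weights.

```r
set.seed(1)

# Hypothetical toy data with integer case weights of 1 or 2.
d <- data.frame(x = 1:10, w = rep(c(1, 2), 5))
d$y <- 2 * d$x + rnorm(10)

# Weighted fit on the original rows.
fit_w <- lm(y ~ x, data = d, weights = w)

# Unweighted fit on the expanded data: each row repeated w times.
d_rep <- d[rep(seq_len(nrow(d)), d$w), ]
fit_rep <- lm(y ~ x, data = d_rep)

# Same coefficients (up to numerical precision).
print(all.equal(coef(fit_w), coef(fit_rep)))

# Under the case-weight interpretation, the sample size is the sum of the weights.
print(c(sum_of_weights = sum(d$w), expanded_rows = nrow(d_rep)))
```

This is the same idea that a minimum node size such as minsize = 150 relies on when weights are supplied: it is compared against the sum of the weights in a node, not against the number of rows.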

As for your main question, it's hard to find an explanation without some kind of reproducible example. I agree that partykit should behave the way you describe, but maybe there is some important but not so obvious detail that you haven't noticed yet. It would be nice if you could come up with a small, simple artificial dataset that replicates the problem.

Update

As already pointed out in the comments: Thanks for the reproducible example in your updated question. This helped me catch a bug in how mob() handles case weights. When the test statistics were computed in the presence of case weights, an error occurred, which led to an incorrect variable selection and stopped the splitting. I just fixed this bug, and a new development version of partykit is available from R-Forge at https://r-forge.r-project.org/R/?group_id=261 . (Note, however, that R-Forge currently only builds Windows binaries for R 3.3.x. If you are using a later version of R on Windows, use type = "source" to install the source package, and make sure you have the required Rtools installed.)

For your example, I just set a random seed for exact reproducibility. The weighted data is set up as:

set.seed(1)
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)

      

Then the weighted tree can be fitted as before. In this particular example, the tree structure remains unaffected, but the test statistics and p-values of the parameter instability tests in each node change somewhat:

library("partykit")
tr1 <- lmtree(y ~ TT | x1, data = X, weights = weight)
plot(tr1)

      

[Figure: tree1]



Adding the argument minsize = 150 now has the expected effect, simply preventing the split in node 3.

tr2 <- lmtree(y ~ TT | x1, data = X, weights = weight, minsize = 150)
plot(tr2)

      

[Figure: tree2]

To make sure the latter actually does the right thing, we compare it with a tree fitted to explicitly expanded data. Since the weights are treated here as case weights, we can inflate the dataset by repeating the observations that have weights greater than 1.

Xw <- X[rep(1:nrow(X), X$weight), ]
tr3 <- lmtree(y ~ TT | x1, data = Xw, minsize = 150)

      

The obtained coefficients are the same (up to very small numerical differences):

all.equal(coef(tr2), coef(tr3))
## [1] TRUE

      

And more importantly, all test statistics and p-values in the nodes are also the same:

library("strucchange")
all.equal(sctest(tr2), sctest(tr3))
## [1] TRUE

      
