Split the list of objects into two lists according to the ratio by size and division of objects into sections
I have a list of objects, call it dataset
and I want to split into two new lists, call them trainSet
and testSet
when reading the input line dataset
by line, where the list object inherits from the node
read stream object (in this particular case).
We then define a bifurcated ratio split_ratio=0.75
to divide dataset
by a trainSet
at least 75% of dataset
and 25% of the length of the list testSet
(for example):
var dataSize = dataset.length,
lim_75 = 0,
lim_25 = 0;
lim_75 = Math.floor(dataSize * split_ratio); // e.g.: .75
if (lim_75 < 1) lim_75 = 1;
lim_25 = Math.floor(dataSize * (1 - split_ratio)); // eg: 1-.75=.25
if (lim_25 < 1) lim_25 = 1;
When reading line by line, I will do
while (row = dataset.next()) {
var r = Math.random(); // split randomness seed
if (r < split_ratio && lim_75_count < lim_75 || r >= split_ratio && lim_25_count > lim_25) {
lim_75_count += 1;
trainSet.write(row);
} else {
lim_25_count += 1;
testSet.write(row);
}
}
This works, I get at the end two lists trainSet
of 0.75 size and testSet
from 0.25 of the list length dataset
.
Now, suppose my objects in a list dataset
have a structure like
{
objectId: 12345,
objectClass: 'CLASS_A`
objectValue: 'The quick brown fox jumps over the lazy dog'
}
where objectClass
belongs to an enumeration of values ββsuch as CLASS_A
, CLASS_B
etc.
I want to keep the above split_ratio
for the ratio of length trainSet
and testSet
, but adding a new condition that would allow me to divide dataset
by a new ratio to split the list into the object key objectClass
, according to this partition map:
[
{
objectClass: 'TYPE_A',
ratio: 0.30
},
{
objectClass: 'TYPE_B',
ratio: 0.20
},
{
objectClass: 'TYPE_C',
ratio: 0.50
}
]
What's the correct splitting condition that will work AND
with the first based on size?
Example
Real-world examples to clarify: Suppose that split_ratio=.70
you have to have trainSet.length=70
, testSet.length=30
if `dataset.length = 100.
Now given two types objectClass
with ratio
values
CLASS_A=0.20
and CLASS_B=0.80
, we must have 0.2*70/100
elements CLASS_A
at trainSet
and 0.80*30/100
at testSet
, while we will have 0.80*70/100
elements CLASS_B
at trainSet
and 0.20*30/100
elements CLASS_B
at testSet
.
In this case, it is assumed that the partition map contains only types CLASS_A
and CLASS_B
, such as:
[
{
objectClass: 'TYPE_A',
ratio: 0.20
},
{
objectClass: 'TYPE_B',
ratio: 0.80
}
]
source to share
No one has answered this question yet
Check out similar questions: