This is an author’s version published in:
http://oatao.univ-toulouse.fr/22226
To cite this version:
Couso, Inès and Dubois, Didier A general
framework for maximizing likelihood under incomplete data. (2018)
International Journal of Approximate Reasoning, 93. 238-260.
ISSN 0888-613X
Official URL
DOI :
https://doi.org/10.1016/j.ijar.2017.10.030
A general framework for maximizing likelihood under incomplete data ✩,✩✩
Inés Couso a,∗, Didier Dubois b
a Dep. of Statistics and O.R., University of Oviedo, Spain
b IRIT, CNRS and Université de Toulouse, France
Abstract
Keywords: Random sets; Maximum likelihood; Incomplete information; Entropy
Maximum likelihood is a standard approach to computing a probability distribution that best fits a given dataset. However, when datasets are incomplete or contain imprecise data, a major issue is to properly define the likelihood function to be maximized. This paper highlights the fact that there are several possible likelihood functions to be considered, depending on the purpose to be addressed, namely whether the behavior of the imperfect measurement process causing incompleteness should be included or not in the model, and what are the assumptions we can make or the knowledge we have about this measurement process. Various possible approaches, that differ by the choice of the likelihood function and/or the attitude of the analyst in front of imprecise information, are comparatively discussed on examples, and some light is shed on the nature of the corresponding solutions.
1. Introduction
The key role of likelihood functions in statistical inference was first highlighted by Fisher [16] with the maximum likelihood principle. In his seminal book, Edwards ([15], p. 9) defines a likelihood function as being proportional to the probability of obtaining results given a hypothesis, according to a probability model:

Let $P(R|H)$ be the probability of obtaining results $R$ given the hypothesis $H$, according to the probability model ... The likelihood of the hypothesis $H$ given data $R$, and a specific model, is proportional to $P(R|H)$, the constant of proportionality being arbitrary.

Edwards mentions that “this probability is defined for any member of the set of possible results given any one hypothesis ... As such its mathematical properties are well-known. A fundamental axiom is that if $R_1$ and $R_2$ are two of the possible results, mutually exclusive, then $P(R_1 \text{ or } R_2|H) = P(R_1|H) + P(R_2|H)$”.
✩ This paper is part of the Virtual special issue on Soft Methods in Probability and Statistics, edited by Barbara Vantaggi, Maria Brigida Ferraro, Paolo Giordani.
✩✩ A preliminary version of this paper [7] was presented at the 8th Conference on Soft Methods in Probability and Statistics (SMPS) in Roma, September, 12–14, 2016.
* Corresponding author.
In other words, a fundamental axiom is that the probability of obtaining at least one among two results is the sum of the probabilities of obtaining each of these results. In particular, a result in the sense of Edwards is not any kind of event, it is an elementary event. Only elementary events can be observed. For instance, when tossing a die, and seeing the outcome, you cannot observe the event “odd”, you can only see 1, 3 or 5. So, a likelihood function is proportional to the conditional probability of an elementary event (the observed sample), where the condition part (the hypothesis) is a value of some model parameter. For instance, the conditional probability of the sure event cannot be viewed as the likelihood of the hypothesis given the sure event.
If this point of view is accepted, what becomes of the likelihood function under incomplete or imprecise observations? To properly answer this question, one must understand what is a result in this context. Namely, if we are interested in a certain random phenomenon modeled by a random variable, observations we get in this case may not directly inform us about this random variable. Due to the interference with an imperfect measurement process, observations will be set-valued [4,5]. So, in order to properly exploit such incomplete information (called coarse data in the literature [21]), we must first decide what to model:
1. the random phenomenon through its measurement process;
2. or the random phenomenon despite its measurement process.
In the first case, imprecise observations are considered as results, and we can construct the likelihood function of a random set, whose realizations are sets. These sets contain precise but ill-known realizations of the random variable of interest, to which we have no direct access. We say that this unreachable random variable is latent. Actually, most authors are interested in the other point of view. They consider that outcomes are the precise, although ill-observed, realizations of the random phenomenon, and wish to reconstruct a distribution for the latent variable. However in this case there are as many potential likelihood functions as precise datasets in agreement with the imprecise observations. Authors have proposed several ways of addressing this issue. The most traditional approach is based on the EM algorithm [11,29,13], which is an iterative procedure for efficient maximization of the likelihood of observed data. It constructs a distribution on the latent variable that minimizes divergence from the parametric model in agreement with the available data. It can also serve to reconstruct a sample of the latent variable.
In this paper, we propose a formal setting for the modeling of imprecisely observed random experiments, and define the three likelihood functions that can be built in this framework. Apart from the likelihood function based on available observations, there is the likelihood function based on outcomes of the latent random variable that was imprecisely observed, and the likelihood function based on the joint probability induced by pairs of outcomes and their measurement. The two latter likelihood functions are imprecisely known and we compare several alternatives to the maximization of the likelihood of imprecise observations, such as the maximax approach, and the robust approach to incomplete data. It includes more recent proposals by Hüllermeier [22], or Guillaume and Dubois [18], or Plass et al. [35]. We also discuss the use of assumptions on the measurement process such as the coarsening-at-random [21] and the superset assumptions, that help relating the various likelihood functions. Note that in this paper we do not consider the issue of imprecision due to too small a number of precise observations (see for instance, Masson and Denœux [32], or Serrurier and Prade [44]). We assume that the cause of imprecision lies in the incomplete description of the random experiment outcomes, not in the scarcity of observations.
2. The random phenomenon and its measurement process
Let a random variable $X: \Omega \to \mathcal{X}$ represent the outcome of a certain random experiment. For the sake of simplicity, let us assume that its range $\mathcal{X} = \{a_1, \ldots, a_m\}$ is finite. Suppose that observations of $X$ are imprecise, namely let $\Gamma: \Omega \to \wp(\mathcal{X})$ denote the (observable) multi-valued mapping representing our (imprecise) perception of $X$. So, if $\omega$ occurs then all we know is that $X(\omega) \in \Gamma(\omega) \subseteq \mathcal{X}$. In other words, we assume that $X$ is a selection of $\Gamma$, i.e. $X(\omega) \in \Gamma(\omega)$, $\forall \omega \in \Omega$. This setting is very close to the one of Dempster [10] who introduces a special case of upper and lower probabilities, based on random sets, later interpreted by Shafer [45] as belief and plausibility functions. The issue of set-valued data has been discussed in Ref. [5] from the point of view of descriptive statistics. In this paper we start addressing inferential statistics.
Let $\mathrm{Im}(\Gamma) = \{A_1, \ldots, A_r\} \subseteq \wp(\mathcal{X})$ denote the image of $\Gamma$ (the collection of possible set-valued outcomes). We can equivalently suppose that the imperfect measurement process is driven by another random variable $Y$, with finite range $\mathcal{Y} = \{b_1, \ldots, b_r\}$, that provides incomplete reports of observations of $X$. Namely, $Y(\omega) = b_j$ means that the measurement tool reports $\Gamma(\omega) = A_j$. The cardinality of $\mathrm{Im}(Y) = \mathcal{Y} = \{b_1, \ldots, b_r\}$ thus coincides with that of $\mathrm{Im}(\Gamma)$, and then there is a bijection between $\mathrm{Im}(\Gamma)$ and $\mathcal{Y}$ as follows:
\[
Y(\omega) = b_j \ \text{ iff } \ \Gamma(\omega) = A_j, \quad j = 1, \ldots, r,
\]
or yet we can assume that $b_j = A_j$. Let $P(X,Y)$ be the joint probability describing $X$ and its measurement.
In some applications, the variable $X$ is made of two components $X_o$ and $X_u$ respectively corresponding to observed and unobserved variables with respective domains $\mathcal{X}_o$ and $\mathcal{X}_u$, and $\Gamma$ is of the form $\{X_o\} \times \Gamma_u$, i.e., $Y = \{X_o\} \times \Gamma_u$. The observed variable $Y$ can then be identified with the random vector $Y = (X_o, Y_u)$, where $Y_u = \Gamma_u$.
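This framework highlights the difference between the outcome $X = a_k$ (its probability is $P(X = a_k)$), the fact that event $A_j$ occurs (its probability is $P(X \in A_j)$), and the report $A_j$ produced by the measurement process (its probability is $P(Y = A_j)$). The latter is always an elementary event, even when it corresponds to an imprecise observation of $X = a_k$.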
In the paper we assume the results of experiments are available in the form of relative frequencies $\hat{p}_{.j} = n_{.j}/n$, where $n_{.j}$ denotes the number of observations of $b_j = A_j$ in the sample, and $n$ is the sample size. The probability distribution $(\hat{p}_{.1}, \ldots, \hat{p}_{.r})$ on $\mathcal{Y}$ can also be viewed as a Dempster–Shafer mass assignment $m$ on $\wp(\mathcal{X})$ [45], letting $m(A_j) = \hat{p}_{.j}$ for $j = 1, \ldots, r$, inducing lower probabilities in the sense of [10] in the form of a belief function $Bel(A) = \sum_{E \subseteq A} m(E)$. This Dempster–Shafer mass assignment defines a convex set $\{P_X : P_X(A) \geq Bel(A), \forall A \subseteq \mathcal{X}\}$ of probabilities on $\mathcal{X}$, hence of joint probabilities on $\mathcal{X} \times \mathcal{Y}$ with known marginals $\hat{p}_{.j}$ for $j = 1, \ldots, r$ on $\mathcal{Y}$.
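As an illustration of the mass assignment just described, the following minimal sketch computes the belief of an event, together with its conjugate plausibility $Pl(A) = \sum_{E \cap A \neq \emptyset} m(E)$, from relative frequencies of set-valued reports. The counts and the function names `belief` and `plausibility` are assumptions introduced for this example only.

```python
def belief(mass, event):
    """Bel(A): total mass of the focal sets E contained in A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal <= event)


def plausibility(mass, event):
    """Pl(A): total mass of the focal sets E that intersect A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal & event)


# Hypothetical counts n_.j of the reports {h}, {t} and {h, t} (n = 10).
counts = {frozenset("h"): 4, frozenset("t"): 2, frozenset("ht"): 4}
n = sum(counts.values())

# Mass assignment m(A_j) = n_.j / n.
mass = {focal: c / n for focal, c in counts.items()}

bel_h = belief(mass, {"h"})
pl_h = plausibility(mass, {"h"})
```

Any probability $P_X$ in the convex set mentioned above satisfies $Bel(\{h\}) \leq P_X(\{h\}) \leq Pl(\{h\})$, so the pair `(bel_h, pl_h)` can be read as bounds on the probability of the outcome $h$.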
An alternative way of modeling the generation of coarse data consists in using so-called coarsening variables [20]. It supposes the existence of a random variable $C$ valued on a finite space $\mathcal{C}$, and a function $F: \mathcal{X} \times \mathcal{C} \to \wp(\mathcal{X}) \setminus \{\emptyset\}$ such that $Y = F(X, C)$.
We overview below two different ways to represent the information about the joint distribution of the random vector $(X, Y)$. Subsection 2.1 will refer to the outcome of the experiment $X$ and the “coarsening” or “imprecisiation” process¹ that leads us to just get imprecise observations of $X$, described by $Y$. Subsection 2.2 will represent the joint distribution of $(X, Y)$ the other way around, by means of the marginal probability of the observations ($Y$) and the conditional probability of $X$ given $Y$. The “imprecisiation” or “disambiguation” views respectively correspond to what Little [27] calls selection models and pattern mixture models, albeit expressed in the framework of missing data using coarsening variables.
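2.1. Generation and imprecisiation processes
Let us consider the following matrix $(M|p)$:
\[
(M \mid p) = \left(\begin{array}{ccc|c}
p_{.1|1.} & \cdots & p_{.r|1.} & p_{1.} \\
\vdots & & \vdots & \vdots \\
p_{.1|m.} & \cdots & p_{.r|m.} & p_{m.}
\end{array}\right)
\]
where
• $p_{.j|k.} = P(Y = A_j | X = a_k)$ denotes the (conditional) probability of observing $b_j = A_j$ if the true outcome is $a_k$ and
• $p_{k.} = P(X = a_k)$ denotes the probability that the true outcome is $a_k$.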
Such a matrix determines the joint probability distribution $P(X, Y)$ modeling the underlying generating process plus the connection between true outcomes and incomplete observations. More specifically, the vector $(p_{1.}, \ldots, p_{m.})^T$ characterizes the underlying generating random process while the matrix $M = (p_{.j|k.})_{k=1,\ldots,m;\, j=1,\ldots,r}$ is the so-called mixing matrix ([46]) that represents the imprecisiation process. In the setting of Dempster’s upper and lower probabilities [10], nothing is assumed about the matrix $M$ and $(p_{1.}, \ldots, p_{m.})^T$ is unknown. This is not the case in more recent works whose aim is to retrieve information about $X$ from information about $Y$, using a model of the measurement process, by means of some assumption on the mixing matrix $M$.
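To make the role of $(M|p)$ concrete, here is a small sketch that recovers the joint distribution $p_{kj} = p_{k.} \cdot p_{.j|k.}$ and the marginal distribution of the reports from a mixing matrix and a marginal on $\mathcal{X}$. The numerical values and the function names are hypothetical and serve only as an illustration.

```python
def joint_from_mixing(M, p):
    """Return joint[k][j] = P(X = a_k, Y = A_j) = p[k] * M[k][j]."""
    for row in M:
        # Each row of M is a conditional distribution over the reports.
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of the mixing matrix must sum to 1")
    return [[p_k * m_kj for m_kj in row] for p_k, row in zip(p, M)]


def marginal_y(joint):
    """Column sums of the joint: P(Y = A_j) for every report A_j."""
    return [sum(column) for column in zip(*joint)]


# Hypothetical example: two outcomes, three possible reports A_1, A_2, A_3.
M = [[0.7, 0.0, 0.3],
     [0.0, 0.5, 0.5]]
p = [0.6, 0.4]

joint = joint_from_mixing(M, p)
p_y = marginal_y(joint)
```

Only `p_y` can be estimated directly from set-valued data; going back from `p_y` to `p` requires assumptions on `M`, which is what the settings listed next provide.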
2.1.1. Some particular settings and their characteristic matrices
• Partition. Suppose that $\{A_1, \ldots, A_r\}$ forms a partition of $\mathcal{X}$. Then $P(Y = A_j | X = a_k) = 1$ if $a_k \in A_j$, and 0 otherwise, $\forall j, k$. We can thus divide the $m$ elements of $\mathcal{X}$ into $r$ categories of respectively $k_1, \ldots, k_r$ elements each. We can denote $\mathcal{X} = \{a_{11}, \ldots, a_{1k_1}, \ldots, a_{r1}, \ldots, a_{rk_r}\}$ and particularize the above matrix as follows:
\[
\left(\begin{array}{ccc|c}
1 & \cdots & 0 & p_{11.} \\
\vdots & & \vdots & \vdots \\
1 & \cdots & 0 & p_{1k_1.} \\
\vdots & & \vdots & \vdots \\
0 & \cdots & 1 & p_{r1.} \\
\vdots & & \vdots & \vdots \\
0 & \cdots & 1 & p_{rk_r.}
\end{array}\right)
\]
1 The term “coarsening” is commonly used in the literature of statistics with incomplete data. Notwithstanding, the idea of coarsening has been also linked to the idea of partition and indiscernibility. For instance, in Rough Set Theory [34] it means “change of granularity”. Also, in Shafer’s Theory of Evidence [45], it is linked to the idea of defining a partition of indiscernible elements. Indeed in some cases, within the literature of coarse statistics the notion of coarsening variable comes down to choosing a partition, one element of which is the imprecise observation. So the term “coarsening process” seems to often underlie this partition-based modeling of imprecise observation generation. However in this paper this is modeled by a mere multi-mapping in the style of Dempster [10]. In the rest of the paper, we will use the term “imprecisiation”, that does not presuppose coarse data to be generated through partitioning.
where $p_{il.} = P(X = a_{il})$ denotes the (marginal) probability that $X$ takes the value $a_{il}$. $Y$ is then a function of $X$, $Y = f(X)$. In this case, the joint distribution of $(X, Y)$ is determined by the marginal distribution of $X$. This procedure determines an equivalence relation over $\mathcal{X}$:
\[
a_i \, R \, a_j \Leftrightarrow f(a_i) = f(a_j),
\]
and therefore, a collection of equivalence classes, $\Pi = \{A_1, \ldots, A_r\}$, determining a partition of $\mathcal{X}$. It is clear that in this case $P(X, Y)$ is generated by a coarsening variable reduced to a constant ($f(X) = F(X, c)$). This setting is the one proposed by Dempster et al. [11] in their famous paper on the EM algorithm, presented as an approach to obtaining a maximum likelihood estimate (MLE) under incomplete information.²
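• Miss-or-observe setting. In this case, we assume that either the value of $X$ is observed precisely or that it is not observed at all. Then, $r = m + 1$ and $\{A_1, \ldots, A_r\} = \{\mathcal{X}, \{a_1\}, \ldots, \{a_m\}\}$.
Let $P(\Gamma = \mathcal{X} | X = a_k) = \alpha_k$, $P(\Gamma = \{a_k\} | X = a_k) = 1 - \alpha_k$, $k = 1, \ldots, m$. The mixing matrix $M$ is therefore of the form:
\[
\begin{array}{c|cccccc}
\text{out.} \backslash \text{obs.} & \mathcal{X} & \{a_1\} & \cdots & \{a_i\} & \cdots & \{a_m\} \\
\hline
a_1 & \alpha_1 & 1-\alpha_1 & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & & \ddots & & & \vdots \\
a_i & \alpha_i & 0 & \cdots & 1-\alpha_i & \cdots & 0 \\
\vdots & \vdots & & & & \ddots & \vdots \\
a_m & \alpha_m & 0 & \cdots & 0 & \cdots & 1-\alpha_m
\end{array}
\]
This is the situation of missing data [39]. In this case, there is a coarsening variable $C$ with range $\{0, 1\}$, such that $F(X, C)$ is a singleton in $\mathcal{X}$ if $C = 1$ and is $\mathcal{X}$ otherwise. We can see that $P(C = 0 | X = a_i) = \alpha_i = P(Y = \mathcal{X} | X = a_i)$. An important particular case is when the probabilities $\alpha_i$ are constant, that is, the probability of missing data does not depend on the outcome of the latent variable. It is also a particular case of the missing-completely-at-random (MCAR) assumption known in the literature [28]. If the latent variable $X$ is made of two components $X_o$ and $X_u$ respectively corresponding to observed and unobserved variables, the MCAR assumption then reads $P(\Gamma_u = \mathcal{X}_u | X_o, X_u) = P(\Gamma_u = \mathcal{X}_u) = \alpha$.
Another usual assumption is the missing-at-random one (MAR), that reads $P(\Gamma_u = \mathcal{X}_u | X_o, X_u) = P(\Gamma_u = \mathcal{X}_u | X_o)$.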
• Coarsening at random assumption (CAR). This notion was introduced by Heitjan and Rubin (see [21]). According to this assumption, the underlying data do not affect the observations, i.e., an observation $Y = A_j$ is not influenced by the specific value taken by the random variable $X$ inside the set $A_j$. Mathematically, this condition is expressed as follows:
\[
P(Y = A_j | X = a_k) = P(Y = A_j | X = a_{k'}), \quad \forall k, k' \text{ with } a_k, a_{k'} \in A_j, \ \forall j,
\]
or equivalently,
\[
P(Y = A_j | X = a_k) = P(Y = A_j | X \in A_j), \quad \forall a_k \in A_j, \ j = 1, \ldots, r.
\]
A generalization of this notion has been recently considered in a machine learning context by J. Plass et al. in References [35,36], where $X$ involves observed and unobserved variables, respectively $X_o$ and $X_u$. Under the assumption CAR, which generalizes MCAR, set-valued observations $\Gamma_u$ are assumed to be independent from the true value of $X_u$. Let us notice that this generalized assumption collapses into the original CAR definition whenever $X_o$ is a constant. The authors of [35,36] also introduced a kind of “orthogonal” assumption in this machine learning context called “subgroup independence (SI)”. Under this assumption, the set-valued observations $\Gamma_u$ are assumed not to be influenced by the value of $X_o$. Testability of this assumption is studied by the authors, a problem that falls out of the scope of our manuscript.
• Coarsening Completely at Random Assumption (CCAR) ([39]). The natural definition would be that of imprecise observations $Y$ independent of actual outcomes for $X$, i.e., $P(X, Y) = P(X) P(Y)$. However Jaeger [25] points out that this definition is problematic because it allows for joint outcomes $(x, A)$ such that $x \notin A$. To enforce consistency between outcomes and set-valued observations ($X(\omega) \in \Gamma(\omega)$), the CCAR assumption must refer to a coarsening variable. Jaeger [25] proposes that $P(X, Y)$ is CCAR with respect to a coarsening variable $C$ if $P(C = c | X = a_k) = P(C = c | X = a_{k'})$ for every pair $a_k, a_{k'} \in \mathcal{X}$ of elements with positive probability, i.e., included in the set $\{x \in \mathcal{X} : P(X = x) > 0\}$.
• Superset assumption ([23]). Again we consider observed and unobserved variables with respective domains $\mathcal{X}_o$ and $\mathcal{X}_u$. We assume that when $X = x$ is fixed, the conditional $P(Y = B | X_u = x)$ does not depend on $B$ whenever $x \in B$, and is 0 otherwise. For every $x \in \mathcal{X}_u$ the number of subsets of $\mathcal{X}_u$ that contain it is the same, namely $2^{\#\mathcal{X}_u - 1}$. Therefore $P(Y = B | X_u = x) = 1/2^{\#\mathcal{X}_u - 1}$.
This assumption is dual to the missing-at-random assumption in the sense that in the latter, $Y = B$ is fixed and $P(Y = B | X_u = x)$ does not depend on the choice of $x$ inside $B$, while in the superset assumption, the set-valued observation induced by $X = x$ can be any superset of $x$ with equal probability. This assumption is often presented as capturing the idea of “lack of information” about the measurement process. In fact, the uniform distribution is used to reflect a balance among outcomes when no information is available. This modeling of ignorance has been questioned in the context of
2 Curiously, the connection between this approach and the more general setting for incomplete information of [10] is not made by Dempster et al. [11].
non additive representations of belief (e.g., Shafer belief functions [45], or Walley’s imprecise probability theory [50], where a complete lack of information is usually represented by means of a vacuous possibility distribution).
The superset assumption can be particularized to the case where $X_o$ is a constant. It is then to the original general superset assumption what the original CAR assumption is to the generalized version considered by J. Plass et al. [35,36].
The next example illustrates both assumptions.
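The CAR and superset conditions listed above can be checked mechanically on a mixing matrix. The sketch below is only illustrative: it stores a mixing matrix as a nested dictionary `mixing[x][B] = P(Y = B | X = x)`, a representation chosen here for convenience, and the helper names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(universe):
    """All non-empty subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def superset_mixing(universe):
    """Superset assumption: P(Y=B | X=x) = 1 / 2**(#universe - 1) if x in B, else 0."""
    subsets = nonempty_subsets(universe)
    weight = 1.0 / 2 ** (len(universe) - 1)
    return {x: {B: (weight if x in B else 0.0) for B in subsets}
            for x in universe}


def satisfies_car(mixing, tol=1e-12):
    """CAR: for every report B, P(Y=B | X=x) is the same for all x in B."""
    reports = {B for row in mixing.values() for B in row}
    for B in reports:
        values = [mixing[x].get(B, 0.0) for x in B]
        if max(values) - min(values) > tol:
            return False
    return True


# Superset assumption on a three-element universe.
superset = superset_mixing({"a", "b", "c"})

# Miss-or-observe coin with different non-report rates for heads and tails.
h, t, ht = frozenset("h"), frozenset("t"), frozenset("ht")
unbalanced = {"h": {h: 0.9, ht: 0.1}, "t": {t: 0.5, ht: 0.5}}
```

With these inputs, `satisfies_car(superset)` holds while `satisfies_car(unbalanced)` does not, in line with the remark that the superset assumption is a special case of CAR.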
Example 1. A coin is tossed. The random variable $X: \Omega \to \mathcal{X}$, where $\mathcal{X} = \{h, t\}$, represents the result of the toss. We do not directly observe the outcome, that is reported by Peter, who sometimes decides not to tell us the result. The rest of the time, the information he provides about the outcome is faithful. Let $Y$ denote the information provided by this person about the result. It takes the “values” $\{h\}$, $\{t\}$ and $\{h, t\}$.
This example corresponds to the following matrix $(M|p)$:
\[
\left(\begin{array}{ccc|c}
1-\alpha & 0 & \alpha & p \\
0 & 1-\beta & \beta & 1-p
\end{array}\right)
\]
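The marginal distribution of $X$ (outcome of the experiment) is given as
– $p_{1.} = P(X = h) = p$,
– $p_{2.} = P(X = t) = 1 - p$.
The joint probability distribution of $(X, Y)$ is therefore determined by:
\[
\begin{array}{c|ccc}
X \backslash Y & \{h\} & \{t\} & \{h,t\} \\
\hline
h & (1-\alpha)p & 0 & \alpha p \\
t & 0 & (1-\beta)(1-p) & \beta(1-p)
\end{array}
\]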
The marginal distribution of $Y$ (information provided by Peter) is thus:
– $p_{.1} = P(Y = \{h\}) = P(Y = \{h\}, X = h) + P(Y = \{h\}, X = t) = (1-\alpha) \cdot p + 0 = (1-\alpha) \cdot p$,
– $p_{.2} = P(Y = \{t\}) = P(Y = \{t\}, X = h) + P(Y = \{t\}, X = t) = 0 + (1-\beta) \cdot (1-p) = (1-\beta) \cdot (1-p)$,
– $p_{.3} = P(Y = \{h,t\}) = P(Y = \{h,t\}, X = h) + P(Y = \{h,t\}, X = t) = \alpha \cdot p + \beta \cdot (1-p)$.
Under the CAR assumption, we have that $\alpha = \beta$, i.e.,
\[
P(Y = \{h,t\} | X = h) = P(Y = \{h,t\} | X = t).
\]
Let us now consider the (binary) random variable $C$ that takes the value $C = 0$ when $Y = \{h,t\}$ and $C = 1$ otherwise. Let $F$ be defined as $F(x, 0) = \{h,t\}$ and $F(x, 1) = \{x\}$. The CCAR assumption with respect to the coarsening variable $C$ is equivalent to the above mentioned CAR condition.
The superset assumption is more restrictive and assumes that
– $\alpha = P(Y = \{h,t\} | X = h) = P(Y = \{h\} | X = h) = 0.5$ and
– $\beta = P(Y = \{h,t\} | X = t) = P(Y = \{t\} | X = t) = 0.5$
and therefore $P(Y = \{h,t\}) = 0.5$. In words, no matter what the true outcome is (heads or tails) Peter does not give any information about it 50% of the time. The only remaining parameter is $p$.
This example demonstrates that the superset assumption represents significant knowledge on the observation process. This is the price paid if we need to provide a stochastic model of the measurement process, as the uniform distribution over all supersets of $\{x\}$ is the least prejudiced probabilistic assumption we can make. The superset assumption is in fact stronger than the CAR assumption: since under the former assumption, $P(Y = B | X_u = x)$ only depends on the number of subsets of $\mathcal{X}$ that contain $x$, and this number is the same for all $x$, it follows that $P(Y = B | X_u = x) = P(Y = B | X_u = x')$ for $x, x' \in B$, which is CAR.
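The computations of Example 1 can be reproduced numerically. The sketch below builds the joint distribution of the toss and of Peter's report for given values of $p$, $\alpha$ and $\beta$, then sums it into the marginal distribution of the reports; the value $p = 0.6$ and the function names are hypothetical choices made for illustration.

```python
def coin_joint(p, alpha, beta):
    """Joint P(X, Y) of the coin example; keys are (outcome, report)."""
    return {
        ("h", "{h}"): (1 - alpha) * p,
        ("h", "{h,t}"): alpha * p,
        ("t", "{t}"): (1 - beta) * (1 - p),
        ("t", "{h,t}"): beta * (1 - p),
    }


def report_marginal(joint):
    """Marginal distribution of the report Y, obtained by summing over X."""
    marginal = {}
    for (_, report), prob in joint.items():
        marginal[report] = marginal.get(report, 0.0) + prob
    return marginal


# Superset assumption: alpha = beta = 0.5; p = 0.6 is a hypothetical value.
joint = coin_joint(p=0.6, alpha=0.5, beta=0.5)
p_y = report_marginal(joint)
```

Whatever the value of `p`, the entry `p_y["{h,t}"]` stays at 0.5 under the superset assumption, which is the strong piece of knowledge on the measurement process pointed out above.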
2.2. Imprecise observations and their disambiguations
We can alternatively characterize the joint probability distribution of $(X, Y)$ by means of the marginal distribution of $Y$ (observations) and the conditional probability of each result $X = a_k$, knowing that the observation was $Y = b_j$ (or equivalently $\Gamma = A_j$), for every $j = 1, \ldots, r$.
The new matrix $(M'|p')$ can be written as follows:
\[
(M' \mid p') = \left(\begin{array}{ccc|c}
p_{1.|.1} & \cdots & p_{m.|.1} & p_{.1} \\
\vdots & & \vdots & \vdots \\
p_{1.|.r} & \cdots & p_{m.|.r} & p_{.r}
\end{array}\right)
\]
where
• $p_{k.|.j} = P(X = a_k | Y = A_j)$ denotes the (conditional) probability that the true value of $X$ is $a_k$ if we have been reported that it belongs to $A_j$;
• $p_{.j} = P(Y = b_j) = P(Y = A_j)$ denotes the probability that the generation plus the imprecisiation processes lead us to observe $A_j$.
Such a matrix also determines the joint probability distribution modeling the underlying generating process plus the connection between true outcomes and incomplete observations. More specifically, the vector $(p_{.1}, \ldots, p_{.r})^T$ characterizes the observation process while the matrix $M' = (p_{k.|.j})_{k=1,\ldots,m;\, j=1,\ldots,r}$ represents the conditional probability of $X$ (true outcome) given $Y$ (observation).
Some recent studies on “partial identification” (see [30], for instance) can be somehow related with this framework. They consider situations where parameters of interest are partially identified (see [31]). For instance, when the parameter is real-valued, the so-called “identification region” is a subset of the real line. Imbens and Manski [24] propose the construction of confidence intervals that cover every element in the region with a specific confidence level whose bounds can be computed from sample data.
In this regard, the marginal distribution on $\mathcal{X}$, $(p_{1.}, \ldots, p_{m.})$, is sometimes partially identified, on the basis of our knowledge of the marginal distribution on $\mathcal{Y}$. In fact, let us notice that, according to the total probability theorem, we can write:
\[
p_{k.} = \sum_{j} p_{.j} \cdot p_{k.|.j}, \quad \forall k = 1, \ldots, m.
\]
Furthermore, partial information about the matrix $M'$ is sometimes available. As a consequence, confidence regions for any parameter inducing the marginal distribution $(p_{1.}, \ldots, p_{m.})$ on $\mathcal{X}$ can be derived from the above information, on the basis of observable frequencies, that may allow us to provide confidence estimations for the marginal distribution on $\mathcal{Y}$.
Consider for instance a miss-or-observe problem, where $Y$ takes $r = m + 1$ values of the form $b_j = \{a_j\}$, $j = 1, \ldots, m$, and $b_{m+1} = \mathcal{X}$. The equality $P(X = a_k | Y = \{a_k\}) = 1$ holds, for every $k = 1, \ldots, m$. Furthermore, $P(X = a_k | Y = \mathcal{X})$ is known to be included in the unit interval. Based on the above information and on observable marginal frequencies, we can compute set-valued confidence estimations for every $(p_{1.}, \ldots, p_{m.})$ or alternatively, for a parameter determining it.
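In the miss-or-observe case, the total probability formula gives $p_{k.} = p_{.k} + p_{.m+1} \cdot P(X = a_k | Y = \mathcal{X})$, and since the last factor ranges over $[0, 1]$, each $p_{k.}$ lies between $p_{.k}$ and $p_{.k} + p_{.m+1}$. The sketch below computes these bounds from hypothetical relative frequencies. It treats the frequencies as if they were the exact marginal on $\mathcal{Y}$, so it yields an identification region rather than the confidence intervals of [24], which would also account for sampling variability.

```python
def miss_or_observe_bounds(freq_singletons, freq_missing):
    """Bounds on p_k. implied by p_k. = p_.k + p_.(m+1) * P(X=a_k | Y=X)."""
    return {a: (f, f + freq_missing) for a, f in freq_singletons.items()}


# Hypothetical relative frequencies of the reports {a1}, {a2}, {a3},
# and of the uninformative report (missing value).
bounds = miss_or_observe_bounds({"a1": 0.5, "a2": 0.2, "a3": 0.1}, 0.2)
```

Here the interval for `a1` is roughly $[0.5, 0.7]$: the lower bound is attained when none of the missing reports hides `a1`, the upper bound when all of them do.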
One example of assumption in this setting is the Uniform Conditional Distribution Assumption. In this case we assume that if $A_j$ is observed, all the possible outcomes $a_k \in A_j$ are equally probable, due to a symmetry argument such as the insufficient reason principle. The conditional distribution is then given by:
\[
p_{k.|.j} = \begin{cases} \dfrac{1}{\# A_j}, & \text{if } a_k \in A_j, \\ 0, & \text{otherwise.} \end{cases}
\]
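Knowing the distribution on $\mathcal{Y}$ (which can be estimated from the data), the marginal distribution on $\mathcal{X}$ can be estimated as well since:
\[
p_{k.} = \sum_{j=1}^{r} p_{k.|.j} \cdot p_{.j} = \sum_{j : A_j \ni a_k} \frac{1}{\# A_j} \, p_{.j}.
\]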
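The last formula is straightforward to evaluate: each observed set shares its frequency equally among its elements. A minimal sketch, assuming the distribution on $\mathcal{Y}$ is given as a dictionary from sets to probabilities (the numbers and the function name are hypothetical):

```python
def uniform_conditional_marginal(mass):
    """p_k. = sum over the sets A_j containing a_k of p_.j / #A_j."""
    marginal = {}
    for focal, m in mass.items():
        share = m / len(focal)  # equal share for every element of A_j
        for a in focal:
            marginal[a] = marginal.get(a, 0.0) + share
    return marginal


# Hypothetical distribution of the reports in the coin example.
mass = {frozenset({"h"}): 0.4, frozenset({"t"}): 0.2, frozenset({"h", "t"}): 0.4}
p_x = uniform_conditional_marginal(mass)
```

With these numbers the uninformative report $\{h, t\}$ contributes 0.2 to each outcome, so the estimated marginal is about 0.6 for heads and 0.4 for tails.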
Note that this assumption is similar to the superset assumption, exchanging the roles of subsets and elements of $\mathcal{X}$. In the coin-tossing example, it comes down to assuming
\[
P(X = h | Y = \{h,t\}) = P(X = t | Y = \{h,t\}) = 0.5
\]
instead of $P(Y = \{h,t\} | X = h) = P(Y = \{h,t\} | X = t) = 0.5$. Viewing a probability distribution on $\mathcal{Y}$ as a Dempster–Shafer mass assignment $m$ on $\wp(\mathcal{X})$ as mentioned at the beginning of this section, $P_X$, as defined above, is the pignistic transform [47] of the belief function induced by the following mass assignment:
\[
m(A_j) = p_{.j}, \quad \forall j = 1, \ldots, r,
\]
or yet its Shapley value. More generally, fixing a mixing matrix $M'$ comes down to picking a probability distribution on $\mathcal{X}$ from the convex credal set $\{P_X : P_X(A) \geq Bel(A), \forall A \subseteq \mathcal{X}\}$.
3. Maximum likelihood strategies under incomplete information
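Each matrix $(M|p)$ or $(M'|p')$ is enough to univocally characterize the joint distribution of $(X, Y)$. For each pair $(k, j) \in \{1, \ldots, m\} \times \{1, \ldots, r\}$, let $p_{kj}$ denote the joint probability $p_{kj} = P(X = a_k, Y = A_j)$. According to the nomenclature used in the preceding subsections, the respective marginals on $\mathcal{X}$ and $\mathcal{Y}$ are denoted as follows:
• $p_{.j} = P(Y = A_j) = \sum_{k=1}^{m} p_{kj}$ will denote the mass of $Y = A_j$, for every $j$;
• $p_{k.} = P(X = a_k) = \sum_{j=1}^{r} p_{kj}$ will denote the mass of $X = a_k$, for every $k$.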
Now, let us assume that the above joint distribution is characterized by means of a (vector of) parameter(s) $\theta \in \Theta$ (in the sense that entries in $M$ and $M'$ can be written as functions of $\theta$). We naturally assume that the number of components of $\theta$ is less than or equal to the dimension of both matrices, i.e., it is less than or equal to $\min\{m \times (r+1), r(m+1)\}$. In other words, the approach uses a parametric model such that a value of $\theta$ determines a joint distribution on $\mathcal{X} \times \mathcal{Y}$. When the joint probability measure is the parametric distribution associated to the (vector of) value(s) $\theta$ of the parameter, we will respectively use the nomenclature
• $p^{\theta}_{kj} = P(X = a_k, Y = A_j; \theta)$,
• $p^{\theta}_{k.} = \sum_{j=1}^{r} p^{\theta}_{kj} = P(X = a_k; \theta)$ and
• $p^{\theta}_{.j} = \sum_{k=1}^{m} p^{\theta}_{kj} = P(Y = A_j; \theta)$.
Let us consider a sequence $\mathbf{Z} = ((X_1, Y_1), \ldots, (X_N, Y_N))$ of $N$ iid random variables that are “copies” of $Z = (X, Y)$. We will use the nomenclature $\mathbf{z} = ((x_1, y_1), \ldots, (x_N, y_N)) \in (\mathcal{X} \times \mathcal{Y})^N$ to represent a specific sample of the vector $(X, Y)$. Thus, $\mathbf{y} = (y_1, \ldots, y_N)$ will denote the observed sample (an observation of the vector $\mathbf{Y} = (Y_1, \ldots, Y_N)$), and $\mathbf{x} = (x_1, \ldots, x_N)$ will denote an arbitrary artificial sample from $\mathcal{X}$ for the unobservable (latent) variable $X$, that we shall vary in $\mathcal{X}^N$. Let $(G_1, \ldots, G_N)$ be the sequence of subsets of $\mathcal{X}$ that corresponds to the observed sample $\mathbf{y}$ (namely, $Y_i = y_i$ corresponds to $X_i \in G_i$).
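We can describe any sample $\mathbf{z}$ in frequentist terms assuming exchangeability:
• $n_{kj} = \sum_{i=1}^{N} 1_{\{(a_k, b_j)\}}(x_i, y_i)$ is the number of repetitions of $(a_k, b_j)$ in sample $\mathbf{z}$;
• $\sum_{k=1}^{m} n_{kj} = n_{.j}$ is the number of observations of $b_j = A_j$ in $\mathbf{y}$;
• $\sum_{j=1}^{r} n_{kj} = n_{k.}$ is the number of appearances of $a_k$ in $\mathbf{x}$.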
Clearly, $\sum_{k=1}^{m} n_{k.} = \sum_{j=1}^{r} n_{.j} = N$. Let the reader notice that, once a specific sample $\mathbf{y} = (y_1, \ldots, y_N) \in \mathcal{Y}^N$ has been observed, the number $n_{kj}$ of repetitions of each pair $(a_k, b_j) \in \mathcal{X} \times \mathcal{Y}$ in the sample can be expressed as a function of $\mathbf{x} = (x_1, \ldots, x_N)$. Moreover, in the following, $\mathcal{X}^{\mathbf{y}}$ denotes the collection of feasible marginal samples $(x_1, \ldots, x_N)$ of $X$, in accordance with the observation $\mathbf{y}$:
\[
\mathcal{X}^{\mathbf{y}} = \{\mathbf{x} \in \mathcal{X}^N : x_i \in G_i, \ i = 1, \ldots, N\}
\]
and likewise $\mathcal{Z}^{\mathbf{y}}$ denotes the collection of feasible (joint) samples $(z_1, \ldots, z_N)$ of $Z$, in accordance with the observation $\mathbf{y}$:
\[
\mathcal{Z}^{\mathbf{y}} = \{\mathbf{z} \in (\mathcal{X} \times \mathcal{Y})^N : z_i = (x_i, y_i) \text{ and } x_i \in G_i, \ i = 1, \ldots, N\}.
\]
3.1. Which likelihood function?
We may consider three different likelihood functions (and their respective logarithms), depending on whether we refer to the observed sample $\mathbf{y} = (y_1, \ldots, y_N)$, the sample of (ill-observed) outcomes $\mathbf{x} = (x_1, \ldots, x_N)$, or the complete sample $\mathbf{z}$, and a fourth expression that interprets imprecise observations as events. We will use the following nomenclature to distinguish them from each other:
Visible likelihood function. $p(\mathbf{y}; \theta) = \prod_{i=1}^{N} p(y_i; \theta)$ denotes the probability of observing $\mathbf{y} \in \mathcal{Y}^N$, assuming that the value of the parameter is $\theta$. It can be alternatively expressed as $p(\mathbf{y}; \theta) = \prod_{j=1}^{r} (p^{\theta}_{.j})^{n_{.j}}$, where $n_{.j}$ denotes the number of repetitions of $b_j = A_j$ in the sample of size $N$ (the number of times that the reporter says that the outcome of the experiment belongs to $A_j$). The logarithm of this likelihood function will be denoted by
\[
L^{\mathbf{y}}(\theta) = \log p(\mathbf{y}; \theta) = \sum_{i=1}^{N} \log p(y_i; \theta) = \sum_{j=1}^{r} n_{.j} \log p^{\theta}_{.j}.
\]
We call $p(\mathbf{y}; \theta)$ the visible likelihood function, because we can compute it based on the available data only, that is the observed sample $\mathbf{y}$. It is also sometimes called the marginal likelihood of the observed data in the EM literature, not to be confused with the marginal likelihood in a Bayesian context (see [3], for instance).
Face likelihood function. Note that $p(\mathbf{y}; \theta)$ differs from the quantity
\[
\lambda(\mathbf{y}; \theta) = \prod_{j=1}^{r} P(X \in A_j; \theta)^{n_{.j}}
\]
called the “face likelihood” in Ref. [9,26]. The latter quantity does not refer to the observation process, and replaces the probability of reporting $A_j$ as the result of an observation (i.e. $P(Y = A_j)$) by the probability that the true outcome falls inside the set $A_j$, $P(X \in A_j)$. In particular, the occurrence of event “$X \in A_j$” is a consequence of, but does not necessarily coincide with, the outcome “$Y = A_j$”. In our context, $p(\mathbf{y}; \theta)$ represents the probability of occurrence of the result “$(Y_1, \ldots, Y_N) = \mathbf{y}$”, given the hypothesis $\theta$. Therefore given two arbitrary different samples $\mathbf{y} \neq \mathbf{y}'$ the respective events “$(Y_1, \ldots, Y_N) = \mathbf{y}$” and “$(Y_1, \ldots, Y_N) = \mathbf{y}'$” are mutually exclusive. In contrast, $\lambda(\mathbf{y}; \theta)$ denotes the probability of occurrence of the event $(X_1, \ldots, X_N) \in G_1 \times \ldots \times G_N$, where $G_i = A_j$ if $Y_i = A_j$. Events of this form may overlap, in the sense that, given two different samples $\mathbf{y} \neq \mathbf{y}'$, the corresponding events $(X_1, \ldots, X_N) \in G_1 \times \ldots \times G_N$ and $(X_1, \ldots, X_N) \in G'_1 \times \ldots \times G'_N$ are not necessarily mutually exclusive. Therefore $\lambda(\mathbf{y}; \theta)$ cannot be regarded as a likelihood in the sense of Edwards ([15]).
However, under the CAR assumption and the assumption of distinctness of parameters, maximizing the face likelihood is the same as maximizing the visible likelihood. Jaeger ([25], Th. 2.18) points out that under CAR, the following equality obtains:
\[
P(Y = A | X = x) = P(Y = A | X \in A) = \frac{P(Y = A)}{P(X \in A)}.
\]
This is easy to see noticing that since, from CAR, $P(Y = A | X = x) = k_A$ does not depend on $x \in A$, the equality
\[
P(X = x | Y = A) \cdot P(Y = A) = P(Y = A | X = x) \cdot P(X = x)
\]
implies
\[
\sum_{x \in A} P(X = x | Y = A) \cdot P(Y = A) = k_A \cdot \sum_{x \in A} P(X = x).
\]
Hence $P(X \in A | Y = A) \cdot P(Y = A) = k_A \cdot P(X \in A)$, but $P(X \in A | Y = A) = 1$, so that $P(Y = A) = k_A \cdot P(X \in A)$. So, up to the factor $k_A$, $P(Y = A)$ only depends on the probability of event $A$ on $\mathcal{X}$. We can deduce from this fact that, under the assumption of separability with respect to $M$ (also referred to as distinctness of the parameters), the arguments of the maxima of the visible and the face likelihoods do coincide. Further comments about this equivalence will be provided in Subsection 4.5.
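The coincidence of the two maximizers can be checked numerically on the coin setting of Example 1 under CAR ($\alpha = \beta$). The sketch below is an illustration only: the counts are hypothetical, and a plain grid search over $p$ stands in for an exact optimization.

```python
import math

N_H, N_T, N_HT = 4, 2, 4  # hypothetical counts of the reports {h}, {t}, {h,t}


def visible_loglik(p, alpha):
    """Log of the visible likelihood under CAR (alpha = beta)."""
    return (N_H * math.log((1 - alpha) * p)
            + N_T * math.log((1 - alpha) * (1 - p))
            + N_HT * math.log(alpha))


def face_loglik(p):
    """Log of the face likelihood; P(X in {h,t}) = 1 contributes nothing."""
    return N_H * math.log(p) + N_T * math.log(1 - p)


grid = [i / 1000 for i in range(1, 1000)]
alpha = 0.3  # any fixed value in (0, 1): it only shifts the visible log-likelihood
best_visible = max(grid, key=lambda p: visible_loglik(p, alpha))
best_face = max(grid, key=face_loglik)
```

For a fixed `alpha`, the two log-likelihoods differ by a term that does not involve `p`, so both searches return the same grid point, close to $4/6$. This relies on `alpha` and `p` being distinct parameters, as required by the separability assumption.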
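Latent likelihood function. $p(\mathbf{x}; \theta) = \prod_{i=1}^{N} p(x_i; \theta) = \prod_{k=1}^{m} (p^{\theta}_{k.})^{n_{k.}}$, where $n_{k.}$ denotes the number of occurrences of $a_k$ in the sample $\mathbf{x} = (x_1, \ldots, x_N)$. This sample is never observed directly, by assumption. However it must be in agreement with the observed sample $\mathbf{y}$. Each virtual sample of $X$ in $\mathcal{X}^{\mathbf{y}}$ yields a possible likelihood function $p(\mathbf{x}; \theta)$ in agreement with the actual observations $\mathbf{y}$. The logarithm of this likelihood function will be denoted by
\[
L^{\mathbf{x}}(\theta) = \log p(\mathbf{x}; \theta) = \sum_{i=1}^{N} \log p(x_i; \theta) = \sum_{k=1}^{m} n_{k.} \log p^{\theta}_{k.},
\]
where $n_{k.} = \sum_{j=1}^{r} n_{kj}$ and $n_{kj}$ is such that $\sum_{k=1}^{m} n_{kj} = n_{.j}$, the number of times $Y = A_j$ has been observed. Note that the definition of $n_{k.}$ is in agreement with $\mathbf{x} \in \mathcal{X}^{\mathbf{y}}$. We call $p(\mathbf{x}; \theta)$ the latent likelihood function, because $\mathbf{x}$ is not actually observed, nor are the $n_{k.}$’s, since only the $n_{.j}$’s are.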
Total likelihood function. $p(\mathbf{z}; \theta) = \prod_{i=1}^{N} p(z_i; \theta) = \prod_{k=1}^{m} \prod_{j=1}^{r} (p^{\theta}_{kj})^{n_{kj}}$ is the likelihood function induced by the whole artificial sample $\mathbf{z}$; we call it the total likelihood. Again, it must be in agreement with the observed sample $\mathbf{y} = (y_1, \ldots, y_N)$, which is fixed by assumption. Each virtual sample of $Z$ in $\mathcal{Z}^{\mathbf{y}}$ yields a possible likelihood function $p(\mathbf{z}; \theta)$ in agreement with the actual observations $\mathbf{y}$. We will denote its logarithm by
\[
L^{\mathbf{z}}(\theta) = \log p(\mathbf{z}; \theta) = \sum_{i=1}^{N} \log p(z_i; \theta) = \sum_{k=1}^{m} \sum_{j=1}^{r} n_{kj} \log p^{\theta}_{kj}.
\]
Maximizing $p(\mathbf{z}; \theta)$ allows us to introduce assumptions on the measurement process, either in terms of imprecisiation or disambiguation. Namely, the conditional probabilities $p_{.j|k.}$ may be known because, for instance, the superset assumption is made. Alternatively, probabilities $p_{k.|.j}$ could be known, which, along with a particular distribution on $\mathcal{Y}$ (to be estimated from observations $\mathbf{y}$), is enough to derive a concrete distribution on $Z$ and therefore on $X$. More generally, there may be some dependence between the process driving the latent variable $X$ and the measurement process driving the actual observations $\mathbf{y}$. In this case, maximizing $L^{\mathbf{z}}(\theta)$ enables this kind of additional information to be accounted for.
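Remark 3.1. In the above expressions, we use the convention $0^0 = 1$. In other words, the expression $\prod_{k=1}^{m} \prod_{j=1}^{r} p_{kj}^{n_{kj}}$ replaces the formally correct expression
\[
\prod_{(k,j) \in \{1,\ldots,m\} \times \{1,\ldots,r\} \,:\, n_{kj} \neq 0} p_{kj}^{n_{kj}}.
\]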
Example 2. Consider again Example 1, i.e. the coin tossing experiment, assuming for 10 tosses that Peter reports 4 times Heads, 2 times Tails and 4 times nothing. Let us write the four likelihood functions.
• Visible likelihood: $p(\mathbf{y}; \theta) = P(\{\{h\}\})^4 \cdot P(\{\{t\}\})^2 \cdot P(\{\{h,t\}\})^4 = [(1-\alpha)p]^4 [(1-\alpha)(1-p)]^2 \alpha^4$, using the parameters introduced earlier (with $\alpha = \beta$, as under CAR). Note that $P(\{\{h\}\}) + P(\{\{t\}\}) + P(\{\{h,t\}\}) = 1$ as it is a probability distribution on $2^{\{h,t\}}$.
• Face likelihood: $\lambda(\mathbf{y}; \theta) = P(\{h\})^4 \cdot P(\{t\})^2 = p^4 (1-p)^2$ since $P(\{h,t\}) = 1$. Optimizing it comes down to forgetting the missing information.
• Latent (hidden) likelihood: $p(\mathbf{x}; \theta) = p^{4+n_{13}} \cdot (1-p)^{6-n_{13}}$, where $n_{13}$ is the unknown number of times Peter does not report when the result is Heads.
• Total likelihood: $p(\mathbf{z}; \theta) = [(1-\alpha)p]^4 [(1-\alpha)(1-p)]^2 (\alpha p)^{n_{13}} (\alpha(1-p))^{4-n_{13}}$. The four outcomes obtained by describing both the toss result and the report are evaluated.
3.2. Maximum likelihood strategies
In this paper we will compare different existing strategies of likelihood maximization, based on a sequence of observations $y = (y_1, \dots, y_N) \in \mathcal{Y}^N$:
• Maximizing $L^{y}(\theta)$. The argument of the maximum of $L^{y}$, considered as a mapping defined on $\Theta$, is called the maximum likelihood estimator (MLE), i.e.:
$$\hat{\theta} = \arg\max_{\theta\in\Theta} L^{y}(\theta) = \arg\max_{\theta\in\Theta} \prod_{j=1}^{r} (p^{\theta}_{.j})^{n_{.j}}.$$
Note that this maximization process does not need any reference to the non-observed variable X. From optimizing $L^{y}(\theta)$, what is obtained is a probability distribution on $\mathcal{Y}$, which, as already suggested, can also be viewed as a Dempster–Shafer mass assignment $m_\theta$ on $\wp(\mathcal{X})$, letting $m_\theta(A_j) = p^{\theta}_{.j}$ for $j = 1, \dots, r$. But a concrete choice of $\theta \in \Theta$ also leads us to select a specific joint distribution $(p^{\theta}_{ij})_{i=1,\dots,m;\,j=1,\dots,r}$ on $\mathcal{X} \times \mathcal{Y}$. When the argument of the maximum of the log-likelihood function $L^{y}$ is not unique, the MLE determines a collection of joint distributions on $\mathcal{X} \times \mathcal{Y}$. Under some circumstances, this collection of joint distributions coincides with the credal set associated to a mass function, and such a mass function determines a unique distribution on $\mathcal{Y}$. We will provide a brief discussion about such a situation in Example 3. The EM algorithm [29] is an iterative technique that uses a latent variable X in order to reach a local maximum of $L^{y}$ when its optimization is tricky. In this case, we also obtain a precise imputation $x \in \mathcal{X}^{y}$. The latent variable is sometimes fictitious, as in the case of learning a mixture of normal distributions [11].
• Maximizing $L^{x}(\theta)$. This is the genuine goal if one is interested in finding the MLE for X despite the imprecise data. However, since the precise sample x is not available, there is a set $L^{\mathcal{X}^{y}}(\theta) = \{L^{x}(\theta) : x \in \mathcal{X}^{y}\}$ of possible likelihood functions [18]. So we must find not only an optimal value of θ, but also an optimal sample x, according to some strategy. There are two obvious strategies that come to mind:
1. Maximax strategy: find a pair $(x^{**}, \theta^{**}) \in \mathcal{X}^N \times \Theta$ satisfying
$$(x^{**}, \theta^{**}) = \arg\max_{x\in\mathcal{X}^{y},\,\theta\in\Theta} L^{x}(\theta) = \arg\max_{x\in\mathcal{X}^{y},\,\theta\in\Theta} \prod_{k=1}^{m} (p^{\theta}_{k.})^{n_{k.}}.$$
It comes down to maximizing an upper log-likelihood function $\overline{L}^{x}(\theta) = \max\{L^{x}(\theta) : x \in \mathcal{X}^{y}\}$, which can be viewed as an optimistic strategy; it tends to favor distributions with small entropy, under certain conditions, as we shall see later. The maximax technique has been proposed by E. Hüllermeier [22] using more general loss functions. His paper makes the point that the choice of an optimal pair $(x^{**}, \theta^{**})$ leads to a simultaneous selection of a best model together with a disambiguation of the imprecise observations.
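2. Maximin strategy: find a pair $(x^{**}, \theta^{**}) \in \mathcal{X}^N \times \Theta$ satisfying:
$$(x^{**}, \theta^{**}) = \arg\max_{\theta\in\Theta}\,\min_{x\in\mathcal{X}^{y}} L^{x}(\theta) = \arg\max_{\theta\in\Theta}\,\min_{x\in\mathcal{X}^{y}} \prod_{k=1}^{m} (p^{\theta}_{k.})^{n_{k.}},$$
where $n_{k.} = \sum_{i=1}^{N} 1_{\{a_k\}}(x_i)$ denotes the number of repetitions of $a_k$ in the sample x. It comes down to maximizing a lower log-likelihood function $\underline{L}^{x}(\theta) = \min\{L^{x}(\theta) : x \in \mathcal{X}^{y}\}$, which can be viewed as a robust strategy that copes with the imprecision of the likelihood function; it tends to favor distributions with large dispersion, as we shall see later. The maximin technique has been proposed in Ref. [18].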
Note that one might object to these approaches. First, considering $\overline{L}^{x}(\theta)$ or $\underline{L}^{x}(\theta)$ requires the comparison of values of $L^{x}(\theta)$ for several samples x, which maximum likelihood advocates will strongly question. According to them, one cannot compare likelihood functions coming from distinct data sets [15]. However, one may reply that, in our case, the data set is unique but ill-known, and the considered samples are in agreement with the same body of observations y. It is then sure that the true likelihood function lies in the interval $[\underline{L}^{x}(\theta), \overline{L}^{x}(\theta)]$. Maximizing one of its bounds is a usual strategy in the face of intervals. We may also use any other strategy that compares intervals, for example the safe but very demanding (if not impossible to reach) partial interval ordering requirement $\underline{L}^{x}(\theta^{\star}) > \overline{L}^{x}(\theta)$, $\forall \theta \neq \theta^{\star}$.
Hüllermeier [22] justifies the maximax approach by saying that if $L^{x_1}(\theta_1) > L^{x_2}(\theta_2)$ then the sample $x_1$ is arguably more plausible than the sample $x_2$, simply because the first instantiation allows for a much better fit to the model based on $\theta_1$ than the second one based on $\theta_2$. This philosophy leads to disambiguating the imprecise data through the choice of the best model. However, this line of reasoning makes sense if we are sure that the random process generating X follows a model in the prescribed class parameterized by θ. Then it is natural to consider that the most plausible values compatible with the imprecise observations are those which enable a best fit with the class of parameterized models, so that one may select at the same time the best model and the best sample that justifies it. In that case, the resulting disambiguation is a form of data reconciliation [14]. In contrast, if the set of parameterized models is chosen for its computational simplicity and is known to be an approximation of the real phenomenon, the disambiguation rationale of the maximax approach is then not so strong.
• Maximizing $L^{z}(\theta)$. As said earlier, this is the natural way to go if some information regarding the dependence between the latent variable X and its measurement process is available, for instance when the superset or the CAR assumption is made. We can again adopt maximax or maximin strategies, since the full sample z is not available, and only the observations y are. There is also an iterative strategy that exploits the links between X and Y, such as the EM algorithm, which maximizes $L^{y}(\theta)$ via the production of a fake sample z.
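1. The maximax strategy aims at finding the pair $(z^{*}, \theta^{*}) \in \mathcal{Z}^N \times \Theta$ that maximizes the function $L^{z}(\theta)$:
$$(z^{*}, \theta^{*}) = \arg\max_{z\in\mathcal{Z}^{y},\,\theta\in\Theta} L^{z}(\theta) = \arg\max_{z\in\mathcal{Z}^{y},\,\theta\in\Theta} \prod_{k=1}^{m}\prod_{j=1}^{r} (p^{\theta}_{kj})^{n_{kj}}.$$
It comes down to maximizing an upper log-likelihood function $\overline{L}^{\mathcal{Z}^{y}}(\theta) = \max\{L^{z}(\theta) : z \in \mathcal{Z}^{y}\}$. The complete sample $z^{*}$ also yields an optimal sample $x^{*} \in \mathcal{X}^{y}$, since $L^{z}(\theta)$ can be viewed as a function $f^{y} : \mathcal{X}^N \times \Theta \to \mathbb{R}$ that only depends on x. This maximization procedure has been considered in [23] under the superset assumption.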
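2. It is clear that we can similarly envisage the corresponding maximin strategy and find the pair $(x^{*}, \theta^{*}) \in \mathcal{X}^N \times \Theta$ induced by the pair $(z^{*}, \theta^{*})$ that maximizes the lower log-likelihood function $\underline{L}^{\mathcal{Z}^{y}}(\theta) = \min\{L^{z}(\theta) : z \in \mathcal{Z}^{y}\}$:
$$(z^{*}, \theta^{*}) = \arg\max_{\theta\in\Theta}\,\min_{z\in\mathcal{Z}^{y}} L^{z}(\theta) = \arg\max_{\theta\in\Theta}\,\min_{z\in\mathcal{Z}^{y}} \prod_{k=1}^{m}\prod_{j=1}^{r} (p^{\theta}_{kj})^{n_{kj}}.$$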
3. Guess an initial value of θ, which makes it possible to construct a fictitious sample z; then, an MLE of θ for this sample can be found, and this process is iterated until convergence. This kind of strategy is adopted by the EM algorithm, which tries to find a probability model $p_\theta$ as close as possible to the empirical distribution of a fake sample $z \in \mathcal{Z}^{y}$, in the sense of the Kullback–Leibler divergence [29] (see also [6]).
In this paper, we focus on the maximax and maximin strategies for maximizing $L^{x}$ and $L^{z}$.
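As an illustration of the maximax and maximin strategies applied to the hidden likelihood, the following minimal Python sketch revisits the coin example (4 reported Heads, 2 reported Tails, 4 non-reports). It assumes a Bernoulli model with parameter p, indexes the admissible completions x by the number n13 of unreported Heads, and replaces exact optimization by a simple grid search over p; the grid, the function name and the variable names are illustrative assumptions.

```python
from math import log

def log_lik(p, n13):
    """Hidden log-likelihood L^x(p) for the completion with n13 unreported Heads."""
    return (4 + n13) * log(p) + (6 - n13) * log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)
completions = range(5)                     # n13 = 0, ..., 4

# Maximax: best pair (completion, parameter) over all admissible completions.
_, p_maximax, n13_maximax = max(
    (log_lik(p, n), p, n) for p in grid for n in completions
)

# Maximin: parameter whose worst-case completion is as good as possible.
p_maximin = max(grid, key=lambda p: min(log_lik(p, n) for n in completions))

print("maximax:", p_maximax, n13_maximax)
print("maximin:", p_maximin)
```

Under these assumptions, the maximax search reads all unreported tosses as Heads (n13 = 4) and returns p = 0.8, the least dispersed admissible fit, whereas the maximin search returns p = 0.5, the most dispersed one.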
3.3. Connections between MLE strategies
Under some particular conditions about the matrices M and M′, some of the above maximization procedures may coincide. Below, some results are provided. There are two kinds of results: those that relate the total likelihood function $p(z;\theta)$ and the latent one $p(x;\theta)$ under suitable assumptions, and those that relate the total likelihood function $p(z;\theta)$ and the visible one $p(y;\theta)$. This is done by introducing assumptions about the incomplete data or the conditional distributions describing the measurement process. It comes down to some information about the matrices M and M′.
A first issue concerns the parameter θ, which so far is used in the three likelihood functions as driving the joint distribution on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, hence the respective marginals on $\mathcal{X}$ and $\mathcal{Y}$. In some situations, X and Y are driven by distinct parameters $\theta_1, \theta_2$.
The following result concerns the disambiguation point of view and involves matrix M′.
Definition 1. We say that the parameter $\theta \in \Theta$ is separable with respect to the matrix (M′|p′) if it can be “separated” into two (maybe multidimensional) components $\theta_1 \in \Theta_1$, $\theta_2 \in \Theta_2$ such that $\Theta = \Theta_1 \times \Theta_2$, where $p^{\theta}_{.j}$ and $p^{\theta}_{k.|.j}$ can be respectively written as functions of $\theta_1$ and $\theta_2$.
Proposition 1. $\bigcup_{z\in\mathcal{Z}^{y}} \arg\max_{\theta\in\Theta} L^{z}(\theta) \subseteq \arg\max_{\theta\in\Theta} L^{y}(\theta)$ provided that θ is separable wrt (M′|p′).3
3 Remember that arg max
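Proof. Let $y \in \mathcal{Y}^N$ denote the observed sample. Let us select an arbitrary complete sample $z \in \mathcal{Z}^{y}$. Then
$$p(z;\theta) = \prod_{j=1}^{r}\prod_{k=1}^{m} (p^{\theta}_{kj})^{n_{kj}} = \prod_{j=1}^{r}\prod_{k=1}^{m} (p^{\theta}_{k.|.j}\cdot p^{\theta}_{.j})^{n_{kj}} = \prod_{j=1}^{r} (p^{\theta}_{.j})^{\sum_{k=1}^{m} n_{kj}} \cdot \prod_{j=1}^{r}\prod_{k=1}^{m} (p^{\theta}_{k.|.j})^{n_{kj}} = \prod_{j=1}^{r} (p^{\theta}_{.j})^{n_{.j}} \cdot \prod_{j=1}^{r}\prod_{k=1}^{m} (p^{\theta}_{k.|.j})^{n_{kj}}.$$
Thus, if θ is separable wrt (M′|p′) we can write:
$$p(z;\theta) = \prod_{j=1}^{r} (p^{\theta_1}_{.j})^{n_{.j}} \prod_{j=1}^{r}\prod_{k=1}^{m} (p^{\theta_2}_{k.|.j})^{n_{kj}}.$$
Thus, if $\theta^{*}$ is an optimal parameter such that $L^{z}(\theta^{*}) = \max_{\theta\in\Theta} L^{z}(\theta)$, then its projection on $\Theta_1$, $\theta_1^{*} \in \Theta_1$, must necessarily satisfy the equality $L^{y}(\theta_1^{*}) = \max_{\theta_1\in\Theta_1} L^{y}(\theta_1)$. ✷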
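The factorization used in this proof can be checked numerically. The sketch below is only an illustration: the joint probability table p, the count table n, and the variable names are hypothetical values chosen for the check, with m = 3 latent values and r = 2 observable values.

```python
from math import log, isclose

# Hypothetical joint probabilities p[k][j] = P(X = a_k, Y = b_j); entries sum to 1.
p = [[0.10, 0.20],
     [0.30, 0.05],
     [0.15, 0.20]]
# Hypothetical counts n[k][j] of the pair (a_k, b_j) in a complete sample z.
n = [[1, 2],
     [3, 0],
     [2, 2]]
m, r = len(p), len(p[0])

# Marginals on the observed variable: p_{.j} and n_{.j}.
p_col = [sum(p[k][j] for k in range(m)) for j in range(r)]
n_col = [sum(n[k][j] for k in range(m)) for j in range(r)]

# Total log-likelihood: sum over (k, j) of n_kj * log p_kj.
log_total = sum(n[k][j] * log(p[k][j]) for k in range(m) for j in range(r))

# Visible log-likelihood: sum over j of n_.j * log p_.j.
log_visible = sum(n_col[j] * log(p_col[j]) for j in range(r))

# Conditional part: sum over (k, j) of n_kj * log p_{k.|.j}.
log_conditional = sum(
    n[k][j] * log(p[k][j] / p_col[j]) for k in range(m) for j in range(r)
)

print(isclose(log_total, log_visible + log_conditional))
```

When θ is separable, the two terms on the right depend on different components of θ, which is why maximizing the total likelihood also maximizes the visible one.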
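Now let us check the consequence of the uniform conditional distribution assumption.
Proposition 2. Let $y = (y_1, \dots, y_N) \in \mathcal{Y}^N$ denote the observed sample. Let us suppose that $\prod_{j=1}^{r}\prod_{k:n_{kj}\neq 0} p_{k.|.j}^{n_{kj}}$ is a value c that does not depend on the particular choice of $z \in \mathcal{Z}^{y}$, nor on θ. Then for every $z \in \mathcal{Z}^{y}$ we have $p(z;\theta) = c\,p(y;\theta)$ and therefore $\arg\max_{(x,\theta)\in\mathcal{X}^{y}\times\Theta} p(z;\theta) = \arg\max_{\theta\in\Theta}\arg\min_{z\in\mathcal{Z}^{y}} p(z;\theta) = \arg\max_{\theta\in\Theta} p(y;\theta)$.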
Proof. Using the previous proof, we already have that
$$p(z;\theta) = \prod_{j=1}^{r} p_{.j}^{n_{.j}} \prod_{k:n_{kj}\neq 0} (p_{k.|.j})^{n_{kj}} = c\,p(y;\theta).$$
✷
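Proposition 3. Let $y = (y_1, \dots, y_N) \in \mathcal{Y}^N$ denote the observed sample. Let us consider the uniform conditional distribution assumption. Then $\prod_{j=1}^{r}\prod_{k:n_{kj}\neq 0} p_{k.|.j}^{n_{kj}}$ is a value c that does not depend on the particular choice of $z \in \mathcal{Z}^{y}$, nor on θ.
Proof. Under the uniform conditional distribution assumption we have:
$$p_{k.|.j} = \begin{cases} \dfrac{1}{\# A_j} & \text{if } a_k \in A_j, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore,
$$\prod_{j=1}^{r}\prod_{k:n_{kj}\neq 0} p_{k.|.j}^{n_{kj}} = \prod_{j=1}^{r} \left(\frac{1}{\# A_j}\right)^{\sum_{k:n_{kj}\neq 0} n_{kj}} = \prod_{j=1}^{r} \left(\frac{1}{\# A_j}\right)^{n_{.j}}.$$
✷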
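The constancy claimed in Proposition 3 can be verified by brute force on a small case. The following Python sketch is an illustration under assumed data: the latent space {1, 2, 3}, the observed sets, and the function name are hypothetical. It enumerates every latent sample x compatible with the observations and computes the product of the uniform conditional probabilities.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical observed sample: each y_i is a set of possible latent outcomes.
y = [frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({2, 3}), frozenset({1, 2})]

def conditional_product(x, y):
    """Product over i of P(X = x_i | Y = y_i) under the uniform conditional assumption."""
    c = Fraction(1)
    for xi, yi in zip(x, y):
        c *= Fraction(1, len(yi)) if xi in yi else Fraction(0)
    return c

# All latent samples x compatible with y (x_i in y_i for every i).
values = {conditional_product(x, y) for x in product(*y)}

# Closed form of Proposition 3: product over j of (1 / #A_j) ** n_.j.
expected = Fraction(1)
for A, count in Counter(y).items():
    expected *= Fraction(1, len(A)) ** count

print(values, expected)
```

Exact rational arithmetic is used so that the comparison between the enumerated values and the closed form does not depend on floating-point rounding.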
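Corollary 4. If the uniform conditional distribution assumption holds then, for all $z \in \mathcal{Z}^{y}$, $p(z;\theta) = c\,p(y;\theta)$, where c depends neither on the particular $z \in \mathcal{Z}^{y}$ nor on θ, and therefore
$$\arg\max_{(x,\theta)\in\mathcal{X}^{y}\times\Theta} p(z;\theta) = \arg\max_{\theta\in\Theta}\arg\min_{z\in\mathcal{Z}^{y}} p(z;\theta) = \arg\max_{\theta\in\Theta} p(y;\theta).$$
The next results concern the imprecisiation process and involve matrix M: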
Definition 2. We say that the parameter $\theta \in \Theta$ is separable with respect to the matrix (M|p) if it can be “separated” into two (maybe multidimensional) components $\theta_3 \in \Theta_3$, $\theta_4 \in \Theta_4$ such that $\Theta = \Theta_3 \times \Theta_4$ and $p^{\theta}_{.j|k.}$ and $p^{\theta}_{k.}$ can be respectively written as functions of $\theta_3$ and $\theta_4$.
This type of separability corresponds to the notion of “distinct parameters” in the sense of Heitjan and Rubin [21] in the context of coarse data, and Little and Rubin [28] in the context of missing data.
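Proposition 5. If θ is separable wrt (M|p) then, given a specific sample $x \in \mathcal{X}^N$ and the corresponding $z \in (\mathcal{X} \times \mathcal{Y})^N$ induced by x and y, $\arg\max_{\theta\in\Theta} L^{z}(\theta) \subseteq \arg\max_{\theta\in\Theta} L^{x}(\theta)$.
Proof. The proof of this result is similar to the one given in Proposition 1. ✷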
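Remark 3.2. Proposition 5 assumes a fixed sample $x \in \mathcal{X}^N$. Let us notice that the separability wrt M does not imply that the respective solutions of both maximax problems, $\theta^{**} = \arg\max_{\theta\in\Theta} \overline{L}^{\mathcal{X}^{y}}(\theta)$ and $\theta^{*} = \arg\max_{\theta\in\Theta} \overline{L}^{\mathcal{Z}^{y}}(\theta)$, coincide. They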
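Proposition 6. Let $y = (y_1, \dots, y_N) \in \mathcal{Y}^N$ denote the observed sample. Let us suppose that $\prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{.j|k.}^{n_{kj}}$ is a value c that does not depend on the particular choice of $z \in \mathcal{Z}^{y}$, nor on θ. Then, for all $x \in \mathcal{X}^{y}$ and the corresponding $z \in \mathcal{Z}^{y}$ we have $p(z;\theta) = c\,p(x;\theta)$ and therefore $\arg\max_{(x,\theta)\in\mathcal{X}^{y}\times\Theta} p(x;\theta) = \arg\max_{(x,\theta)\in\mathcal{X}^{y}\times\Theta} p(z;\theta)$.
Proof.
$$p(z;\theta) = \prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{kj}^{n_{kj}} = \prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} (p_{.j|k.}\cdot p_{k.})^{n_{kj}} = \prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{.j|k.}^{n_{kj}}\cdot p_{k.}^{n_{kj}} = \prod_{k=1}^{m} p_{k.}^{\sum_{j=1}^{r} n_{kj}} \cdot \prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} (p_{.j|k.})^{n_{kj}} = \prod_{k=1}^{m} p_{k.}^{n_{k.}} \cdot c = p(x;\theta)\cdot c.$$
✷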
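Proposition 7. Let $y = (y_1, \dots, y_N) \in \mathcal{Y}^N$ denote the observed sample. Let us suppose that $\{A_1, \dots, A_r\}$ forms a partition of $\mathcal{X}$ or that the superset assumption is satisfied. Then $\prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{.j|k.}^{n_{kj}}$ is a constant c.
Proof. On the one hand, we can easily check that, if $\{A_1, \dots, A_r\}$ forms a partition of $\mathcal{X}$, then $\prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{.j|k.}^{n_{kj}} = 1$. Now, let us check that the above condition holds under the superset assumption. Under the superset assumption we have already shown that $p_{.j|k.} = \frac{1}{2^{m-1}}$ if $a_k \in A_j$, and 0 otherwise. Therefore,
$$\prod_{k=1}^{m}\prod_{j:n_{kj}\neq 0} p_{.j|k.}^{n_{kj}} = \prod_{k=1}^{m} \left(\frac{1}{2^{m-1}}\right)^{\sum_{j:n_{kj}\neq 0} n_{kj}} = \prod_{k=1}^{m} \left(\frac{1}{2^{m-1}}\right)^{n_{k.}} = \left(\frac{1}{2^{m-1}}\right)^{N}.$$
✷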
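The superset case of Proposition 7 can also be checked on a toy instance. The Python sketch below is a hedged illustration: it assumes that the observable sets are all the non-empty subsets of a latent space with m = 3 elements and that, given X = a, each of the subsets containing a is reported with equal probability; the sample y and all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

X = (1, 2, 3)  # hypothetical latent space, m = 3
m = len(X)
# All non-empty subsets of X play the role of the observable sets A_j (assumption).
subsets = [frozenset(c) for size in range(1, m + 1) for c in combinations(X, size)]

def p_obs_given_latent(A, a):
    """P(Y = A | X = a): uniform over the subsets that contain a, 0 elsewhere."""
    supersets = [B for B in subsets if a in B]  # there are 2 ** (m - 1) of them
    return Fraction(1, len(supersets)) if a in A else Fraction(0)

# Hypothetical observed sample of N = 3 set-valued observations.
y = [frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({3})]
N = len(y)

# Product of P(y_i | x_i) for every latent sample x compatible with y.
values = set()
for x in product(*y):
    c = Fraction(1)
    for xi, yi in zip(x, y):
        c *= p_obs_given_latent(yi, xi)
    values.add(c)

print(values, Fraction(1, 2 ** (m - 1)) ** N)
```

Every compatible latent sample yields the same product, which matches the closed form $(1/2^{m-1})^N$ and therefore does not affect the maximization over x and θ.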
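Corollary 8. If any of the following conditions is satisfied:
• $\{A_1, \dots, A_r\}$ forms a partition of $\mathcal{X}$
• The superset assumption holds
then $p(z;\theta) = c\,p(x;\theta)$ and therefore $\arg\max_{(x,\theta)\in\mathcal{X}^N\times\Theta} p(x;\theta) = \arg\max_{(x,\theta)\in\mathcal{X}^N\times\Theta} L^{z}(\theta)$. Furthermore, c = 1 in the first case.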
Most approaches in statistical inference insist on the necessity of having a statistical model of the observation process. In that case the natural likelihood function to maximize is $p(z;\theta)$. In contrast, if we maximize the latent likelihood function $p(x;\theta)$ directly with respect to both x and θ, we in some way give up the idea of providing a statistical model for the observation (imprecisiation) process. The superset assumption can be used to justify the use of $p(x;\theta)$ by providing such a statistical model, since in that case maximizing $p(z;\theta)$ is the same as maximizing $p(x;\theta)$ (this is the message apparently carried by the authors of [23], for instance). When no information about the measurement process is available, it is to us an open question whether one should maximize $p(x;\theta)$ or $p(z;\theta)$ with respect to both x and θ. Indeed, the two corresponding MLEs may differ, as indicated by the examples in the following.
4. Comparing the maximum likelihood strategies on examples
Based on some of the results provided in Subsection 3.3, we can compare the various choices of likelihood functions (maximization of $L^{y}(\theta)$, $L^{x}(\theta)$, $L^{z}(\theta)$) in an imprecise environment, on the basis of the acceptability of the results obtained by their maximization on a number of prototypical examples. These examples shed light on the nature of the maximin and the maximax strategies, as opposed to maximizing the visible likelihood function. We consider several settings where imprecise data occur in various ways: the case where the possible observations form a partition of the space $\mathcal{X}$, the case where the data are either precise or missing, and the general case where imprecise data may overlap.
4.1. Observations forming a partition: separable case
The first example illustrates the situation where imprecise data form a partition of $\mathcal{X}$ and the parameter θ is separable.
Example 3. Consider the random experiment that consists of rolling a die. We do not know whether the die is fair or not. Suppose that the person who rolls it just tells us whether the outcome is even or odd. Let X be the random variable denoting the actual outcome of the roll ($a_i = i$ for $i = 1, \dots, 6$) and let Y be a binary variable taking the values $b_1$ (odd) and $b_2$ (even). So, the collection $\{A_1 = \{1,3,5\}, A_2 = \{2,4,6\}\}$ determines a partition of the whole set of possible outcomes $\mathcal{X} = \{1, \dots, 6\}$. Let the 6-dimensional vector $(p_{1.}, \dots, p_{6.})$ represent the actual (unknown) probability distribution of X, where $p_{i.} = P(X = a_i)$, $i = 1, \dots, 6$, and $p_{6.} = 1 - \sum_{i=1}^{5} p_{i.}$.
Let p.2=