A general framework for maximizing likelihood under incomplete data


This is an author's version published in: http://oatao.univ-toulouse.fr/22226

To cite this version: Couso, Inés and Dubois, Didier. A general framework for maximizing likelihood under incomplete data. (2018) International Journal of Approximate Reasoning, 93. 238-260. ISSN 0888-613X

Official URL / DOI: https://doi.org/10.1016/j.ijar.2017.10.030


A general framework for maximizing likelihood under incomplete data ✩,✩✩

Inés Couso a,∗, Didier Dubois b

a Dep. of Statistics and O.R., University of Oviedo, Spain
b IRIT, CNRS and Université de Toulouse, France

Abstract

Keywords: Random sets; Maximum likelihood; Incomplete information; Entropy

Maximum likelihood is a standard approach to computing a probability distribution that best fits a given dataset. However, when datasets are incomplete or contain imprecise data, a major issue is to properly define the likelihood function to be maximized. This paper highlights the fact that there are several possible likelihood functions to be considered, depending on the purpose to be addressed, namely whether the behavior of the imperfect measurement process causing incompleteness should be included or not in the model, and what are the assumptions we can make or the knowledge we have about this measurement process. Various possible approaches, that differ by the choice of the likelihood function and/or the attitude of the analyst in front of imprecise information, are comparatively discussed on examples, and some light is shed on the nature of the corresponding solutions.

1. Introduction

The key role of likelihood functions in statistical inference was first highlighted by Fisher [16] with the maximum likelihood principle. In his seminal book, Edwards ([15], p. 9) defines a likelihood function as being proportional to the probability of obtaining results given a hypothesis, according to a probability model:

Let $P(R|H)$ be the probability of obtaining results $R$ given the hypothesis $H$, according to the probability model ... The likelihood of the hypothesis $H$ given data $R$, and a specific model, is proportional to $P(R|H)$, the constant of proportionality being arbitrary.

Edwards mentions that “this probability is defined for any member of the set of possible results given any one hypothesis ... As such its mathematical properties are well-known. A fundamental axiom is that if $R_1$ and $R_2$ are two of the possible results, mutually exclusive, then $P(R_1 \text{ or } R_2|H)=P(R_1|H)+P(R_2|H)$”.

✩ This paper is part of the Virtual special issue on Soft Methods in Probability and Statistics, edited by Barbara Vantaggi, Maria Brigida Ferraro, Paolo Giordani.

✩✩ A preliminary version of this paper [7] was presented at the 8th Conference on Soft Methods in Probability and Statistics (SMPS) in Roma, September 12–14, 2016.

∗ Corresponding author.

E-mail addresses: [email protected] (I. Couso), [email protected] (D. Dubois).

In other words, a fundamental axiom is that the probability of obtaining at least one among two results is the sum of the probabilities of obtaining each of these results. In particular, a result in the sense of Edwards is not any kind of event, it is an elementary event. Only elementary events can be observed. For instance, when tossing a die, and seeing the outcome, you cannot observe the event “odd”, you can only see 1, 3 or 5. So, a likelihood function is proportional to the conditional probability of an elementary event (the observed sample), where the condition part (the hypothesis) is a value of some model parameter. For instance, the conditional probability of the sure event cannot be viewed as the likelihood of the hypothesis given the sure event.

If this point of view is accepted, what becomes of the likelihood function under incomplete or imprecise observations? To properly answer this question, one must understand what a result is in this context. Namely, if we are interested in a certain random phenomenon modeled by a random variable, the observations we get in this case may not directly inform us about this random variable. Due to the interference with an imperfect measurement process, observations will be set-valued [4,5]. So, in order to properly exploit such incomplete information (called coarse data in the literature [21]), we must first decide what to model:

1. the random phenomenon through its measurement process;

2. or the random phenomenon despite its measurement process.

In the first case, imprecise observations are considered as results, and we can construct the likelihood function of a random set, whose realizations are sets. These sets contain precise but ill-known realizations of the random variable of interest, to which we have no direct access. We say that this unreachable random variable is latent. Actually, most authors are interested in the other point of view. They consider that outcomes are the precise, although ill-observed, realizations of the random phenomenon, and wish to reconstruct a distribution for the latent variable. However, in this case there are as many potential likelihood functions as precise datasets in agreement with the imprecise observations. Authors have proposed several ways of addressing this issue. The most traditional approach is based on the EM algorithm [11,29,13], which is an iterative procedure for efficient maximization of the likelihood of observed data. It constructs a distribution on the latent variable that minimizes divergence from the parametric model in agreement with the available data. It can also serve to reconstruct a sample of the latent variable.

In this paper, we propose a formal setting for the modeling of imprecisely observed random experiments, and define the three likelihood functions that can be built in this framework. Apart from the likelihood function based on available observations, there is the likelihood function based on outcomes of the latent random variable that was imprecisely observed, and the likelihood function based on the joint probability induced by pairs of outcomes and their measurement. The two latter likelihood functions are imprecisely known and we compare several alternatives to the maximization of the likelihood of imprecise observations, such as the maximax approach, and the robust approach to incomplete data. This includes more recent proposals by Hüllermeier [22], Guillaume and Dubois [18], or Plass et al. [35]. We also discuss the use of assumptions on the measurement process such as the coarsening-at-random [21] and the superset assumptions, that help relate the various likelihood functions. Note that in this paper we do not consider the issue of imprecision due to too small a number of precise observations (see for instance, Masson and Denœux [32], or Serrurier and Prade [44]). We assume that the cause of imprecision lies in the incomplete description of the random experiment outcomes, not in the scarcity of observations.

2. The random phenomenon and its measurement process

Let a random variable $X:\Omega\to\mathcal{X}$ represent the outcome of a certain random experiment. For the sake of simplicity, let us assume that its range $\mathcal{X}=\{a_1,\dots,a_m\}$ is finite. Suppose that observations of $X$ are imprecise, namely let $\Gamma:\Omega\to\wp(\mathcal{X})$ denote the (observable) multi-valued mapping representing our (imprecise) perception of $X$. So, if $\omega$ occurs then all we know is that $X(\omega)\in\Gamma(\omega)\subseteq\mathcal{X}$. In other words, we assume that $X$ is a selection of $\Gamma$, i.e. $X(\omega)\in\Gamma(\omega)$, $\forall\omega\in\Omega$. This setting is very close to the one of Dempster [10] who introduces a special case of upper and lower probabilities, based on random sets, later interpreted by Shafer [45] as belief and plausibility functions. The issue of set-valued data has been discussed in Ref. [5] from the point of view of descriptive statistics. In this paper we start addressing inferential statistics.

Let $\mathrm{Im}(\Gamma)=\{A_1,\dots,A_r\}\subseteq\wp(\mathcal{X})$ denote the image of $\Gamma$ (the collection of possible set-valued outcomes). We can equivalently suppose that the imperfect measurement process is driven by another random variable $Y$, with finite range $\mathcal{Y}=\{b_1,\dots,b_r\}$, that provides incomplete reports of observations of $X$. Namely, $Y(\omega)=b_j$ means that the measurement tool reports $\Gamma(\omega)=A_j$. The cardinality of $\mathrm{Im}(Y)=\mathcal{Y}=\{b_1,\dots,b_r\}$ thus coincides with that of $\mathrm{Im}(\Gamma)$, and there is a bijection between $\mathrm{Im}(\Gamma)$ and $\mathcal{Y}$ as follows:

$$Y(\omega)=b_j \iff \Gamma(\omega)=A_j,\qquad j=1,\dots,r,$$

or yet we can assume that $b_j=A_j$. Let $P(X,Y)$ be the joint probability describing $X$ and its measurement.

In some applications, the variable $X$ is made of two components $X_o$ and $X_u$ respectively corresponding to observed and unobserved variables with respective domains $\mathcal{X}_o$ and $\mathcal{X}_u$, and $\Gamma$ is of the form $\{X_o\}\times\Gamma_u$, i.e., $Y=\{X_o\}\times\Gamma_u$. The observed variable $Y$ can then be identified with the random vector $Y=(X_o,Y_u)$, where $Y_u=\Gamma_u$.

This framework highlights the difference between the outcome $X=a_k$ (its probability is $P(X=a_k)$), the fact that event [...] measurement process (its probability is $P(Y=A_j)$). The latter is always an elementary event, even when it corresponds to an imprecise observation of $X=a_k$.

In the paper we assume the results of experiments are available in the form of relative frequencies $\hat{p}_{.j}=n_{.j}/n$, where $n_{.j}$ denotes the number of observations of $b_j=A_j$ in the sample, and $n$ is the sample size. The probability distribution $(\hat{p}_{.1},\dots,\hat{p}_{.r})$ on $\mathcal{Y}$ can also be viewed as a Dempster–Shafer mass assignment $m$ on $\wp(\mathcal{X})$ [45], letting $m(A_j)=\hat{p}_{.j}$ for $j=1,\dots,r$, inducing lower probabilities in the sense of [10] in the form of a belief function $\mathrm{Bel}(A)=\sum_{E\subseteq A}m(E)$. This Dempster–Shafer mass assignment defines a convex set $\{P_X: P_X(A)\geq \mathrm{Bel}(A),\ \forall A\subseteq\mathcal{X}\}$ of probabilities on $\mathcal{X}$, hence of joint probabilities on $\mathcal{X}\times\mathcal{Y}$ with known marginals $\hat{p}_{.j}$ for $j=1,\dots,r$ on $\mathcal{Y}$.
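As an illustration of this reading of observed frequencies as a mass assignment, the following minimal Python sketch computes the belief of an event, together with the dual plausibility. The function names and the counts are hypothetical and chosen only for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def belief(mass, event):
    """Bel(A): total mass of the focal sets included in A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal <= event)

def plausibility(mass, event):
    """Pl(A): total mass of the focal sets that intersect A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal & event)

# Hypothetical counts n_.j of each reported set A_j (not taken from the paper).
counts = {
    frozenset({"a1"}): 4,
    frozenset({"a2"}): 2,
    frozenset({"a1", "a2"}): 4,
}
n = sum(counts.values())

# Mass assignment m(A_j) = n_.j / n, i.e. the relative frequencies.
mass = {focal: Fraction(c, n) for focal, c in counts.items()}

bel_a1 = belief(mass, {"a1"})
pl_a1 = plausibility(mass, {"a1"})
```

Any probability on the outcome space that is compatible with this mass assignment gives the event {a1} a value between `bel_a1` and `pl_a1`.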

An alternative way of modeling the generation of coarse data consists in using so-called coarsening variables [20]. It supposes the existence of a random variable $C$ valued on a finite space $\mathcal{C}$, and a function $F:\mathcal{X}\times\mathcal{C}\to\wp(\mathcal{X})\setminus\{\emptyset\}$ such that $Y=F(X,C)$.

We overview below two different ways to represent the information about the joint distribution of the random vector $(X,Y)$. Subsection 2.1 will refer to the outcome of the experiment $X$ and the “coarsening” or “imprecisiation” process¹ that leads us to just get imprecise observations of $X$, described by $Y$. Subsection 2.2 will represent the joint distribution of $(X,Y)$ the other way around, by means of the marginal probability of the observations ($Y$) and the conditional probability of $X$ given $Y$. The “imprecisiation” and “disambiguation” views respectively correspond to what Little [27] calls selection models and pattern mixture models, albeit expressed in the framework of missing data using coarsening variables.

2.1. Generation and imprecisiation processes

Let us consider the following matrix $(M|p)$:

$$(M|p)=\left(\begin{array}{ccc|c}
p_{.1|1.} & \cdots & p_{.r|1.} & p_{1.}\\
\vdots & & \vdots & \vdots\\
p_{.1|m.} & \cdots & p_{.r|m.} & p_{m.}
\end{array}\right)$$

where

• $p_{.j|k.}=P(Y=A_j|X=a_k)$ denotes the (conditional) probability of observing $b_j=A_j$ if the true outcome is $a_k$, and

• $p_{k.}=P(X=a_k)$ denotes the probability that the true outcome is $a_k$.

Such a matrix determines the joint probability distribution $P(X,Y)$ modeling the underlying generating process plus the connection between true outcomes and incomplete observations. More specifically, the vector $(p_{1.},\dots,p_{m.})^T$ characterizes the underlying generating random process, while the matrix $M=(p_{.j|k.})_{k=1,\dots,m;\,j=1,\dots,r}$ is the so-called mixing matrix ([46]) that represents the imprecisiation process. In the setting of Dempster's upper and lower probabilities [10], nothing is assumed about the matrix $M$, and $(p_{1.},\dots,p_{m.})^T$ is unknown. This is not the case in more recent works whose aim is to retrieve information about $X$ from information about $Y$, using a model of the measurement process, by means of some assumption on the mixing matrix $M$.

2.1.1. Some particular settings and their characteristic matrices

• Partition. Suppose that $\{A_1,\dots,A_r\}$ forms a partition of $\mathcal{X}$. Therefore, we can easily observe that the probabilities $P(Y=A_j|X=a_k)=1$ if $a_k\in A_j$, and 0 otherwise, $\forall j,k$. Then, we can divide the $m$ elements of $\mathcal{X}$ into $r$ categories of respectively $k_1,\dots,k_r$ elements each. We can denote $\mathcal{X}=\{a_{11},\dots,a_{1k_1},\dots,a_{r1},\dots,a_{rk_r}\}$ and particularize the above matrix as follows:

$$\left(\begin{array}{ccc|c}
1 & \cdots & 0 & p_{11.}\\
\vdots & & \vdots & \vdots\\
1 & \cdots & 0 & p_{1k_1.}\\
\vdots & & \vdots & \vdots\\
0 & \cdots & 1 & p_{r1.}\\
\vdots & & \vdots & \vdots\\
0 & \cdots & 1 & p_{rk_r.}
\end{array}\right)$$

where $p_{il.}=P(X=a_{il})$ denotes the (marginal) probability that $X$ takes the value $a_{il}$. $Y$ is then a function of $X$, $Y=f(X)$. In this case, the joint distribution of $(X,Y)$ is determined by the marginal distribution of $X$. This procedure determines an equivalence relation over $\mathcal{X}$:

$$a_i\,R\,a_j \iff f(a_i)=f(a_j),$$

and therefore a collection of equivalence classes, $\Pi=\{A_1,\dots,A_r\}$, determining a partition of $\mathcal{X}$. It is clear that in this case $P(X,Y)$ is generated by a coarsening variable reduced to a constant ($f(X)=F(X,c)$). This setting is the one proposed by Dempster et al. [11] in their famous paper on the EM algorithm, presented as an approach to obtaining a maximum likelihood estimate (MLE) under incomplete information.²

¹ The term “coarsening” is commonly used in the literature of statistics with incomplete data. Notwithstanding, the idea of coarsening has also been linked to the idea of partition and indiscernibility. For instance, in Rough Set Theory [34] it means “change of granularity”. Also, in Shafer's Theory of Evidence [45], it is linked to the idea of defining a partition of indiscernible elements. Indeed in some cases, within the literature of coarse statistics the notion of coarsening variable comes down to choosing a partition, one element of which is the imprecise observation. So the term “coarsening process” seems to often underlie this partition-based modeling of imprecise observation generation. However, in this paper this is modeled by a mere multi-mapping in the style of Dempster [10]. In the rest of the paper, we will use the term “imprecisiation”, that does not presuppose coarse data to be generated through partitioning.

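• Miss-or-observe setting. In this case, we assume that either the value of $X$ is observed precisely or that it is not observed at all. Then, $r=m+1$ and $\{A_1,\dots,A_r\}=\{\mathcal{X},\{a_1\},\dots,\{a_m\}\}$.

Let $P(\Gamma=\mathcal{X}|X=a_k)=\alpha_k$ and $P(\Gamma=\{a_k\}|X=a_k)=1-\alpha_k$, $k=1,\dots,m$. The mixing matrix $M$ is therefore of the form:

$$\begin{array}{c|cccccc}
\text{out.}\backslash\text{obs.} & \mathcal{X} & \{a_1\} & \cdots & \{a_i\} & \cdots & \{a_m\}\\
\hline
a_1 & \alpha_1 & 1-\alpha_1 & \cdots & 0 & \cdots & 0\\
\vdots & \vdots & & \ddots & & & \vdots\\
a_i & \alpha_i & 0 & \cdots & 1-\alpha_i & \cdots & 0\\
\vdots & \vdots & & & & \ddots & \vdots\\
a_m & \alpha_m & 0 & \cdots & 0 & \cdots & 1-\alpha_m
\end{array}$$

This is the situation of missing data [39]. In this case, there is a coarsening variable $C$ with range $\{0,1\}$, such that $F(X,C)$ is a singleton in $\mathcal{X}$ if $C=1$ and is $\mathcal{X}$ otherwise. With the definition of $\alpha_i$ above, we can see that $P(C=0|X=a_i)=\alpha_i=P(Y=\mathcal{X}|X=a_i)$.

An important particular case is when the probabilities $\alpha_i$ are constant, that is, the probability of missing data does not depend on the outcome of the latent variable. It is also a particular case of the missing-completely-at-random (MCAR) assumption known in the literature [28]. If the latent variable $X$ is made of two components $X_o$ and $X_u$ respectively corresponding to observed and unobserved variables, the MCAR assumption then reads $P(\Gamma_u=\mathcal{X}_u|X_o,X_u)=P(\Gamma_u=\mathcal{X}_u)=\alpha$.

Another usual assumption is the missing-at-random one (MAR), that reads $P(\Gamma_u=\mathcal{X}_u|X_o,X_u)=P(\Gamma_u=\mathcal{X}_u|X_o)$.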

• Coarsening at random assumption (CAR). This notion was introduced by Heitjan and Rubin (see [21]). According to this assumption, the underlying data do not affect the observations, i.e., an observation $Y=A_j$ is not influenced by the specific value taken by the random variable $X$ inside the set $A_j$. Mathematically, this condition is expressed as follows:

$$P(Y=A_j|X=a_k)=P(Y=A_j|X=a_{k'}),\quad \forall k,k' \text{ with } a_k,a_{k'}\in A_j,\ \forall j,$$

or equivalently, for every $a_k\in A_j$,

$$P(Y=A_j|X=a_k)=P(Y=A_j|X\in A_j),\quad j=1,\dots,r.$$

A generalization of this notion has been recently considered in a machine learning context by J. Plass et al. in References [35,36], where $X$ involves observed and unobserved variables, respectively $X_o$ and $X_u$. Under the assumption CAR, which generalizes MCAR, set-valued observations $\Gamma_u$ are assumed to be independent from the true value of $X_u$. Let us notice that this generalized assumption collapses into the original CAR definition whenever $X_o$ is a constant. The authors of [35,36] also introduced a kind of “orthogonal” assumption in this machine learning context called “subgroup independence (SI)”. Under this assumption, the set-valued observations $\Gamma_u$ are assumed not to be influenced by the value of $X_o$. Testability of this assumption is studied by the authors, a problem that falls out of the scope of our manuscript.

• Coarsening Completely at Random Assumption (CCAR) ([39]). The natural definition would be that of imprecise observations $Y$ independent of actual outcomes for $X$, i.e., $P(X,Y)=P(X)P(Y)$. However, Jaeger [25] points out that this definition is problematic because it allows for joint outcomes $(x,A)$ such that $x\notin A$. To enforce the consistency of set-valued observations ($X(\omega)\in\Gamma(\omega)$), the CCAR assumption must refer to a coarsening variable. Jaeger [25] proposes that $P(X,Y)$ is CCAR with respect to a coarsening variable $C$ if $P(C=c|X=a_k)=P(C=c|X=a_{k'})$ for every pair $a_k,a_{k'}\in\mathcal{X}$ of elements with positive probability, i.e., included in the set $\{x\in\mathcal{X}: P(X=x)>0\}$.

• Superset assumption ([23]). Again we consider observed and unobserved variables with respective domains $\mathcal{X}_o$ and $\mathcal{X}_u$. We assume that when $X_u=x$ is fixed, the conditional $P(Y=B|X_u=x)$ does not depend on $B$ whenever $x\in B$, and is 0 otherwise. For every $x\in\mathcal{X}_u$ the number of subsets of $\mathcal{X}_u$ that contain it is the same, namely $2^{\#\mathcal{X}_u-1}$. Therefore $P(Y=B|X_u=x)=1/2^{\#\mathcal{X}_u-1}$.

This assumption is dual to the missing-at-random assumption in the sense that in the latter, $Y=B$ is fixed and $P(Y=B|X_u=x)$ does not depend on the choice of $x$ inside $B$, while in the superset assumption, the set-valued observation induced by $X=x$ can be any superset of $x$ with equal probability. This assumption is often presented as capturing the idea of “lack of information” about the measurement process. In fact, the uniform distribution is used to reflect a balance among outcomes when no information is available. This modeling of ignorance has been questioned in the context of non-additive representations of belief (e.g., Shafer belief functions [45], or Walley's imprecise probability theory [50], where a complete lack of information is usually represented by means of a vacuous possibility distribution).

The superset assumption can be particularized to the case where $X_o$ is a constant. It is then to the original general superset assumption what the original CAR assumption is to the generalized version considered by J. Plass et al. [35,36].

² Curiously, the connection between this approach and the more general setting for incomplete information of [10] is not made by Dempster et al. [11].

The next example illustrates both assumptions.

Example 1. A coin is tossed. The random variable $X:\Omega\to\mathcal{X}$, where $\mathcal{X}=\{h,t\}$, represents the result of the toss. We do not directly observe the outcome, which is reported by Peter, who sometimes decides not to tell us the result. The rest of the time, the information he provides about the outcome is faithful. Let $Y$ denote the information provided by this person about the result. It takes the “values” $\{h\}$, $\{t\}$ and $\{h,t\}$.

This example corresponds to the following matrix $(M|p)$:

$$(M|p)=\left(\begin{array}{ccc|c}
1-\alpha & 0 & \alpha & p\\
0 & 1-\beta & \beta & 1-p
\end{array}\right)$$

The marginal distribution of $X$ (outcome of the experiment) is given as

– $p_{1.}=P(X=h)=p$,

– $p_{2.}=P(X=t)=1-p$.

The joint probability distribution of $(X,Y)$ is therefore determined by:

$$\begin{array}{c|ccc}
X\backslash Y & \{h\} & \{t\} & \{h,t\}\\
\hline
h & (1-\alpha)p & 0 & \alpha p\\
t & 0 & (1-\beta)(1-p) & \beta(1-p)
\end{array}$$

The marginal distribution of $Y$ (information provided by Peter) is thus:

– $p_{.1}=P(Y=\{h\})=P(Y=\{h\},X=h)+P(Y=\{h\},X=t)=(1-\alpha)\cdot p+0=(1-\alpha)\cdot p$,

– $p_{.2}=P(Y=\{t\})=P(Y=\{t\},X=h)+P(Y=\{t\},X=t)=0+(1-\beta)\cdot(1-p)=(1-\beta)\cdot(1-p)$,

– $p_{.3}=P(Y=\{h,t\})=P(Y=\{h,t\},X=h)+P(Y=\{h,t\},X=t)=\alpha\cdot p+\beta\cdot(1-p)$.
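The passage from the matrix $(M|p)$ to the joint distribution and to the marginal of $Y$ can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the numerical values of $p$, $\alpha$ and $\beta$ are arbitrary choices, not values from the paper.

```python
def joint_from_mixing(M, p):
    """Joint table: P(X=a_k, Y=A_j) = P(Y=A_j | X=a_k) * P(X=a_k)."""
    return [[m_kj * p_k for m_kj in row] for row, p_k in zip(M, p)]

def marginal_y(joint):
    """Column sums of the joint table: P(Y=A_j)."""
    return [sum(column) for column in zip(*joint)]

# Arbitrary illustrative parameter values (assumptions, not from the paper).
p, alpha, beta = 0.5, 0.2, 0.4

# Rows: outcomes h, t. Columns: reports {h}, {t}, {h,t}.
M = [
    [1 - alpha, 0.0, alpha],
    [0.0, 1 - beta, beta],
]

joint = joint_from_mixing(M, [p, 1 - p])
p_y = marginal_y(joint)  # [(1-alpha)p, (1-beta)(1-p), alpha*p + beta*(1-p)]
```

Each row of `M` sums to one, so the entries of `joint` sum to one as well, and `p_y` reproduces the three expressions listed above.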

Under the CAR assumption, we have that $\alpha=\beta$, i.e.,

$$P(Y=\{h,t\}|X=h)=P(Y=\{h,t\}|X=t).$$

Let us now consider the (binary) random variable $C$ that takes the value $C=0$ when $Y=\{h,t\}$ and $C=1$ otherwise. Let $F$ be defined as $F(x,0)=\{h,t\}$ and $F(x,1)=\{x\}$. The CCAR assumption with respect to the coarsening variable $C$ is equivalent to the above mentioned CAR condition.

The superset assumption is more restrictive and assumes that

– $\alpha=P(Y=\{h,t\}|X=h)=P(Y=\{h\}|X=h)=0.5$ and

– $\beta=P(Y=\{h,t\}|X=t)=P(Y=\{t\}|X=t)=0.5$,

and therefore $P(Y=\{h,t\})=0.5$. In words, no matter what the true outcome is (heads or tails), Peter does not give any information about it 50% of the time. The only remaining parameter is $p$.

This example demonstrates that the superset assumption represents significant knowledge on the observation process. This is the price paid if we need to provide a stochastic model of the measurement process, as the uniform distribution over all supersets of $\{x\}$ is the least prejudiced probabilistic assumption we can make. The superset assumption is in fact stronger than the CAR assumption: since under the former assumption, $P(Y=B|X_u=x)$ only depends on the number of subsets of $\mathcal{X}$ that contain $x$, and this number is the same for all $x$, it follows that $P(Y=B|X_u=x)=P(Y=B|X_u=x')$ for $x,x'\in B$, which is CAR.
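The CAR condition is a property of the columns of the mixing matrix, so it can be checked mechanically. The sketch below is a hypothetical helper (the name `satisfies_car` and the tolerance are assumptions made for illustration) applied to the coin example with a few arbitrary parameter values.

```python
def satisfies_car(M, outcomes, sets, tol=1e-12):
    """Check CAR: for each reported set A_j, P(Y=A_j | X=a_k) must take
    the same value for every outcome a_k that belongs to A_j."""
    for j, A in enumerate(sets):
        inside = [M[k][j] for k, a in enumerate(outcomes) if a in A]
        if inside and max(inside) - min(inside) > tol:
            return False
    return True

def coin_mixing_matrix(alpha, beta):
    """Mixing matrix of the coin example; columns are {h}, {t}, {h,t}."""
    return [
        [1 - alpha, 0.0, alpha],
        [0.0, 1 - beta, beta],
    ]

outcomes = ["h", "t"]
sets = [{"h"}, {"t"}, {"h", "t"}]

car_unequal = satisfies_car(coin_mixing_matrix(0.2, 0.4), outcomes, sets)
car_equal = satisfies_car(coin_mixing_matrix(0.3, 0.3), outcomes, sets)
car_superset = satisfies_car(coin_mixing_matrix(0.5, 0.5), outcomes, sets)
```

With unequal values of $\alpha$ and $\beta$ the check fails, while any matrix with $\alpha=\beta$ passes, including the superset case $\alpha=\beta=0.5$, in line with the remark that the superset assumption is stronger than CAR.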

2.2. Imprecise observations and their disambiguations

We can alternatively characterize the joint probability distribution of $(X,Y)$ by means of the marginal distribution of $Y$ (observations) and the conditional probability of each result $X=a_k$, knowing that the observation was $Y=b_j$ (or equivalently $\Gamma=A_j$), for every $j=1,\dots,r$.

The new matrix $(M'|p')$ can be written as follows:

$$(M'|p')=\left(\begin{array}{ccc|c}
p_{1.|.1} & \cdots & p_{m.|.1} & p_{.1}\\
\vdots & & \vdots & \vdots\\
p_{1.|.r} & \cdots & p_{m.|.r} & p_{.r}
\end{array}\right)$$

where

• $p_{k.|.j}=P(X=a_k|Y=A_j)$ denotes the (conditional) probability that the true value of $X$ is $a_k$ if we have been reported that it belongs to $A_j$;

• $p_{.j}=P(Y=b_j)=P(Y=A_j)$ denotes the probability that the generation plus the imprecisiation processes lead us to observe $A_j$.

Such a matrix also determines the joint probability distribution modeling the underlying generating process plus the connection between true outcomes and incomplete observations. More specifically, the vector $(p_{.1},\dots,p_{.r})^T$ characterizes the observation process, while the matrix $M'=(p_{k.|.j})_{k=1,\dots,m;\,j=1,\dots,r}$ represents the conditional probability of $X$ (true outcome) given $Y$ (observation).

Some recent studies on “partial identification” (see [30], for instance) can be somehow related to this framework. They consider situations where parameters of interest are partially identified (see [31]). For instance, when the parameter is real-valued, the so-called “identification region” is a subset of the real line. Imbens and Manski [24] propose the construction of confidence intervals that cover every element in the region with a specific confidence level, whose bounds can be computed from sample data.

In this regard, the marginal distribution on $X$, $(p_{1.},\dots,p_{m.})$, is sometimes partially identified, on the basis of our knowledge of the marginal distribution on $Y$. In fact, let us notice that, according to the total probability theorem, we can write:

$$p_{k.}=\sum_{j}p_{.j}\cdot p_{k.|.j},\quad \forall k=1,\dots,m.$$

Furthermore, partial information about the matrix $M'$ is sometimes available. As a consequence, confidence regions for any parameter inducing the marginal distribution $(p_{1.},\dots,p_{m.})$ on $X$ can be derived from the above information, on the basis of observable frequencies, that may allow us to provide confidence estimations for the marginal distribution on $Y$.

Consider for instance a miss-or-observe problem, where $Y$ takes $r=m+1$ values of the form $b_j=\{a_j\}$, $j=1,\dots,m$, and $b_{m+1}=\mathcal{X}$. The equality $P(X=a_k|Y=\{a_k\})=1$ holds, for every $k=1,\dots,m$. Furthermore, $P(X=a_k|Y=\mathcal{X})$ is known to be included in the unit interval. Based on the above information and on observable marginal frequencies, we can compute set-valued confidence estimations for every $(p_{1.},\dots,p_{m.})$ or alternatively, for a parameter determining it.
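In the miss-or-observe case, the total probability formula gives $p_{k.}=p_{.k}+p_{.m+1}\cdot P(X=a_k|Y=\mathcal{X})$, and letting the unknown conditional probability range over $[0,1]$ yields an interval for each $p_{k.}$. The Python sketch below computes these intervals from plug-in frequencies. It is a simplification that ignores sampling variability, so it returns bounds on the identification region rather than confidence regions; the function name and the frequencies are hypothetical.

```python
def marginal_bounds(freq_singletons, freq_missing):
    """Interval [lower, upper] for each P(X=a_k) in a miss-or-observe problem.

    freq_singletons[k] estimates P(Y={a_k}); freq_missing estimates P(Y=X).
    Lower bound: no missing report hides a_k; upper bound: all of them do.
    """
    return [(f, f + freq_missing) for f in freq_singletons]

# Hypothetical relative frequencies: 40% {a_1}, 20% {a_2}, 40% missing.
bounds = marginal_bounds([0.4, 0.2], 0.4)
```

The width of every interval equals the frequency of missing reports, so the marginal on $X$ is point-identified only when nothing is missing.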

One example of assumption in this setting is the Uniform Conditional Distribution Assumption. In this case we assume that if $A_j$ is observed, all the possible outcomes $a_k\in A_j$ are equally probable, due to a symmetry argument such as the insufficient reason principle. The conditional distribution is then given by:

$$p_{k.|.j}=\begin{cases}\dfrac{1}{\#A_j}, & \text{if } a_k\in A_j,\\[4pt] 0 & \text{otherwise.}\end{cases}$$

Knowing the distribution on $Y$ (which can be estimated from the data), the marginal distribution on $X$ can be estimated as well since:

$$p_{k.}=\sum_{j=1}^{r}p_{k.|.j}\cdot p_{.j}=\sum_{j:A_j\ni a_k}\frac{1}{\#A_j}\,p_{.j}.$$

Note that this assumption is similar to the superset assumption, exchanging the roles of subsets and elements of $\mathcal{X}$. In the coin-tossing example, it comes down to assuming

$$P(X=h|Y=\{h,t\})=P(X=t|Y=\{h,t\})=0.5$$

instead of $P(Y=\{h,t\}|X=h)=P(Y=\{h,t\}|X=t)=0.5$.
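Under the uniform conditional distribution assumption, the estimate of the marginal on $X$ amounts to splitting the mass of each reported set equally among its elements. The following Python sketch illustrates this; the function name and the frequencies used for the coin example are hypothetical.

```python
def uniform_conditional_marginal(p_y):
    """Marginal on X obtained by sharing each P(Y=A_j) equally among
    the elements of A_j (uniform conditional distribution assumption).

    p_y maps each reported set (a frozenset) to its probability.
    """
    p_x = {}
    for reported_set, prob in p_y.items():
        share = prob / len(reported_set)
        for outcome in reported_set:
            p_x[outcome] = p_x.get(outcome, 0.0) + share
    return p_x

# Hypothetical distribution of Peter's reports in the coin example.
p_y = {
    frozenset({"h"}): 0.4,
    frozenset({"t"}): 0.2,
    frozenset({"h", "t"}): 0.4,
}
p_x = uniform_conditional_marginal(p_y)
```

Here the 0.4 mass on the report $\{h,t\}$ is shared equally between heads and tails, so the estimated probability of heads is the frequency of $\{h\}$ plus half the frequency of uninformative reports.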

Viewing a probability distribution on $\mathcal{Y}$ as a Dempster–Shafer mass assignment $m$ on $\wp(\mathcal{X})$, as mentioned at the beginning of this section, $P_X$, as defined above, is the pignistic transform [47] of the belief function induced by the following mass assignment:

$$m(A_j)=p_{.j},\quad \forall j=1,\dots,r,$$

or yet its Shapley value. More generally, fixing a matrix $M'$ comes down to picking a probability distribution on $\mathcal{X}$ from the convex credal set $\{P_X: P_X(A)\geq \mathrm{Bel}(A),\ \forall A\subseteq\mathcal{X}\}$.

3. Maximum likelihood strategies under incomplete information

Each matrix $(M|p)$ or $(M'|p')$ is enough to univocally characterize the joint distribution of $(X,Y)$. For each pair $(k,j)\in\{1,\dots,m\}\times\{1,\dots,r\}$, let $p_{kj}$ denote the joint probability $p_{kj}=P(X=a_k,Y=A_j)$. According to the nomenclature used in the preceding subsections, the respective marginals on $\mathcal{X}$ and $\mathcal{Y}$ are denoted as follows:

• $p_{.j}=P(Y=A_j)=\sum_{k=1}^{m}p_{kj}$ will denote the mass of $Y=A_j$, for every $j$, and

• $p_{k.}=P(X=a_k)=\sum_{j=1}^{r}p_{kj}$ will denote the mass of $X=a_k$, for every $k$.

Now, let us assume that the above joint distribution is characterized by means of a (vector of) parameter(s) $\theta\in\Theta$ (in the sense that entries in $M$ and $M'$ can be written as functions of $\theta$). We naturally assume that the number of components of $\theta$ is less than or equal to the dimension of both matrices, i.e., it is less than or equal to $\min\{m\times(r+1),\,r(m+1)\}$. In other words, the approach uses a parametric model such that a value of $\theta$ determines a joint distribution on $\mathcal{X}\times\mathcal{Y}$. When the joint probability measure is the parametric distribution associated to the (vector of) value(s) $\theta$ of the parameter, we will respectively use the nomenclature

• $p^{\theta}_{kj}=P(X=a_k,Y=A_j;\theta)$,

• $p^{\theta}_{k.}=\sum_{j=1}^{r}p^{\theta}_{kj}=P(X=a_k;\theta)$ and

• $p^{\theta}_{.j}=\sum_{k=1}^{m}p^{\theta}_{kj}=P(Y=A_j;\theta)$.

Let us consider a sequence $\mathbf{Z}=((X_1,Y_1),\dots,(X_N,Y_N))$ of $N$ iid random variables that are “copies” of $Z=(X,Y)$. We will use the nomenclature $\mathbf{z}=((x_1,y_1),\dots,(x_N,y_N))\in(\mathcal{X}\times\mathcal{Y})^N$ to represent a specific sample of the vector $(X,Y)$. Thus, $\mathbf{y}=(y_1,\dots,y_N)$ will denote the observed sample (an observation of the vector $\mathbf{Y}=(Y_1,\dots,Y_N)$), and $\mathbf{x}=(x_1,\dots,x_N)$ will denote an arbitrary artificial sample from $\mathcal{X}$ for the unobservable (latent) variable $X$, that we shall vary in $\mathcal{X}^N$. Let $(G_1,\dots,G_N)$ be the sequence of subsets of $\mathcal{X}$ that corresponds to the observed sample $\mathbf{y}$ (namely, $Y_i=y_i$ corresponds to $X_i\in G_i$).

We can describe any sample $\mathbf{z}$ in frequentist terms assuming exchangeability:

• $n_{kj}=\sum_{i=1}^{N}1_{\{(a_k,b_j)\}}(x_i,y_i)$ is the number of repetitions of $(a_k,b_j)$ in sample $\mathbf{z}$;

• $\sum_{k=1}^{m}n_{kj}=n_{.j}$ is the number of observations of $b_j=A_j$ in $\mathbf{y}$;

• $\sum_{j=1}^{r}n_{kj}=n_{k.}$ is the number of appearances of $a_k$ in $\mathbf{x}$.

Clearly, $\sum_{k=1}^{m}n_{k.}=\sum_{j=1}^{r}n_{.j}=N$. Let the reader notice that, once a specific sample $\mathbf{y}=(y_1,\dots,y_N)\in\mathcal{Y}^N$ has been observed, the number $n_{kj}$ of repetitions of each pair $(a_k,b_j)\in\mathcal{X}\times\mathcal{Y}$ in the sample can be expressed as a function of $\mathbf{x}=(x_1,\dots,x_N)$. Moreover, in the following, $\mathcal{X}^{\mathbf{y}}$ denotes the collection of feasible marginal samples $(x_1,\dots,x_N)$ of $X$, in accordance with the observation $\mathbf{y}$:

$$\mathcal{X}^{\mathbf{y}}=\{\mathbf{x}\in\mathcal{X}^N: x_i\in G_i,\ i=1,\dots,N\}$$

and likewise $\mathcal{Z}^{\mathbf{y}}$ denotes the collection of feasible (joint) samples $(z_1,\dots,z_N)$ of $Z$, in accordance with the observation $\mathbf{y}$:

$$\mathcal{Z}^{\mathbf{y}}=\{\mathbf{z}\in(\mathcal{X}\times\mathcal{Y})^N: z_i=(x_i,y_i) \text{ and } x_i\in G_i,\ i=1,\dots,N\}.$$

3.1. Which likelihood function?

We may consider three different likelihood functions (and their respective logarithms), depending on whether we refer to the observed sample $\mathbf{y}=(y_1,\dots,y_N)$, the sample of (ill-observed) outcomes $\mathbf{x}=(x_1,\dots,x_N)$, or the complete sample $\mathbf{z}$, and a fourth expression that interprets imprecise observations as events. We will use the following nomenclature to distinguish them from each other:

Visible likelihood function. $p(\mathbf{y};\theta)=\prod_{i=1}^{N}p(y_i;\theta)$ denotes the probability of observing $\mathbf{y}\in\mathcal{Y}^N$, assuming that the value of the parameter is $\theta$. It can be alternatively expressed as $p(\mathbf{y};\theta)=\prod_{j=1}^{r}(p^{\theta}_{.j})^{n_{.j}}$, where $n_{.j}$ denotes the number of repetitions of $b_j=A_j$ in the sample of size $N$ (the number of times that the reporter says that the outcome of the experiment belongs to $A_j$). The logarithm of this likelihood function will be denoted by

$$L^{\mathbf{y}}(\theta)=\log p(\mathbf{y};\theta)=\sum_{i=1}^{N}\log p(y_i;\theta)=\sum_{j=1}^{r}n_{.j}\log p^{\theta}_{.j}.$$

We call $p(\mathbf{y};\theta)$ the visible likelihood function, because we can compute it based on the available data only, that is, the observed sample $\mathbf{y}$. It is also sometimes called the marginal likelihood of the observed data in the EM literature, not to be confused with the marginal likelihood in a Bayesian context (see [3], for instance).

Face likelihood function. Note that $p(\mathbf{y};\theta)$ differs from the quantity

$$\lambda(\mathbf{y};\theta)=\prod_{j=1}^{r}P(X\in A_j;\theta)^{n_{.j}}$$

called the “face likelihood” in Refs. [9,26]. The latter quantity does not refer to the observation process, and replaces the probability of reporting $A_j$ as the result of an observation (i.e. $P(Y=A_j)$) by the probability that the true outcome falls inside the set $A_j$, $P(X\in A_j)$. In particular, the occurrence of event “$X\in A_j$” is a consequence of, but does not necessarily coincide with, the outcome “$Y=A_j$”. In our context, $p(\mathbf{y};\theta)$ represents the probability of occurrence of the result “$(y_1,\dots,y_N)=\mathbf{y}$”, given the hypothesis $\theta$. Therefore, given two arbitrary different samples $\mathbf{y}\neq\mathbf{y}'$, the respective events “$(y_1,\dots,y_N)=\mathbf{y}$” and “$(y_1,\dots,y_N)=\mathbf{y}'$” are mutually exclusive. In contrast, $\lambda(\mathbf{y};\theta)$ denotes the probability of occurrence of the event $(X_1,\dots,X_N)\in G_1\times\dots\times G_N$, where $G_j=A_j$ if $Y_j=A_j$. Events of this form may overlap, in the sense that, given two different samples $\mathbf{y}\neq\mathbf{y}'$, the corresponding events $(X_1,\dots,X_N)\in G_1\times\dots\times G_N$ and $(X_1,\dots,X_N)\in G'_1\times\dots\times G'_N$ are not necessarily mutually exclusive. Therefore $\lambda(\mathbf{y};\theta)$ cannot be regarded as a likelihood in the sense of Edwards ([15]).

However, under the CAR assumption and the assumption of distinctness of parameters, maximizing the face likelihood is the same as maximizing the visible likelihood. Jaeger ([25], Th. 2.18) points out that under CAR, the following equality obtains:

$$P(Y=A|X=x)=P(Y=A|X\in A)=\frac{P(Y=A)}{P(X\in A)}.$$

This is easy to see noticing that since, from CAR, $P(Y=A|X=x)=k_A$ does not depend on $x\in A$, the equality

$$P(X=x|Y=A)\cdot P(Y=A)=P(Y=A|X=x)\cdot P(X=x)$$

implies

$$\sum_{x\in A}P(X=x|Y=A)\cdot P(Y=A)=k_A\cdot\sum_{x\in A}P(X=x).$$

Hence $P(X\in A|Y=A)\cdot P(Y=A)=k_A\cdot P(X\in A)$, but $P(X\in A|Y=A)=1$. So, up to the factor $k_A$, $P(Y=A)$ only depends on the probability of event $A$ on $\mathcal{X}$. We can deduce from this fact that, under the assumption of separability with respect to $M$ (also referred to as distinctness of the parameters), the arguments of the maxima of the visible and the face likelihoods do coincide. Further comments about this equivalence will be provided in Subsection 4.5.
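The equality $P(Y=A|X=x)=P(Y=A)/P(X\in A)$ can be checked numerically on a small CAR model. The Python sketch below uses a hypothetical three-outcome space, with arbitrary values for the marginal of $X$ and for the constants $k_A$; these were chosen only so that each row of the implied mixing matrix sums to one.

```python
# Hypothetical CAR model on X = {a, b, c} (all values are illustrative).
outcomes = ["a", "b", "c"]
p_x = {"a": 0.2, "b": 0.3, "c": 0.5}

# Under CAR, P(Y=A | X=x) = k_A for every x in A (and 0 outside A).
k = {
    frozenset("ab"): 0.7,
    frozenset("c"): 0.7,
    frozenset("abc"): 0.3,
}

def p_y_given_x(A, x):
    """Conditional probability of reporting A when the true outcome is x."""
    return k[A] if x in A else 0.0

def ratio_matches(A, tol=1e-12):
    """Compare P(Y=A) / P(X in A) with the CAR constant k_A."""
    p_report = sum(p_y_given_x(A, x) * p_x[x] for x in outcomes)
    p_inside = sum(p_x[x] for x in A)
    return abs(p_report / p_inside - k[A]) < tol

# Each outcome must report some set with total probability one.
row_sums = {x: sum(p_y_given_x(A, x) for A in k) for x in outcomes}
all_match = all(ratio_matches(A) for A in k)
```

Changing `p_x` leaves `all_match` true as long as the constants `k` are kept, which reflects the fact that under CAR the visible likelihood factorizes into a term depending on the mixing constants and the face likelihood.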

Latentlikelihoodfunction p(x,θ )=QN

i=1p(xi;θ )=Qmi=1(pθk.)

nk._, _where_n_k. _denotes _the_number_of _occurrences_of_a_k _in_the

sample x= (x1,. . . ,xN). Thissampleisneverobserveddirectly,byassumption.Howeveritmustbeinagreementwiththe

observedsampley.Eachvirtualsampleof X inXy yieldsapossiblelikelihoodfunctionp(x,θ )inagreementwiththeactual

observationsy.Thelogarithmofthislikelihoodfunctionwillbedenotedby

Lx

(θ ) =

log p(x

; θ ) =

N

X

i=1 log p(xi

; θ ) =

m

X

k=1 nk.log pkθ.

,

where nk.=Prj=1nkj and nkj is such that Pmk=1nkj=n.j, the numberof times Y =Aj has been observed. Note that the

definitionofnk. isinagreementwithx∈ Xy.Wecallp(x;θ )thelatentlikelihoodfunction,becausex isnotactuallyobserved,

norarethenk.’s,sinceonly then.j’sare.

Total likelihood function. $p(\mathbf{z};\theta)=\prod_{i=1}^{N}p(z_i;\theta)=\prod_{k=1}^{m}\prod_{j=1}^{r}(p^{\theta}_{kj})^{n_{kj}}$ is the likelihood function induced by the whole artificial sample $\mathbf{z}$, which we call the total likelihood. Again, it must be in agreement with the observed sample $\mathbf{y}=(y_1,\ldots,y_N)$, which is fixed by assumption. Each virtual sample of $Z$ in $\mathcal{Z}^{\mathbf{y}}$ yields a possible likelihood function $p(\mathbf{z};\theta)$ in agreement with the actual observations $\mathbf{y}$. We will denote its logarithm by

\[
L^{\mathbf{z}}(\theta)=\log p(\mathbf{z};\theta)=\sum_{i=1}^{N}\log p(z_i;\theta)=\sum_{k=1}^{m}\sum_{j=1}^{r}n_{kj}\log p^{\theta}_{kj}.
\]

Maximizing $p(\mathbf{z};\theta)$ allows us to introduce assumptions on the measurement process, either in terms of imprecisiation or disambiguation. Namely, the conditional probabilities $p_{.j|k.}$ may be known because, for instance, the superset assumption is made. Alternatively, the probabilities $p_{k.|.j}$ could be known, which, along with a particular distribution on $Y$ (to be estimated from the observations $\mathbf{y}$), is enough to derive a concrete distribution on $Z$ and therefore on $X$. More generally, there may be some dependence between the process driving the latent variable $X$ and the measurement process driving the actual observations $\mathbf{y}$. In this case, maximizing $L^{\mathbf{z}}(\theta)$ enables this kind of additional information to be accounted for.
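A minimal Python sketch of the three log-likelihoods defined above may help fix the notation. It builds the counts $n_{kj}$, $n_{k.}$ and $n_{.j}$ from a small complete sample and evaluates $L^{\mathbf{z}}$, $L^{\mathbf{x}}$ and $L^{\mathbf{y}}$. The sample, the outcome labels and the joint probabilities are illustrative assumptions, not data from the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical complete sample z_i = (x_i, y_i): x_i is the latent outcome,
# y_i the (possibly imprecise) observation, a set that contains x_i.
H, T = "h", "t"
z = [(H, frozenset({H})), (H, frozenset({H, T})), (T, frozenset({T})),
     (T, frozenset({H, T})), (H, frozenset({H}))]

# Assumed joint distribution p_kj = P(X = a_k, Y = A_j) (illustrative values).
p_joint = {
    (H, frozenset({H})): 0.3, (H, frozenset({H, T})): 0.2,
    (T, frozenset({T})): 0.3, (T, frozenset({H, T})): 0.2,
}

def marginal(joint, axis):
    """Marginal of the joint on X (axis=0) or on Y (axis=1)."""
    out = defaultdict(float)
    for key, prob in joint.items():
        out[key[axis]] += prob
    return out

def log_lik(counts, probs):
    """sum_k n_k log p_k, skipping zero counts (convention 0^0 = 1)."""
    return sum(n * math.log(probs[k]) for k, n in counts.items() if n > 0)

n_kj = Counter(z)                   # joint counts n_kj
n_k = Counter(x for x, _ in z)      # latent counts n_k.
n_j = Counter(y for _, y in z)      # visible counts n_.j

L_z = log_lik(n_kj, p_joint)                 # total log-likelihood
L_x = log_lik(n_k, marginal(p_joint, 0))     # latent log-likelihood
L_y = log_lik(n_j, marginal(p_joint, 1))     # visible log-likelihood
print(L_z, L_x, L_y)
```

Since each joint probability is bounded by both of its marginals, $L^{\mathbf{z}}$ can exceed neither $L^{\mathbf{x}}$ nor $L^{\mathbf{y}}$ for the same $\theta$.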

Remark 3.1. In the above expressions, we use the convention $0^0=1$. In other words, the expression $\prod_{k=1}^{m}\prod_{j=1}^{r}p_{kj}^{n_{kj}}$ replaces the formally correct expression

\[
\prod_{(k,j)\in\{1,\ldots,m\}\times\{1,\ldots,r\}\,:\,n_{kj}\neq 0}p_{kj}^{n_{kj}}.
\]

Example 2. Consider again Example 1, i.e. the coin tossing experiment, assuming for 10 tosses that Peter reports 4 times Heads, 2 times Tails and 4 times nothing. Let us write the four likelihood functions.

• Visible likelihood: $p(\mathbf{y},\theta)=P(\{\{h\}\})^4\cdot P(\{\{t\}\})^2\cdot P(\{\{h,t\}\})^4=[(1-\alpha)p]^4\,[(1-\alpha)(1-p)]^2\,\alpha^4$, using the parameters introduced earlier. Note that $P(\{\{h\}\})+P(\{\{t\}\})+P(\{\{h,t\}\})=1$, as it is a probability distribution on $2^{\{h,t\}}$.

• Face likelihood: $p(\mathbf{y},\theta)=P(\{h\})^4\cdot P(\{t\})^2=p^4(1-p)^2$, since $P(\{h,t\})=1$. Optimizing it comes down to forgetting the missing information.

• Hidden likelihood: $p(\mathbf{x},\theta)=p^{4+n_{13}}\cdot(1-p)^{6-n_{13}}$, where $n_{13}$ is the unknown number of times Peter does not report when the result is Heads.

• Total likelihood: $p(\mathbf{z},\theta)=[(1-\alpha)p]^4\,[(1-\alpha)(1-p)]^2\,(\alpha p)^{n_{13}}\,(\alpha(1-p))^{4-n_{13}}$. The four outcomes obtained by describing both the toss result and the report are evaluated.
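The four likelihoods of Example 2 can be written down directly as functions. The sketch below assumes the parameterization used in the example ($p$ for the probability of heads, $\alpha$ for the probability that Peter does not report); the function names and the numerical values of $p$ and $\alpha$ are illustrative choices.

```python
def visible_lik(p, alpha):
    """Likelihood of the reports: 4 'heads', 2 'tails', 4 'no report'."""
    return ((1 - alpha) * p) ** 4 * ((1 - alpha) * (1 - p)) ** 2 * alpha ** 4

def face_lik(p):
    """Likelihood that ignores the 4 missing reports."""
    return p ** 4 * (1 - p) ** 2

def latent_lik(p, n13):
    """Likelihood of the tosses if n13 of the 4 unreported tosses were heads."""
    return p ** (4 + n13) * (1 - p) ** (6 - n13)

def total_lik(p, alpha, n13):
    """Likelihood of the (toss, report) pairs for a given n13."""
    return (((1 - alpha) * p) ** 4 * ((1 - alpha) * (1 - p)) ** 2
            * (alpha * p) ** n13 * (alpha * (1 - p)) ** (4 - n13))

p, alpha = 0.6, 0.4  # arbitrary illustrative parameter values
for n13 in range(5):
    ratio = total_lik(p, alpha, n13) / latent_lik(p, n13)
    print(n13, latent_lik(p, n13), total_lik(p, alpha, n13), ratio)
```

With this parameterization the ratio between the total and the latent likelihood equals $(1-\alpha)^6\alpha^4$ for every value of $n_{13}$, and summing the total likelihood over the $\binom{4}{n_{13}}$ arrangements of each $n_{13}$ recovers the visible likelihood.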

3.2. Maximum likelihood strategies

In this paper we will compare different existing strategies of likelihood maximization, based on a sequence of observations $\mathbf{y}=(y_1,\ldots,y_N)\in\mathcal{Y}^N$:

• Maximizing $L^{\mathbf{y}}(\theta)$. The argument of the maximum of $L^{\mathbf{y}}$, considered as a mapping defined on $\Theta$, is called the maximum likelihood estimator (MLE), i.e.:

\[
\hat{\theta}=\arg\max_{\theta\in\Theta}L^{\mathbf{y}}(\theta)=\arg\max_{\theta\in\Theta}\prod_{j=1}^{r}(p^{\theta}_{.j})^{n_{.j}}.
\]

Note that this maximization process does not need any reference to the non-observed variable $X$. From optimizing $L^{\mathbf{y}}(\theta)$, what is obtained is a probability distribution on $\mathcal{Y}$, which, as already suggested, can also be viewed as a Dempster–Shafer mass assignment $m_{\theta}$ on $\wp(\mathcal{X})$, letting $m_{\theta}(A_j)=p^{\theta}_{.j}$ for $j=1,\ldots,r$. But a concrete choice of $\theta\in\Theta$ also leads us to select a specific joint distribution $(p^{\theta}_{ij})_{i=1,\ldots,m;\,j=1,\ldots,r}$ on $\mathcal{X}\times\mathcal{Y}$. When the argument of the maximum of the log-likelihood function $L^{\mathbf{y}}$ is not unique, the MLE determines a collection of joint distributions on $\mathcal{X}\times\mathcal{Y}$. Under some circumstances, this collection of joint distributions coincides with the credal set associated to a mass function, and such a mass function determines a unique distribution on $\mathcal{Y}$. We will provide a brief discussion about such a situation in Example 3. The EM algorithm [29] is an iterative technique that uses a latent variable $X$ in order to reach a local maximum of $L^{\mathbf{y}}$ when its optimization is tricky. In this case, we also obtain a precise imputation $\mathbf{x}\in\mathcal{X}^{\mathbf{y}}$. The latent variable is sometimes fictitious, as in the case of learning a mixture of normal distributions [11].

• Maximizing $L^{\mathbf{x}}(\theta)$. This is the genuine goal if one is interested in finding the MLE for the distribution of $X$ despite the imprecise data. However, since the precise sample $\mathbf{x}$ is not available, there is a set $L^{\mathcal{X}^{\mathbf{y}}}(\theta)=\{L^{\mathbf{x}}(\theta):\mathbf{x}\in\mathcal{X}^{\mathbf{y}}\}$ of possible likelihood functions [18]. So we must find not only an optimal value of $\theta$, but also an optimal sample $\mathbf{x}$, according to some strategy. There are two obvious strategies that come to mind:

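1. Maximax strategy: find a pair $(\mathbf{x}^{**},\theta^{**})\in\mathcal{X}^N\times\Theta$ satisfying

\[
(\mathbf{x}^{**},\theta^{**})=\arg\max_{\mathbf{x}\in\mathcal{X}^{\mathbf{y}},\,\theta\in\Theta}L^{\mathbf{x}}(\theta)=\arg\max_{\mathbf{x}\in\mathcal{X}^{\mathbf{y}},\,\theta\in\Theta}\prod_{k=1}^{m}(p^{\theta}_{k.})^{n_{k.}}.
\]

It comes down to maximizing an upper log-likelihood function $\overline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)=\max\{L^{\mathbf{x}}(\theta):\mathbf{x}\in\mathcal{X}^{\mathbf{y}}\}$, which can be viewed as an optimistic strategy; it tends to favor distributions with small entropy, under certain conditions, as we shall see later. The maximax technique has been proposed by E. Hüllermeier [22] using more general loss functions. His paper makes the point that the choice of an optimal pair $(\mathbf{x}^{**},\theta^{**})$ leads to a simultaneous selection of a best model together with a disambiguation of the imprecise observations.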

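2. Maximin strategy: find a pair $(\mathbf{x}^{**},\theta^{**})\in\mathcal{X}^N\times\Theta$ satisfying

\[
(\mathbf{x}^{**},\theta^{**})=\arg\max_{\theta\in\Theta}\min_{\mathbf{x}\in\mathcal{X}^{\mathbf{y}}}L^{\mathbf{x}}(\theta)=\arg\max_{\theta\in\Theta}\min_{\mathbf{x}\in\mathcal{X}^{\mathbf{y}}}\prod_{k=1}^{m}(p^{\theta}_{k.})^{n_{k.}},
\]

where $n_{k.}=\sum_{i=1}^{N}1_{\{a_k\}}(x_i)$ denotes the number of repetitions of $a_k$ in the sample $\mathbf{x}$. It comes down to maximizing a lower log-likelihood function $\underline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)=\min\{L^{\mathbf{x}}(\theta):\mathbf{x}\in\mathcal{X}^{\mathbf{y}}\}$, which can be viewed as a robust strategy that copes with the imprecision of the likelihood function; it tends to favor distributions with large dispersions, as we shall see later. The maximin technique has been proposed in Ref. [18].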

Note that one might object to these approaches. First, considering $\overline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)$ or $\underline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)$ requires the comparison of values of $L^{\mathbf{x}}(\theta)$ for several samples $\mathbf{x}$, which maximum likelihood advocates will strongly question. Following them, one cannot compare likelihood functions coming from distinct data sets [15]. However, one may reply that in our case the data set is unique but ill-known, and the considered samples are all in agreement with the same body of observations $\mathbf{y}$. It is then sure that the true likelihood function lies in the interval $[\underline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta),\overline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)]$. Maximizing one of its bounds is a usual strategy in the face of intervals. We may also use any other strategy that compares intervals, for example the safe but very demanding (if not impossible to reach) partial interval ordering requirement $\underline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta^{\star})>\overline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)$, $\forall\theta\neq\theta^{\star}$.

Hüllermeier [22] justifies the maximax approach by saying that if $L^{\mathbf{x}_1}(\theta_1)>L^{\mathbf{x}_2}(\theta_2)$ then the sample $\mathbf{x}_1$ is arguably more plausible than the sample $\mathbf{x}_2$, simply because the first instantiation allows for a much better fit to the model based on $\theta_1$ than the second one based on $\theta_2$. This philosophy leads to disambiguating the imprecise data through the choice of the best model. However, this line of reasoning makes sense if we are sure that the random process generating $X$ follows a model in the prescribed class parameterized by $\theta$. Then it is natural to consider that the most plausible values compatible with the imprecise observations are those which enable a best fit with the class of parameterized models, so that one may select at the same time the best model and the best sample that justifies it. In that case, the resulting disambiguation is a form of data reconciliation [14]. In contrast, if the set of parameterized models is chosen for its computational simplicity and is known to be an approximation of the real phenomenon, the disambiguation rationale of the maximax approach is then not so strong.

• Maximizing $L^{\mathbf{z}}(\theta)$. As said earlier, this is the natural way to go if some information regarding the dependence between the latent variable $X$ and its measurement process is available, for instance if the superset or the CAR assumption is made. We can again adopt maximax or maximin strategies, since the full sample $\mathbf{z}$ is not available, and only the observations $\mathbf{y}$ are. There is also an iterative strategy that exploits the links between $X$ and $Y$, such as the EM algorithm, which maximizes $L^{\mathbf{y}}(\theta)$ via the production of a fake sample $\mathbf{z}$.

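1. The maximax strategy aims at finding the pair $(\mathbf{z}^{*},\theta^{*})\in\mathcal{Z}^N\times\Theta$ that maximizes the function $L^{\mathbf{z}}(\theta)$:

\[
(\mathbf{z}^{*},\theta^{*})=\arg\max_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}},\,\theta\in\Theta}L^{\mathbf{z}}(\theta)=\arg\max_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}},\,\theta\in\Theta}\prod_{k=1}^{m}\prod_{j=1}^{r}(p^{\theta}_{kj})^{n_{kj}}.
\]

It comes down to maximizing an upper log-likelihood function $\overline{L}^{\mathcal{Z}^{\mathbf{y}}}(\theta)=\max\{L^{\mathbf{z}}(\theta):\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}\}$. The complete sample $\mathbf{z}^{*}$ also yields an optimal sample $\mathbf{x}^{*}\in\mathcal{X}^{\mathbf{y}}$, since $L^{\mathbf{z}}(\theta)$ can be viewed as a function $f^{\mathbf{y}}:\mathcal{X}^N\times\Theta\to\mathbb{R}$ that only depends on $\mathbf{x}$. This maximization procedure has been considered in [23] under the superset assumption.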

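2. It is clear that we can similarly envisage the corresponding maximin strategy and find the pair $(\mathbf{x}^{*},\theta^{*})\in\mathcal{X}^N\times\Theta$ induced by the pair $(\mathbf{z}^{*},\theta^{*})$ that maximizes the lower log-likelihood function $\underline{L}^{\mathcal{Z}^{\mathbf{y}}}(\theta)=\min\{L^{\mathbf{z}}(\theta):\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}\}$:

\[
(\mathbf{z}^{*},\theta^{*})=\arg\max_{\theta\in\Theta}\min_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}}L^{\mathbf{z}}(\theta)=\arg\max_{\theta\in\Theta}\min_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}}\prod_{k=1}^{m}\prod_{j=1}^{r}(p^{\theta}_{kj})^{n_{kj}}.
\]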

3. Guess an initial value of $\theta$, which enables a fictitious sample $\mathbf{z}$ to be constructed; then an MLE of $\theta$ for this sample can be found, and this process is iterated until convergence. This kind of strategy is adopted by the EM algorithm and tries to find a probability model $p^{\theta}$ as close as possible to the empirical distribution of a fake sample $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$, in the sense of Kullback–Leibler divergence [29] (see also [6]).
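The iterative strategy of item 3 can be sketched on the coin data of Example 2 (4 heads, 2 tails, 4 unreported tosses). The sketch assumes, as in the parameterization $(1-\alpha)p$, $(1-\alpha)(1-p)$, $\alpha$ of that example, that not reporting is independent of the outcome, so that the expected number of heads among the unreported tosses is $4p$. Only $p$ is iterated; the function name and the starting point are illustrative choices.

```python
def em_coin(p0=0.5, n_heads=4, n_tails=2, n_missing=4, n_iter=50):
    """EM iterations for p = P(heads) when non-reports are independent of the outcome."""
    p = p0
    n_total = n_heads + n_tails + n_missing
    for _ in range(n_iter):
        # E-step: expected number of heads among the unreported tosses.
        expected_n13 = n_missing * p
        # M-step: MLE of p for the completed (fictitious) sample.
        p = (n_heads + expected_n13) / n_total
    return p

p_hat = em_coin()
print(p_hat)  # approaches 2/3, the maximizer in p of the visible likelihood
```

The update $p\mapsto(4+4p)/10$ has the unique fixed point $p=2/3$, which is also the value maximizing the visible likelihood $[(1-\alpha)p]^4[(1-\alpha)(1-p)]^2\alpha^4$ with respect to $p$.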

In this paper, we focus on the maximax and maximin strategies for maximizing $L^{\mathbf{x}}$ and $L^{\mathbf{z}}$.
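To make the two strategies concrete, the following sketch applies them to the latent log-likelihood of the coin data of Example 2, where the only unknown part of the sample is $n_{13}\in\{0,\ldots,4\}$. The grid search over $p$ is an assumption made for simplicity; it is not a method taken from the paper.

```python
import math

N_HEADS, N_TAILS, N_MISSING = 4, 2, 4  # reports from Example 2

def latent_loglik(p, n13):
    """L^x(theta) when n13 of the unreported tosses are heads."""
    return ((N_HEADS + n13) * math.log(p)
            + (N_TAILS + N_MISSING - n13) * math.log(1 - p))

grid = [k / 1000 for k in range(1, 1000)]   # candidate values of p (assumed grid)
completions = range(N_MISSING + 1)          # possible values of n13

# Maximax: maximize over both the completion and the parameter.
_, p_maximax, n13_maximax = max(
    (latent_loglik(p, n), p, n) for p in grid for n in completions)

# Maximin: maximize over p the worst case over completions.
_, p_maximin = max(
    (min(latent_loglik(p, n) for n in completions), p) for p in grid)

print(p_maximax, n13_maximax, p_maximin)
```

On this grid the maximax pair is $p=0.8$ with $n_{13}=4$ (all unreported tosses imputed as heads, a low-entropy solution), whereas the maximin solution is $p=0.5$ (the most dispersed distribution), in line with the tendencies described above.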

3.3. Connections between MLE strategies

Under some particular conditions about the matrices $M$ and $M'$, some of the above maximization procedures may coincide. Below, some results are provided. There are two kinds of results: some that relate the total likelihood function $p(\mathbf{z};\theta)$ and the latent one $p(\mathbf{x};\theta)$ under suitable assumptions, and those that relate the total likelihood function $p(\mathbf{z};\theta)$ and the visible one $p(\mathbf{y};\theta)$. This is done by introducing assumptions about the incomplete data or the conditional distributions describing the measurement process. It comes down to some information about the matrices $M$ and $M'$.

A first issue concerns the parameter $\theta$, which so far is used in the three likelihood functions as driving the joint distribution on $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, hence the respective marginals on $\mathcal{X}$ and $\mathcal{Y}$. In some situations, $X$ and $Y$ are driven by distinct parameters $\theta_1,\theta_2$.

The following result concerns the disambiguation point of view and involves matrix $M'$.

Definition 1. We say that the parameter $\theta\in\Theta$ is separable with respect to the matrix $(M'|p')$ if it can be “separated” into two (maybe multidimensional) components $\theta_1\in\Theta_1$, $\theta_2\in\Theta_2$ such that $\Theta=\Theta_1\times\Theta_2$, where $p^{\theta}_{.j}$ and $p^{\theta}_{k.|.j}$ can be respectively written as functions of $\theta_1$ and $\theta_2$.

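Proposition 1. $\bigcup_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}}\arg\max_{\theta\in\Theta}L^{\mathbf{z}}(\theta)\subseteq\arg\max_{\theta\in\Theta}L^{\mathbf{y}}(\theta)$ provided that $\theta$ is separable wrt $(M'|p')$.³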

³ Remember that arg max …

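Proof. Let $\mathbf{y}\in\mathcal{Y}^N$ denote the observed sample. Let us select an arbitrary complete sample $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$. Then

\[
p(\mathbf{z};\theta)=\prod_{j=1}^{r}\prod_{k=1}^{m}(p^{\theta}_{kj})^{n_{kj}}=\prod_{j=1}^{r}\prod_{k=1}^{m}(p^{\theta}_{k.|.j}\cdot p^{\theta}_{.j})^{n_{kj}}=\prod_{j=1}^{r}(p^{\theta}_{.j})^{\sum_{k=1}^{m}n_{kj}}\cdot\prod_{j=1}^{r}\prod_{k=1}^{m}(p^{\theta}_{k.|.j})^{n_{kj}}=\prod_{j=1}^{r}(p^{\theta}_{.j})^{n_{.j}}\cdot\prod_{j=1}^{r}\prod_{k=1}^{m}(p^{\theta}_{k.|.j})^{n_{kj}}.
\]

Thus, if $\theta$ is separable wrt $(M'|p')$ we can write:

\[
p(\mathbf{z};\theta)=\prod_{j=1}^{r}(p^{\theta_1}_{.j})^{n_{.j}}\prod_{j=1}^{r}\prod_{k=1}^{m}(p^{\theta_2}_{k.|.j})^{n_{kj}}.
\]

Thus, if $\theta^{*}$ is an optimal parameter such that $L^{\mathbf{z}}(\theta^{*})=\max_{\theta\in\Theta}L^{\mathbf{z}}(\theta)$, then its projection on $\Theta_1$, $\theta_1^{*}\in\Theta_1$, must necessarily satisfy the equality $L^{\mathbf{y}}(\theta_1^{*})=\max_{\theta_1\in\Theta_1}L^{\mathbf{y}}(\theta_1)$. ✷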
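The factorization used in this proof, $p(\mathbf{z};\theta)=\prod_{j}(p_{.j})^{n_{.j}}\cdot\prod_{j}\prod_{k}(p_{k.|.j})^{n_{kj}}$, can be checked numerically on the log scale. The joint probabilities and the counts below are arbitrary illustrative values, not taken from the paper.

```python
import math

# Assumed joint probabilities p[k][j] = P(X = a_k, Y = A_j) and counts n[k][j].
p = [[0.30, 0.00, 0.20],
     [0.00, 0.25, 0.25]]
n = [[3, 0, 2],
     [0, 2, 3]]
m, r = len(p), len(p[0])

p_col = [sum(p[k][j] for k in range(m)) for j in range(r)]   # marginals p_.j
n_col = [sum(n[k][j] for k in range(m)) for j in range(r)]   # counts n_.j

# Sums run over the cells with n_kj != 0 (convention 0^0 = 1).
cells = [(k, j) for k in range(m) for j in range(r) if n[k][j] > 0]

L_z = sum(n[k][j] * math.log(p[k][j]) for k, j in cells)              # total
L_y = sum(n_col[j] * math.log(p_col[j]) for j in range(r))           # visible
L_cond = sum(n[k][j] * math.log(p[k][j] / p_col[j]) for k, j in cells)

print(L_z, L_y + L_cond)  # the two numbers agree up to rounding
```

When the two terms depend on distinct components of $\theta$, each can be maximized separately, which is the step the proof relies on.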

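Now let us check the consequence of the uniform conditional distribution assumption.

Proposition 2. Let $\mathbf{y}=(y_1,\ldots,y_N)\in\mathcal{Y}^N$ denote the observed sample. Let us suppose that $\prod_{j=1}^{r}\prod_{k:\,n_{kj}\neq 0}p_{k.|.j}^{n_{kj}}$ is a value $c$ that does not depend on the particular choice of $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$, nor on $\theta$. Then for every $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$ we have $p(\mathbf{z};\theta)=c\,p(\mathbf{y};\theta)$ and therefore $\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}p(\mathbf{z};\theta)=\arg\max_{\theta\in\Theta}\min_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}}p(\mathbf{z};\theta)=\arg\max_{\theta\in\Theta}p(\mathbf{y};\theta)$.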

Proof. Using the previous proof, we already have that

\[
p(\mathbf{z};\theta)=\prod_{j=1}^{r}p_{.j}^{n_{.j}}\prod_{k:\,n_{kj}\neq 0}(p_{k.|.j})^{n_{kj}}=c\,p(\mathbf{y};\theta).
\]

✷

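Proposition 3. Let $\mathbf{y}=(y_1,\ldots,y_N)\in\mathcal{Y}^N$ denote the observed sample. Let us consider the uniform conditional distribution assumption. Then $\prod_{j=1}^{r}\prod_{k:\,n_{kj}\neq 0}p_{k.|.j}^{n_{kj}}$ is a value $c$ that does not depend on the particular choice of $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$, nor on $\theta$.

Proof. Under the uniform conditional distribution assumption we have:

\[
p_{k.|.j}=\begin{cases}\dfrac{1}{\#A_j} & \text{if } a_k\in A_j,\\[4pt] 0 & \text{otherwise.}\end{cases}
\]

Therefore,

\[
\prod_{j=1}^{r}\prod_{k:\,n_{kj}\neq 0}p_{k.|.j}^{n_{kj}}=\prod_{j=1}^{r}\left(\frac{1}{\#A_j}\right)^{\sum_{k:\,n_{kj}\neq 0}n_{kj}}=\prod_{j=1}^{r}\left(\frac{1}{\#A_j}\right)^{n_{.j}}.
\]

✷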

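Corollary 4. If the uniform conditional distribution assumption holds then, for all $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$, $p(\mathbf{z};\theta)=c\,p(\mathbf{y};\theta)$, where $c$ depends neither on the particular $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$ nor on $\theta$, and therefore

\[
\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}p(\mathbf{z};\theta)=\arg\max_{\theta\in\Theta}\min_{\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}}p(\mathbf{z};\theta)=\arg\max_{\theta\in\Theta}p(\mathbf{y};\theta).
\]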
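A small numerical check of this corollary: under the uniform conditional distribution assumption, every complete sample compatible with $\mathbf{y}$ should have the same total likelihood, equal to $c\,p(\mathbf{y};\theta)$ with $c=\prod_{j}(1/\#A_j)^{n_{.j}}$. The observed sample and the visible distribution below are hypothetical values chosen for illustration.

```python
import itertools
import math

# Hypothetical observed sample of subsets of X = {1, 2, 3}.
y = [frozenset({1}), frozenset({1, 2}), frozenset({2, 3}),
     frozenset({1, 2, 3}), frozenset({2, 3})]

# Assumed visible distribution p_.j on the observed subsets (sums to 1).
p_visible = {frozenset({1}): 0.2, frozenset({1, 2}): 0.3,
             frozenset({2, 3}): 0.4, frozenset({1, 2, 3}): 0.1}

def p_joint(x_i, y_i):
    """p_kj under the uniform conditional assumption: p_.j / #A_j if a_k is in A_j."""
    return p_visible[y_i] / len(y_i) if x_i in y_i else 0.0

p_y = math.prod(p_visible[y_i] for y_i in y)   # visible likelihood p(y; theta)
c = math.prod(1 / len(y_i) for y_i in y)       # prod_j (1 / #A_j)^{n_.j}

# Enumerate every latent sample x compatible with y and evaluate p(z; theta).
total_liks = [math.prod(p_joint(x_i, y_i) for x_i, y_i in zip(x, y))
              for x in itertools.product(*map(sorted, y))]

print(len(total_liks), min(total_liks), max(total_liks), c * p_y)
```

All compatible completions yield the same value, so the maximax and maximin problems on $p(\mathbf{z};\theta)$ reduce to maximizing the visible likelihood.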

The next results concern the imprecisiation process and involve matrix $M$:

Definition 2. We say that the parameter $\theta\in\Theta$ is separable with respect to the matrix $(M|p)$ if it can be “separated” into two (maybe multidimensional) components $\theta_3\in\Theta_3$, $\theta_4\in\Theta_4$ such that $\Theta=\Theta_3\times\Theta_4$ and $p^{\theta}_{.j|k.}$ and $p^{\theta}_{k.}$ can be respectively written as functions of $\theta_3$ and $\theta_4$.

This type of separability corresponds to the notion of “distinct parameters” in the sense of Heitjan and Rubin [21] in the context of coarse data, and Little and Rubin [28] in the context of missing data.

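Proposition 5. If $\theta$ is separable wrt $(M|p)$ then, given a specific sample $\mathbf{x}\in\mathcal{X}^N$ and the corresponding $\mathbf{z}\in(\mathcal{X}\times\mathcal{Y})^N$ induced by $\mathbf{x}$ and $\mathbf{y}$, $\arg\max_{\theta\in\Theta}L^{\mathbf{z}}(\theta)\subseteq\arg\max_{\theta\in\Theta}L^{\mathbf{x}}(\theta)$.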

Proof. The proof of this result is similar to the one given in Proposition 1. ✷

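Remark 3.2. Proposition 5 assumes a fixed sample $\mathbf{x}\in\mathcal{X}^N$. Let us notice that the separability wrt $M$ does not imply that the respective solutions of both maximax problems, $\theta^{**}=\arg\max_{\theta\in\Theta}\overline{L}^{\mathcal{X}^{\mathbf{y}}}(\theta)$ and $\theta^{*}=\arg\max_{\theta\in\Theta}\overline{L}^{\mathcal{Z}^{\mathbf{y}}}(\theta)$, coincide. They …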

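Proposition 6. Let $\mathbf{y}=(y_1,\ldots,y_N)\in\mathcal{Y}^N$ denote the observed sample. Let us suppose that $\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{.j|k.}^{n_{kj}}$ is a value $c$ that does not depend on the particular choice of $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$, nor on $\theta$. Then, for all $\mathbf{x}\in\mathcal{X}^{\mathbf{y}}$ and the corresponding $\mathbf{z}\in\mathcal{Z}^{\mathbf{y}}$ we have $p(\mathbf{z};\theta)=c\,p(\mathbf{x};\theta)$ and therefore $\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}p(\mathbf{x};\theta)=\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}p(\mathbf{z};\theta)$.

Proof.

\[
p(\mathbf{z};\theta)=\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{kj}^{n_{kj}}=\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}(p_{.j|k.}\cdot p_{k.})^{n_{kj}}=\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{.j|k.}^{n_{kj}}\cdot p_{k.}^{n_{kj}}
\]
\[
=\prod_{k=1}^{m}p_{k.}^{\sum_{j=1}^{r}n_{kj}}\cdot\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}(p_{.j|k.})^{n_{kj}}=\prod_{k=1}^{m}p_{k.}^{n_{k.}}\cdot c=p(\mathbf{x};\theta)\cdot c.
\]

✷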

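Proposition 7. Let $\mathbf{y}=(y_1,\ldots,y_N)\in\mathcal{Y}^N$ denote the observed sample. Let us suppose that $\{A_1,\ldots,A_r\}$ forms a partition of $\mathcal{X}$ or that the superset assumption is satisfied. Then $\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{.j|k.}^{n_{kj}}$ is a constant $c$.

Proof. On one hand, we can easily check that, if $\{A_1,\ldots,A_r\}$ forms a partition of $\mathcal{X}$, then $\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{.j|k.}^{n_{kj}}=1$. Now, let us check that the above condition holds under the superset assumption. Under the superset assumption we have already shown that $p_{.j|k.}=\frac{1}{2^{m-1}}$ if $a_k\in A_j$, and 0 otherwise. Therefore,

\[
\prod_{k=1}^{m}\prod_{j:\,n_{kj}\neq 0}p_{.j|k.}^{n_{kj}}=\prod_{k=1}^{m}\left(\frac{1}{2^{m-1}}\right)^{\sum_{j:\,n_{kj}\neq 0}n_{kj}}=\prod_{k=1}^{m}\left(\frac{1}{2^{m-1}}\right)^{n_{k.}}=\left(\frac{1}{2^{m-1}}\right)^{N}.
\]

✷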

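Corollary 8. If any of the following conditions is satisfied:

• $\{A_1,\ldots,A_r\}$ forms a partition of $\mathcal{X}$

• The superset assumption holds

then $p(\mathbf{z};\theta)=c\,p(\mathbf{x};\theta)$ and therefore $\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}p(\mathbf{x};\theta)=\arg\max_{(\mathbf{x},\theta)\in\mathcal{X}^{\mathbf{y}}\times\Theta}L^{\mathbf{z}}(\theta)$. Furthermore $c=1$ in the first case.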
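The superset case of this corollary can be illustrated numerically. The sketch below assumes $m=3$ latent outcomes, an arbitrary latent distribution and a hypothetical complete sample; under the superset assumption each of the $2^{m-1}$ supersets of $\{a_k\}$ is reported with the same probability, so the total and latent likelihoods should differ by the factor $c=(1/2^{m-1})^N$.

```python
import itertools
import math

X = (1, 2, 3)                         # latent outcomes a_1, ..., a_m
m = len(X)
p_latent = {1: 0.5, 2: 0.3, 3: 0.2}   # assumed latent distribution p_k.

# All non-empty subsets of X: the possible imprecise observations A_j.
subsets = [frozenset(s) for size in range(1, m + 1)
           for s in itertools.combinations(X, size)]

def p_joint(a_k, A_j):
    """p_kj under the superset assumption: every superset of {a_k} is equally likely."""
    return p_latent[a_k] / 2 ** (m - 1) if a_k in A_j else 0.0

# For each a_k the conditional probabilities p_.j|k. sum to one.
row_sums = [sum(p_joint(a, A) for A in subsets) / p_latent[a] for a in X]

# A hypothetical complete sample z = ((x_i, y_i)) with x_i in y_i.
z = [(1, frozenset({1})), (2, frozenset({1, 2})),
     (1, frozenset({1, 2, 3})), (3, frozenset({2, 3}))]
N = len(z)

p_z = math.prod(p_joint(x_i, y_i) for x_i, y_i in z)   # total likelihood
p_x = math.prod(p_latent[x_i] for x_i, _ in z)         # latent likelihood
c = (1 / 2 ** (m - 1)) ** N

print(row_sums, p_z, c * p_x)
```

Because $c$ depends neither on the completion nor on $\theta$, ranking the pairs $(\mathbf{x},\theta)$ by $p(\mathbf{z};\theta)$ or by $p(\mathbf{x};\theta)$ gives the same result.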

Most approaches in statistical inference insist on the necessity of having a statistical model of the observation process. In that case the natural likelihood function to maximize is $p(\mathbf{z};\theta)$. In contrast, if we maximize the latent likelihood function $p(\mathbf{x};\theta)$ directly with respect to both $\mathbf{x}$ and $\theta$, we in some way give up the idea of providing a statistical model for the observation (imprecisiation) process. The superset assumption can be used to justify the use of $p(\mathbf{x};\theta)$ by providing such a statistical model, since in that case maximizing $p(\mathbf{z};\theta)$ is the same as maximizing $p(\mathbf{x};\theta)$ (this is the message apparently carried by the authors of [23], for instance). When no information about the measurement process is available, it is to us an open question whether one should maximize $p(\mathbf{x};\theta)$ or $p(\mathbf{z};\theta)$ with respect to both $\mathbf{x}$ and $\theta$. Indeed, the two corresponding MLEs may differ, as the examples in the following section indicate.

4. Comparing the maximum likelihood strategies on examples

Based on some of the results provided in Subsection 3.3, we can compare the various choices of likelihood functions (maximization of $L^{\mathbf{y}}(\theta)$, $L^{\mathbf{x}}(\theta)$, $L^{\mathbf{z}}(\theta)$) in an imprecise environment, on the basis of the acceptability of the results obtained by their maximization on a number of prototypical examples. These examples shed light on the nature of the maximin and the maximax strategies, as opposed to maximizing the visible likelihood function. We consider several settings where imprecise data occur in various ways: the possible observations may form a partition of the space $\mathcal{X}$, the data may be either precise or missing, and the general case where imprecise data may overlap.

4.1. Observations forming a partition: separable case

The first example illustrates the situation where imprecise data form a partition of $\mathcal{X}$ and the parameter $\theta$ is separable.

Example 3. Consider the random experiment that consists in rolling a die. We do not know whether the die is fair or not. Suppose that the person who rolls it just tells us whether the outcome is even or odd. Let $X$ be the random variable denoting the actual outcome of the die roll ($a_i=i$ for $i=1,\ldots,6$) and let $Y$ be a binary variable taking the values $b_1$ (odd) and $b_2$ (even). So, the collection $\{A_1=\{1,3,5\},A_2=\{2,4,6\}\}$ determines a partition of the whole set of possible outcomes $\mathcal{X}=\{1,\ldots,6\}$. Let the 6-dimensional vector $(p_{1.},\ldots,p_{6.})$ represent the actual (unknown) probability distribution of $X$, where $p_{i.}=P(X=a_i)$, $i=1,\ldots,6$ and $p_{6.}=1-\sum_{i=1}^{5}p_{i.}$.

Let $p_{.2}=\pi=p_{2.}+p_{4.}+p_{6.}$ denote the probability of getting an even number. The six-dimensional vector $(p_{1.},\ldots,p_{6.})$