Data management in precision farming

HAL Id: hal-02788369

https://hal.inrae.fr/hal-02788369

Submitted on 5 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution - ShareAlike 4.0 International License

Pierre Blavy

To cite this version:

Pierre Blavy. Data management in precision farming. Data management in precision farming (Data management in precision farming), 2017. ⟨hal-02788369⟩

(2)

Data management in precision farming

Pierre Blavy, INRA

(3)

Matching the question to the data

What's the question?

What can I do with existing data? (meta-analysis)

(4)

Matching the question to the data

Practical decision

- Real time (important for large farms)
  - On-farm alerts (calving, heat, ...)
- Long term
  - Herd management tools
  - Planning assistants

vs.

Research question

- Impact of feeding on production/environment
- Genetic selection on feed efficiency?

(5)

Costs vs. benefits

Illustration: Plonk et Replonk

⇒ Cost vs. benefit tradeoff

- money
- time

(6)

Get consistent data

Farm-proof systems (machine robustness)

Know what you measure

- What is the data (definition, format, unit)?
- In which context is it measured / usable?

(7)

Organize data

Use databases (SQL is your friend)

- SQL: https://www.w3schools.com/sql/
- SQLite in R: http://cpc.cx/iZj

Maximize automation (program the boring, repetitive stuff)

- With R: http://cpc.cx/kQJ
- With Python: http://cpc.cx/kQK
- Other software is fine too

Write documentation

- Someday it will save someone
- That someone may be you
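As a minimal sketch of the "use databases" and "automate" advice, the snippet below stores measurements in an SQLite database and summarizes them with one SQL query. The table name `milk_yield`, its columns, and the sample values are hypothetical and only serve as an illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database holding a hypothetical milk_yield table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE milk_yield (
            cow_id   TEXT NOT NULL,
            day      TEXT NOT NULL,  -- ISO date, e.g. 2017-01-01
            yield_kg REAL            -- NULL marks a missing measurement
        )
        """
    )
    rows = [
        ("A", "2017-01-01", 30.0),
        ("A", "2017-01-02", 32.0),
        ("B", "2017-01-01", 25.0),
        ("B", "2017-01-02", None),  # sensor failure: stored as NULL
    ]
    conn.executemany("INSERT INTO milk_yield VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_yield_per_cow(conn):
    """Return {cow_id: mean yield}; SQL AVG ignores NULL values."""
    cur = conn.execute(
        "SELECT cow_id, AVG(yield_kg) FROM milk_yield "
        "GROUP BY cow_id ORDER BY cow_id"
    )
    return dict(cur.fetchall())
```

Declaring the unit in the column name (`yield_kg`) is one cheap way to document what is measured directly in the schema.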

(8)

Watch your data

Make plots

- R: http://cpc.cx/kQL
- R + ggplot2: http://cpc.cx/kQM
- R + shiny: https://shiny.rstudio.com/

Descriptive stats

- mean, median, min, max, ...
- proportion of missing values
- R: summary
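The descriptive statistics listed above can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the convention that `None` marks a missing value are assumptions, and the function expects at least one non-missing value.

```python
import statistics


def describe(values):
    """Summarize a list of numbers in which None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing_fraction": (len(values) - len(present)) / len(values),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }
```

A mean far from the median, as with `describe([1.0, 2.0, None, 3.0, 10.0])`, is a quick hint that the distribution is skewed or contains extreme values worth plotting.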

(9)

Principal Component Analysis

Sadly not my little sister

PCA maximizes the projected variance

- Orthogonal axes
- R, FactoMineR: http://cpc.cx/kQQ

PCA is a good tool for

- Summarizing data (check inertia)
- Finding outliers (check individual weights)
- Constructing synthetic variables (check axis contributions)

PCA is NOT a tool for

- Accurate correlations ⇒ look at the correlation matrix instead
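To make "maximizes the projected variance" concrete, here is a minimal pure-Python sketch that finds the first principal axis by power iteration on the covariance matrix. It is not what FactoMineR does internally. The assumptions are that variables are centered but not standardized, that there are at least two individuals with non-zero variance, and that the arbitrary starting vector is not orthogonal to the first axis. The function name is hypothetical.

```python
import math


def first_principal_axis(rows, iterations=200):
    """Return (axis, inertia_share) of the first principal component.

    rows: list of equal-length lists of floats (individuals x variables).
    inertia_share: fraction of the total variance carried by the axis.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]

    def apply_cov(v):
        return [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]

    # Power iteration: apply the covariance matrix repeatedly and renormalize.
    axis = [1.0 / (j + 1) for j in range(p)]  # arbitrary asymmetric start
    for _ in range(iterations):
        new = apply_cov(axis)
        norm = math.sqrt(sum(x * x for x in new))
        axis = [x / norm for x in new]

    # Variance along the axis (its eigenvalue) over the total variance.
    eigenvalue = sum(a * b for a, b in zip(axis, apply_cov(axis)))
    total = sum(cov[i][i] for i in range(p))
    return axis, eigenvalue / total
```

For points lying exactly on the line y = 2x, the returned axis is proportional to (1, 2) and the inertia share is 1: a single synthetic variable summarizes the whole dataset.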

(10)

Data preprocessing

(11)

Data preprocessing

Illustration: Calvin and Hobbes, Bill Watterson

Remove irrelevant variables

- Remove unrelated variables (spurious correlation)
  - Crucial on datasets with many variables
- Remove a posteriori variables
  - Ex: crash survival at day 7 = alcohol + crash speed + rehabilitation duration (rehabilitation duration is only known after the outcome)

Remove irrelevant data

- Out-of-scope data (ex: Holstein model, Montbéliarde data)
- Outliers (PCA, boxplots, ...)
  - Wrong data (ex: broken sensor) ⇒ discard
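The boxplot approach to outliers can be sketched as follows. The 1.5 × IQR fences are the usual boxplot convention, chosen here as an assumption and not taken from the slides. The function name is hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

Flagged values are candidates for inspection. Whether to discard them depends on whether they are wrong (a broken sensor) or simply unusual.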

(12)

Data preprocessing

Data smoothing

OK when we expect few changes between neighbors (in time or space)

Methods

- kernel methods (ex: rolling average)
- rolling median
- splines
- ...
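A rolling average and a rolling median can be sketched with the standard library alone. The centered window and its truncation at the edges of the series are design choices made for this illustration, and the function names are hypothetical.

```python
import statistics


def rolling(values, window, func):
    """Apply func over a centered window; the window is truncated at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(func(values[lo:hi]))
    return smoothed


def rolling_mean(values, window):
    return rolling(values, window, statistics.mean)


def rolling_median(values, window):
    return rolling(values, window, statistics.median)
```

On a flat signal with a single spike, such as `[1, 1, 1, 100, 1, 1, 1]`, the 3-point rolling median removes the spike entirely, while the 3-point rolling average only spreads it over its neighbors.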

(13)

Data preprocessing

Figure: median filter (black: 5-point window, red: 17-point window)

Smoothing strength is a compromise that depends on your question/model.

(18)

Aggregate variables

When?

The same thing is measured multiple times

- Ex: sensor 1 and sensor 2
- Rare gold-standard vs. frequent low-quality measurements

Strong correlation between variables

- Ex: behavioural traits

How?

- Choose one variable
- Build a variable "mix" (ex: linear combination)
- PCA: synthetic variables (a linear combination)

(19)

Is my model OK or garbage?

A model can be anything

- Linear model: http://cpc.cx/kQE , http://cpc.cx/kQF
- Non-linear model: http://cpc.cx/kQG
- A classifier: http://cpc.cx/kQR
- ...
- The devil himself in a unicorn disguise

⇒ Garbage in, garbage out

(20)

Typical evil models: extrapolation

Illustration: xkcd.com, Randall Munroe, CC BY-NC 2.5

(21)

Typical evil models: overfitting

(22)

Typical evil models: outliers

Outlier with heavy weight ⇒ check for outliers

Check your question

- General model: keep Hulk, rename axis
- Dwarf model: discard Hulk

WARNING: evil models exist in unexpected cases

(23)

Test your models

Detector 1       detect mine   detect rock
actual mine           9             1
actual rock          50            50

Detector 2       detect mine   detect rock
actual mine           5             5
actual rock          10            90

Two mine detectors: which one is better?

All mistakes are NOT equivalent

- Be careful: some methods (such as the area under the ROC curve) treat them as equivalent
- When possible, use a real-life criterion (human lives, money, ...)
- When not, define a reasonable criterion a priori.
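The point that all mistakes are not equivalent can be illustrated by weighting each kind of error with a cost. The counts below are those given for the two detectors. The cost values are hypothetical and only stand in for a real-life criterion.

```python
def total_cost(confusion, cost_missed_mine, cost_false_alarm):
    """Total cost of a detector's mistakes.

    confusion maps the true class to (count detected as mine, count detected as rock).
    """
    missed_mines = confusion["mine"][1]  # real mines reported as rocks
    false_alarms = confusion["rock"][0]  # real rocks reported as mines
    return missed_mines * cost_missed_mine + false_alarms * cost_false_alarm


detector_1 = {"mine": (9, 1), "rock": (50, 50)}
detector_2 = {"mine": (5, 5), "rock": (10, 90)}

# Hypothetical costs: a missed mine is 1000 times worse than a false alarm.
cost_1 = total_cost(detector_1, cost_missed_mine=1000, cost_false_alarm=1)
cost_2 = total_cost(detector_2, cost_missed_mine=1000, cost_false_alarm=1)

# Treating every mistake as equivalent reverses the ranking.
naive_1 = total_cost(detector_1, cost_missed_mine=1, cost_false_alarm=1)
naive_2 = total_cost(detector_2, cost_missed_mine=1, cost_false_alarm=1)
```

With these costs the first detector is clearly preferable (1050 against 5010), while counting raw mistakes favors the second (51 against 15). The criterion, not the data, decides which detector is "best".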

(24)

Test your models

What is the difference between model predictions and observations?

- Plot observed vs. predicted
- Goodness of fit (ex: sum of squares)
- Cross-check with other knowledge sources (common sense, literature, meta-analysis, ...)

Model testing

- Use an independent dataset
- Test the model on it, according to your criterion
- See bootstrap-like methods.
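Testing on an independent dataset can be sketched with a straight-line model: fit on one dataset, then compute the sum of squared errors on data the fit never saw. The function names and the choice of a least-squares line are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squares(model, xs, ys):
    """Sum of squared differences between observed and predicted values."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
```

A typical use is `model = fit_line(train_x, train_y)` followed by `sum_of_squares(model, test_x, test_y)`. An error much larger on the independent data than on the fitting data suggests overfitting or extrapolation.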
