Materials and methods - Mise en œuvre d’un système de détection de fraude et de falsification d

5.3.1 Dataset

Since no database of text document images was available to us, we built our own from original documents in our possession and from others downloaded from the Internet.

We used all types of text images, namely images containing only text and images that also contain tables, with the following specifications :

• the size of the documents varies between 693 x 241 and 1240 x 1752 pixels ;

• the resolution of the documents is 400 dpi.

• the images are saved under the PNG extension.

• the color level is in RGB.

• some documents have two logos, a header, a footer, and a signature.

We used three (03) types of original documents, each type being produced five (05) times, for a total of 30 documents in the dataset.

To build the dataset of images forged by copy-move (CM), we used Photoshop because of its extensive capabilities for editing and processing images and documents. The forged images were created as follows :

• digitize the documents in color at a resolution of at least 72 dpi ;

• create forged documents by deleting certain words or characters and replacing them with others taken from the same document.

5.3.2 Choice of implementation tools

Anaconda: a free and open-source distribution of the Python and R programming languages. It provides a simple way to develop applications for data science, machine learning, large-scale data processing, predictive analytics, etc. on Linux, Windows and Mac OS X. It simplifies the management and deployment of packages, whose versions are handled by the conda package management system.

OpenCV (Open Source Computer Vision Library): a free library, originally developed by Intel, specializing in image processing. OpenCV was designed for computational efficiency and focuses on real-time applications. It is free for academic and commercial use. It has C++, Python and Java interfaces and supports operating systems such as Windows, Linux, Mac OS, iOS and Android.

PyCharm: an integrated development environment (IDE) for programming in Python. It offers code analysis, a graphical debugger, unit test management, version control integration, and support for web development with Django, Flask and others.

Flask: an open-source web development framework written in Python. It aims to stay lightweight and to preserve the flexibility of Python programming, and it comes with a template system. It is distributed under a BSD license, i.e. it is free to use.

HTML (HyperText Markup Language): the markup language designed to represent web pages. It is a language for writing hypertext, hence its name. HTML also makes it possible to structure page content semantically and logically, to format it, and to include multimedia resources such as images, input forms and computer programs. It allows the creation of documents that are interoperable with a wide variety of devices, in accordance with web accessibility requirements.

CSS (Cascading Style Sheets): a language that describes the style of an HTML document. It is used to format web documents. Thanks to appearance properties (colors, borders, fonts, etc.) and placement properties (width, height, side by side, top to bottom, etc.), the rendering of a web page can be completely changed without any extra code in the page itself.

JavaScript (often abbreviated as "JS"): a lightweight, object-oriented scripting language, mainly known as the scripting language of interactive web pages but also used on servers through Node.js. JavaScript code is interpreted or compiled on the fly. It is a prototype-based object language with weak, dynamic typing, and it supports several programming paradigms : functional, imperative and object-oriented.

Bootstrap: a collection of tools for designing websites and web applications (graphics, animation, interactions with the page in the browser, etc.). It is a set that contains HTML and CSS code, forms, buttons, navigation tools and other interactive elements, as well as optional JavaScript extensions. It is used to create mobile-friendly web projects that adapt to any type of screen.

5.3.3 Our approach

5.3.3.1 Pre-processing

For detection to succeed, a pre-processing phase is needed to prepare the image for the processing stage. It improves the quality of the image and reduces the elements that could disrupt the transformations applied later. In our method, the pre-processing stage is subdivided into three (03) phases :

• convert the image to grayscale ;

• divide the image into overlapping blocks ;

• apply HDWT.
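The three phases can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names (to_grayscale, haar_ll, overlapping_blocks) are hypothetical, HDWT is assumed to be a one-level Haar wavelet transform, and the blocks are taken from the LL sub-band, following the description of the block division given in this section. Images are represented as nested lists to keep the sketch free of dependencies; a library such as NumPy or OpenCV would normally be preferred for speed.

```python
def to_grayscale(rgb):
    """Convert rows of (R, G, B) tuples to gray levels with the weighted average of eq. 5.1."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def haar_ll(gray):
    """One-level 2-D Haar DWT, returning only the approximation sub-band (LL).

    With orthonormal Haar filters, each LL coefficient is the sum of a
    2 x 2 pixel group divided by 2. A trailing odd row or column is dropped
    (an assumption of this sketch).
    """
    rows = len(gray) - len(gray) % 2
    cols = len(gray[0]) - len(gray[0]) % 2
    return [[(gray[r][c] + gray[r][c + 1]
              + gray[r + 1][c] + gray[r + 1][c + 1]) / 2.0
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


def overlapping_blocks(band, size):
    """Yield ((i, j), block) for every size x size block, sliding one pixel at a time.

    (i, j) is the row and column of the top-left corner of the block.
    """
    rows, cols = len(band), len(band[0])
    for i in range(rows - size + 1):
        for j in range(cols - size + 1):
            yield (i, j), [row[j:j + size] for row in band[i:i + size]]


# Usage on a tiny synthetic 4 x 4 RGB image (illustrative data only).
rgb = [[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)] for _ in range(4)]
gray = to_grayscale(rgb)
ll = haar_ll(gray)                       # 2 x 2 approximation sub-band
blocks = list(overlapping_blocks(ll, 2))
```

Sliding the window by one pixel produces heavily overlapping blocks, so a duplicated region can be found wherever it lies, at the cost of a number of blocks that grows with the area of the sub-band.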

Grayscale

To convert a color image to grayscale, one generally does not take the arithmetic mean of the three primary color intensities but a weighted average. The standard formula giving the gray level from the three components is :

\[ \text{gray} = \operatorname{int}\bigl(\operatorname{round}(0.299\,\text{red} + 0.587\,\text{green} + 0.114\,\text{blue})\bigr) \tag{5.1} \]

Apply HDWT

The main purpose of the wavelet transform is to decompose a signal over a fixed set of basis functions. These functions are called wavelets. They are obtained from a single prototype, the mother wavelet, by shifting and dilation. The DWT divides the signal into low-frequency and high-frequency parts. The low-frequency part contains the coarse signal information, and the high-frequency part contains the information about the edge components.

Divide into overlapping blocks

In this study, the approximation sub-band (LL) is selected. During this phase, the LL sub-band is divided into overlapping blocks of fixed size. Let f be an image of size M x N, let L x L be the block size and let B_ij be a block; we have :

\[ B_{ij}(x, y) = f(x + j,\; y + i), \quad x, y \in \{0, \dots, L-1\},\; i \in \{0, \dots, M-L\},\; j \in \{0, \dots, N-L\} \tag{5.2} \]

The upper bounds M - L and N - L ensure that every block lies entirely inside the image.

5.3.3.2 Matching

Matching consists in finding possible duplicated regions in the image. The main idea of copy-move forgery detection based on lexicographic sorting is to compare the feature vectors of two regions. The rows of the matrix formed by the feature vectors are sorted in lexicographic order, and each pair of adjacent rows is compared. If the difference between two rows is less than a threshold T1, the two corresponding blocks are considered similar. If the number of similar blocks in a region exceeds the threshold T2, the region is considered to have been duplicated. However, the algorithm may find too many matching blocks, including false matches. To improve the precision of the pairing, a lexicographic sorting algorithm based on distance is used.
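The sorting and thresholding logic can be sketched as follows. This is an illustrative sketch rather than the exact algorithm of this work, and it relies on several assumptions: the function name find_duplicates and the parameter min_offset are hypothetical, the difference between two rows is taken to be the Euclidean distance, the count compared with T2 is the number of similar pairs sharing the same shift vector, and the distance-based refinement is interpreted as discarding pairs of blocks that are spatially too close to each other.

```python
import math
from collections import defaultdict


def find_duplicates(features, t1, t2, min_offset=2.0):
    """Return {shift: [(pos_a, pos_b), ...]} for shifts supported by more than t2 pairs.

    features: list of ((i, j), vector) where (i, j) is a block position and
    vector is a tuple of numbers describing the block.
    """
    # Sort the rows of the feature matrix in lexicographic order of the vectors.
    ordered = sorted(features, key=lambda item: item[1])

    by_shift = defaultdict(list)
    # Compare each row with its successor in the sorted matrix.
    for (pos_a, vec_a), (pos_b, vec_b) in zip(ordered, ordered[1:]):
        if math.dist(vec_a, vec_b) >= t1:
            continue  # rows differ by T1 or more: blocks are not similar
        shift = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
        if math.hypot(*shift) < min_offset:
            continue  # blocks too close together: likely overlapping neighbours
        if shift < (0, 0):
            # Normalise the direction so that a pair and its mirror count together.
            shift = (-shift[0], -shift[1])
            pos_a, pos_b = pos_b, pos_a
        by_shift[shift].append((pos_a, pos_b))

    # Keep only the shifts supported by more than T2 similar pairs.
    return {s: pairs for s, pairs in by_shift.items() if len(pairs) > t2}


# Usage with hand-made feature vectors (illustrative data only).
features = [
    ((0, 0), (1, 2, 3, 4)), ((0, 5), (1, 2, 3, 4)),
    ((1, 0), (5, 6, 7, 8)), ((1, 5), (5, 6, 7, 8)),
    ((3, 3), (9, 9, 9, 9)),
]
matches = find_duplicates(features, t1=0.5, t2=1)
```

Sorting brings similar vectors next to each other, so only neighbouring rows need to be compared instead of every pair of blocks.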

5.3.4 Evaluation Metrics

The performance and reliability of CMFD methods are generally evaluated at the image level and at the pixel level [30]. The first approach, i.e. at the image level, focuses on the ability of the detector to identify that the image has been altered, while the second, i.e. at the pixel level, focuses on the ability of the detector to accurately locate the cloned image patches. Since inter- and intra-glyph similarity is an intrinsic property of all document images, the pixel-level evaluation method is used. For our study, we focus on three major pixel-level performance parameters to determine whether an image has been altered or not: the precision (P), the number of false positives (NFP) and the number of false negatives (NFN), which are used to evaluate the effectiveness of block-based CM detection systems.

• Number of True Positives (NTP) : the number of pixels correctly classified as duplicated, relative to the actual number of duplicated pixels.

• Number of False Positives (NFP) : the number of pixels incorrectly classified as duplicated in the detection map, relative to the number of unmodified pixels in the reference map.

• Number of False Negatives (NFN) : the number of forged pixels that were missed by the detector.

• Number of True Negatives (NTN) : the number of pixels correctly detected as not forged.

• Precision (P) : measures the quality of patch localization based on the true positive and true negative rates. It is given by equation 5.3.

P =
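The four pixel counts defined above can be obtained by comparing a detection map with a reference map, as in the following sketch. The function name pixel_counts and the representation of the maps as nested lists of 0/1 values are assumptions made for illustration; the sketch stops at the four counts and does not compute P.

```python
def pixel_counts(detected, reference):
    """Count NTP, NFP, NFN and NTN between two binary maps of equal size.

    A value of 1 (or True) marks a pixel flagged as duplicated in `detected`
    and a pixel that is actually forged in `reference`.
    """
    counts = {"NTP": 0, "NFP": 0, "NFN": 0, "NTN": 0}
    for det_row, ref_row in zip(detected, reference):
        for det, ref in zip(det_row, ref_row):
            if det and ref:
                counts["NTP"] += 1   # forged pixel correctly detected
            elif det:
                counts["NFP"] += 1   # untouched pixel wrongly flagged
            elif ref:
                counts["NFN"] += 1   # forged pixel that was missed
            else:
                counts["NTN"] += 1   # untouched pixel correctly ignored
    return counts


# Usage on two small hand-made maps (illustrative data only).
detected = [[1, 1, 0],
            [0, 0, 0]]
reference = [[1, 0, 1],
             [0, 0, 0]]
counts = pixel_counts(detected, reference)
```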

We ran our implementation on different images whose sizes vary between 693 x 241 and 2583 x 1163. Table 5.4.1 presents the results of the implementation on thirty (30) images taken from our data.

Table 5.1 – Results of the implementation on different text document images.

Source document: Mise en œuvre d’un système de détection de fraude et de falsification de documents scannés (pages 70-73).