• Aucun résultat trouvé

Presentation

N/A
N/A
Protected

Academic year: 2022

Partager "Presentation"

Copied!
21
0
0

Texte intégral

(1)

Syllable-based

compression for XML

Katsiaryna Chernik,

Jan Lánský, Leo Galamboš Dept. of Software Engineering

Faculty of Mathematics and Physics Charles University

(2)

Content

|

Motivation

|

Syllable-based compression

|

XMLSyl

|

XMillSyl

|

Results

|

Conclusion

(3)

Motivatoin

| XML

z Simple text format for structured text documents

z Data exchange standard

z Hight redundancy

(4)

Compression Methods for XML

| Character-based

z XMill

z XMLPPM

z XGrind, ...

| Word-based ?

| Syllable-based ?

(5)

Syllable-based compression

| LZWL

z Dictionary-based method

z Syllable-based version of LZW

| HufSyl

z Statistical method

z Adaptive Huffman coding

z Inspired by HuffWord

(6)

Syllable-based compression

| Syllable-based compression is suitable for languages with rich morphology (Czech)

| Syllable-based compression is

suitable for small or middle-sized files

(7)

Syllable-based compression of XML

| Majority of XML documents are small or middle-sized

| Many text-like XML documents

z news in RSS format

z documentations or books in DocBook format

Syllable-based compression and XML?

(8)

XMLSyl

Idea

| Syllable-based compressor

z XML tokens are divided to many syllables

| XMLSyl

z XML tokens are treated as single syllables

(9)

XMLSyl

Architecture

SAX Parser Structure Encoder

Element Container Data and Structure Container XML Document

Attribute Container

Compressed XML document

Syllable Compressor Syllable Compressor

(10)

XMLSyl

Example

XML doc:

<book>

<title lang="en">XML</title>

</book>

SAX events:

startElement(“book”)

startElement(“title”,(“lang”,”en”)) characters(“XML”)

endElement(“title”) endElement(“book”)

(11)

XMLSyl

Example – Encoding process

SAX events:

startElement(“book”) characters(“XML”) endElement(“title”) endElement(“book”)

Attribute Container Element Container

Data and Structure Container book E0

title E1

lang A0

E1

E0 A0 en END_ATT

CHAR XML END_CHAR END_TAG END_TAG startElement(“title ,(“lang”,”en”))

(12)

XMLSyl

Implementation details

| SAX parser – EXPAT

| Syllable Compressor – LZWL and HufSyl

| Encoding was inspired by existing XML compression methods

z XMLPPM, XGrind, XPress, XMill

(13)

XMillSyl

| Based on XMill

| Main principles of XMill

z Separating structure from data

z Grouping Data values with related meaning

(14)

Architecture of XMill

SAX Parser

Path Processor

Structure Container Data Container1 Data Container k Input XML file

Data Container 2

Compressed XML file

gzip gzip gzip gzip

(15)

XMill – one container

SAX Parser

Path Processor

Structure Container Data Container Input XML file

Compressed XML file

gzip gzip

(16)

XMillSyl

Architecture

SAX Parser

Path Processor

Structure Container Data Container1 Data Container k Input XML file

Data Container 2

Compressed XML file

gzip LZWL / HufSyl LZWL / HufSyl LZWL / HufSyl

(17)

Syllable-based compression of XML Experimental results

XMLSyl & XMillSyl vs. LZWL & HufSyl

| Non-textual XML data

z 50-60% better

| Textual XML data

z 10-20% better

(18)

Syllable-based compression of XML Experimental results

XMillSyl more containers XMillSyl one container

XMLSyl

Text-like XML documents

(19)

Syllable-based compression of XML Experimental results

XMLSyl

| XMLHuf is suitable for small-sized files

| XMLzwl is suitable for large-sized files

XMLSyl vs. XMill

| On average 10-15% worse than XMill

| On some documents the same performance or better

Text-like XML documents

(20)

Conclusion

| New syllable-based compression methods of XML

z XMLSyl (versions: XMLzwl, XMLhuf)

z XMillSyl (versions: XMillzwl, XMillhuf)

| One of our method outperforms XMill on some documents

(21)

Conclusion

| Future work

z extract and utilize the information in the DTD section

z create a special syllable dictionary for elements and attributes

z compress HTML data

Références

Documents relatifs

First when a node that references an external file is reached during the file parsing step (1), the format wrapper of the container file calls the kernel (2).Then the kernel loads

Array(2,N22) Air moisture or vapour pressure for left and right (IMOIS = 1, read from file) nodes COMMON/THERMAL/ in Common.for Read in input file or calculated using weather

Raw XML file The metadata are properly encoded, the texts are given as they were crawled, with no enrichment whatsoever.. XML file

MR - MOOSE requires four input files: a data file, a model file, a setting- file, and the filter data base each containing the necessary informa- tion to perform the fit (see Table

comme du texte standard XML XML est est simple simple XML XML peut représenter peut représenter des des données complexes données complexes La structure La structure des fichiers

The implementation is based on a systematic transformations of XML: (1) a text-to-model transformation of an XML input into an XML meta-model, (2) a model-to-model transformation of

2 http://www.flickr.com/.. to 150 photos per location; devset contains 5,118 images while testset 38,300 images) 3 , an xml file containing meta- data from Flickr for all the

„ Text Files - Compression over alphabet of words or syllables... Influence of