are simply not available; or the data, methodology, tools, and primary sources are mingled together, badly indexed or not indexed at all, or not linked with the text.
Description and preservation of digital objects are part of the work of traditional academic libraries. For this reason, they generally consider research data curation and management a new challenge, a kind of new frontier for the development of their campus services, either at a local level or as part of a scientific network (CLIR 2013). For the same reason, we started working on this topic in 2013. Empirical results and recommendations are based on our research at the University of Lille 3, a large social sciences and humanities campus in the north of France, with 19,000 students and nearly 500 PhD candidates spread over three graduate schools and 55 doctoral degrees. The project is ongoing.
Today, the rapid development of data-driven research (e-Science) and the debate on open data and the re-use of research results have led us to discover another challenge in the field of PhD dissertations, beyond the debate on open access and embargo: the existence of large amounts of small data produced by PhD candidates and partly submitted together with the text of the dissertation. These small data are the topic of our paper. We ask how these data can be made available in the context of open access and open data policies, what the potential barriers are, and how academic libraries could contribute to meeting this challenge. Our exploratory study is part of a digital humanities research project on ETDs and research data, with two objectives: (a) create a campus-based service, together with the academic library, the graduate school and research laboratories, to assist PhD students with research data management (RDM) and with the preservation and dissemination of their research results; and (b) develop content mining tools for the further exploitation of these data.
At least in our sample, academic libraries appear to be more interested and competent in OA than graduate schools or committees (PA), which are officially not directly concerned with the processing and dissemination of PhD theses. They do not monitor or follow up document processing statistics (and do not communicate with each other about the topic). Even if individual members of a graduate school or committee may express personal attitudes towards OA, no significant institutional opinion or policy can be identified. Sometimes their information may simply be misleading. For instance, one respondent stated that “all dissertations of my faculty are of course fully OA. Anything else would not make sense nowadays.” However, the analysis of the institutional repository’s statistics revealed that only 50% of the theses are actually OA, which means that this PA does not know the candidates’ decisions.
• Aims: A selection of scientific papers is introduced as a modest addendum to Hartley’s
Academic writing and publishing: A practical handbook (2008). These materials are
masterpieces of unconventional academic writing and publishing.
• Methods: I collected unconventional academic papers over the past few years from a variety of sources: readings, informal chats with colleagues at the coffee machine, online forums, social media, and so on. Most I found serendipitously. Each paper was systematically filed in a folder on my computer upon encounter. As February 2015 approached, I edited all these materials to form the present paper, following the outline of Hartley (2008). No significance tests were used. No subjects were harmed whatsoever.
Although academic and national boards exist, e.g., JISC [6] (Launch of the Jisc Plagiarism Advisory Service) in the United Kingdom [1], a universal standard for the rate of allowable similar and matched content (referring especially to so-called “self-plagiarism” [7, 8] or to the copying of general phrases in the introduction section [9, 10]) would be desirable in the scientific community, to ensure an ethically fair and equitable treatment of authors. This mainly implies a wide consensus between journals, editors, authors and institutions. With this goal, we recommend the creation of a representative committee to propose appropriate common tools and standards for measuring matched content and similarity rates in scientific documents. In a sophisticated version, the standard might be adjusted to the field of research, as using technical words in science is unavoidable and the similarity rate might therefore be automatically higher. This is also the reason why author names, affiliations and the reference list are obviously excluded from the similarity analysis. Furthermore, similarity in some parts of articles, such as introductions or methods, could be weighted differently from results, discussions and conclusions. As Brumfiel discussed [9], open archives and preprint servers such as arXiv [11] are often misused for plagiarism, and authors with poor English tend to copy phrases from their own earlier work or the work of others. Therefore, the content of previously published paper(s) by the same (group of) author(s) also has to be evaluated to differentiate between self-plagiarism and the correct re-use of previously published works. While similarity and matched-content detection by software is very quick and useful, it could be coupled with human analysis for better efficiency, as evidenced in [12]. The more sophisticated the results of text analysis software are, the more solid the basis on which the editor makes his or her decision.
As Glänzel et al. [13] and the journal Nature [8] have stated, careful human judgement cannot be replaced. An automatic rejection based on a simple similarity value should therefore never occur.
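The n-gram overlap idea behind such tools, and the section weighting proposed above, can be sketched as follows. This is a minimal illustration, not any committee's standard: the n-gram size, the section weights and all function names are illustrative assumptions, and the reference list is assumed to have been stripped beforehand.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in a text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity_rate(doc, earlier_doc, n=3):
    """Fraction of the document's n-grams that also occur in an earlier document."""
    doc_grams = ngrams(doc, n)
    if not doc_grams:
        return 0.0
    return len(doc_grams & ngrams(earlier_doc, n)) / len(doc_grams)

# Hypothetical per-section weights: overlap in introductions and methods
# is penalised less than overlap in results or conclusions.
SECTION_WEIGHTS = {"introduction": 0.5, "methods": 0.5,
                   "results": 1.0, "conclusion": 1.0}

def weighted_rate(sections, earlier_doc, n=3):
    """sections: dict mapping section name -> text (references already excluded)."""
    total = sum(SECTION_WEIGHTS[s] * similarity_rate(t, earlier_doc, n)
                for s, t in sections.items())
    return total / sum(SECTION_WEIGHTS[s] for s in sections)
```

A human editor would then interpret the resulting rate in context, as argued above, rather than apply a hard rejection threshold.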
An academic odyssey: Writing over time
James Hartley · Guillaume Cabanac
Abstract In this paper we present and discuss the results of six enquiries into the first author’s academic writing over the last 50 years. Our aim is to assess whether or not his academic writing style has changed with age, experience, and cognitive decline. The results of these studies suggest that the readability of textbook chapters written by Hartley has remained fairly stable for over 50 years, with the later chapters becoming easier to read. The format of the titles used for chapters and papers has also remained much the same, with an increase in the use of titles written in the form of questions. It also appears that the format of the chosen titles had no effect on citation rates, but that the papers that obtained the highest citation rates were written with colleagues rather than by Hartley alone. Finally, it is observed that Hartley’s publication rate has remained much the same for over fifty years, but that this has been achieved at the expense of other academic activities.
Index Terms—Big Data, Text Classification, Joint Complexity, Combinatorics, Compressive Sensing, Kalman Filter
Social networks have undergone dramatic growth in recent years and have changed the way we communicate, entertain ourselves and actually live. The communication between users has opened a new era with several research challenges, e.g. (a) real-time search has to balance quality, authority, relevance and timeliness of the content; (b) the analysis of relationships between members of a social community can reveal important teams that can be used for specific plans; (c) spam and advertisement detection must prevent the growth of irrelevant content. By extracting the relevant information from social networks in real time, we can address these challenges. In this paper we use the theory of Joint Complexity (JC) to perform topic detection. The evaluation of the proposed method is based on the detection of real-world topics such as the categories of a mainstream news portal. We use large datasets of tweets from politics, economics, sport, technology and lifestyle. We then classify new tweets into these categories.
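Joint Complexity counts the distinct factors (substrings) two sequences share; efficient implementations use suffix trees, but for tweet-length texts a naive sketch conveys the idea. The scoring and classification functions below are illustrative assumptions, not the paper's method.

```python
def substrings(s):
    """All distinct non-empty substrings (factors) of s; quadratic,
    but acceptable for tweet-length sequences."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(x, y):
    """Joint Complexity: number of distinct factors common to x and y."""
    return len(substrings(x) & substrings(y))

def jc_score(tweet, class_texts):
    """Illustrative class score: average JC between a tweet and the
    texts already assigned to a class (e.g. politics, sport, ...)."""
    return sum(joint_complexity(tweet, t) for t in class_texts) / len(class_texts)
```

A new tweet would then be assigned to the class maximising `jc_score`, under the assumption that higher shared complexity indicates topical closeness.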
Abstract. Introduction: Following any oral surgery procedure, postoperative pain is an inevitable outcome and can be described as moderate to severe. Pain management is essential for the comfort and well-being of patients. Topical delivery, and more specifically transmucosal delivery systems, seem to be of great value for the development of new pain management strategies. Method: A systematic literature review was performed using the PubMed Central database. Only PubMed Central-indexed publications were selected, and they were included if they described i) a human clinical study with pharmacokinetic and/or pain relief assessment, ii) the delivery of analgesics or NSAIDs for analgesic purposes, and iii) a biomaterial for topical delivery. Results: Ten articles were selected, among which 4 pharmacokinetic studies and 8 studies describing pain relief. Six of the selected articles were well designed with a good scientific level of evidence (level 2), and 4 of them had a low level of evidence. Discussion: The clinical investigations demonstrated good analgesia and rapid pain relief, with a decrease of the administered doses compared to oral administration. Moreover, these topical analgesics were well tolerated by the patients. A number of devices have been developed for topical delivery after oral surgery procedures. Except for a gelatin sponge and a hydroalcoholic gel, most of the devices were made of cellulose and its derivatives. The authors reported that the materials showed good retention at the site of application and that the release of the analgesic was well controlled over time. Conclusion: Well-conducted large clinical trials are, however, still missing to validate the absence of side effects.
In this paper, we tackle the non-convex problem of topic modelling where agents have sensitive text data at their disposal that they cannot or do not want to share (e.g., text messages, emails, confidential reports). More precisely, we adapt the Latent Dirichlet Allocation (LDA) model to decentralized networks. We combine recent work on online inference for latent variable models, which adapts online EM with local Gibbs sampling in the case of intractable latent variable models (such as LDA), with recent advances in decentralized optimization [3, 4]. This online inference method is particularly well suited to the decentralized framework, as it consists in iteratively updating sufficient statistics, which can be done locally. After presenting our DELEDA (for Decentralized LDA) algorithm, we give a brief sketch of a convergence proof. We then apply our new method to synthetic datasets and show that, after enough iterations, it recovers the same parameters and has performance similar to that of the online method.
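The "update sufficient statistics locally, then average over the network" scheme can be illustrated with a minimal sketch. This is not the DELEDA algorithm itself (which interleaves local Gibbs sampling); the uniform gossip weights, the mixing rate `rho` and the plain-list representation of the statistics are illustrative assumptions.

```python
def gossip_round(stats, adjacency):
    """One synchronous gossip round: each agent replaces its sufficient
    statistics with the average over itself and its neighbours."""
    n = len(stats)
    out = []
    for i in range(n):
        group = [j for j in range(n) if adjacency[i][j]] + [i]
        out.append([sum(stats[j][k] for j in group) / len(group)
                    for k in range(len(stats[i]))])
    return out

def decentralized_step(stats, local_stats, adjacency, rho=0.1):
    """Online-EM-style update: blend each agent's running statistics with
    freshly computed local statistics, then average over the network."""
    blended = [[(1 - rho) * s + rho * u for s, u in zip(si, ui)]
               for si, ui in zip(stats, local_stats)]
    return gossip_round(blended, adjacency)
```

Because each blend uses only an agent's own data and each averaging step uses only its neighbours, no raw text ever leaves an agent, which is the point of the decentralized setting.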
First of all, we consider that a user expresses an information need by topic only, that is to say that there is no comment part in a user’s query. For this reason, every query term is considered a topic in our approach. Document sentences, on the contrary, contain both topic and comment parts. Since users are assumed to be interested in comments about their topic of interest, we hypothesize that the matching model should treat topic/query and comment/query matching differently. Furthermore, we can assume that matching topics implies that the associated comments are considered relevant information. Thus, the importance of each topic in a document depends not only on its frequency, but also on the number of related comments, i.e. how well the topic is explained in the document. We propose to take the logarithm of this number in order to smooth its influence. On the other hand, some topics may be too specific and thereby linked to few comments. We therefore introduce a measure of the specificity of a topic t, the Inverse Comment Frequency, ICF(t).
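The weighting described above can be sketched as follows. Since the exact formula for ICF(t) is not reproduced here, the sketch assumes an IDF-like form; both that form and the document representation are illustrative assumptions, not the paper's definitions.

```python
import math

def icf(topic, documents):
    """Assumed IDF-like Inverse Comment Frequency: topics that attract
    comments in many documents are considered less specific."""
    n_with_comments = sum(1 for d in documents if d["comments"].get(topic, 0) > 0)
    return math.log((1 + len(documents)) / (1 + n_with_comments))

def topic_weight(topic, doc, documents):
    """Topic importance: term frequency, smoothed by the logarithm of the
    number of comments attached to the topic, scaled by ICF."""
    tf = doc["tf"].get(topic, 0)
    n_comments = doc["comments"].get(topic, 0)
    return tf * math.log(1 + n_comments) * icf(topic, documents)
```

The `log(1 + n_comments)` factor implements the smoothing mentioned above: a topic explained by many comments gains weight, but sublinearly.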
Both workflows have in common that the dissertation and the data are separated, that this separation happens before or during the deposit of the dissertation, that the deposit is preceded by data curation, and that dissertation and data are not stored and disseminated in the same repository. For universities in Slovenia, especially the University of Ljubljana, it has been a long process to prepare and revise the legal framework needed to support a mandatory ETD workflow (Ojsteršek et al., 2014). This process is still not finished, and although the problem of research data was identified, it has not really been tackled yet due to other, more basic, legal problems and to questions of competence and authority. This may change, as the universities will have to adapt to and embrace an open access policy that also includes research data. In the Slovenian National Strategy 2015–2020 on Open Access, research data has become one of the priorities: “Research data financed by public funds should as far as possible be open and accessible with minimal restrictions. Open information must be provided to locate, access, evaluate and understand the data, so that they are useful for others and, if possible, interoperable and consistent with certain quality standards. Open access to research data refers to the right to online access and re-use of digital research data under the conditions specified in the grant agreements. Accessing, mining, exploitation, reproduction and dissemination are free of charge. Justified exceptions must be explained, for example in the interests of national security, the protection of personal data, or the intellectual property rights of private co-financiers. Current Research Information Systems (CRIS) must comply with legal and ethical requirements to ensure open access.
If access to research data is limited for justified exceptions, at least freely accessible metadata must be available, from which it is clear where and under what conditions the research data can be obtained.” A particular challenge is the existence of two distinct systems, one maintained by the University with its institutional repository and the other by the National Library. The current laws on university libraries and the National Library do not mention digital dissertations; they only state that university libraries must obtain and process the compulsory copies of material created and published within the framework of the university, including graduate and master’s theses and doctoral dissertations, and that two copies of the (print) doctoral dissertation are to be sent to the National Library. Electronic versions can be uploaded to the Digital Library of Slovenia, maintained by the National and University Library of Slovenia, only with the written permission of the authors.
Lioma et al. use rhetorical relations from the SPADE parser to re-rank documents. The authors introduced a query likelihood retrieval model based on the probability of generating the query terms from (1) a mixture of the probabilities of generating a query from a document and from its rhetorical relations, and (2) the probability of generating rhetorical relations from a document. One limitation of this approach is that not all types of texts can be parsed this way (e.g. legal texts or item lists have few rhetorical relations). In addition, rule-based parsers, even if they take some statistics into account, are not extensible to other languages. An even more problematic drawback relates to the shortcomings of the discourse parser: such parsers are very time-consuming and cannot be applied to large volumes of data. Lioma et al. state that topic-comment relations as defined by SPADE are extremely sparse in the benchmark IR collections, while in our approach the topic-comment structure is common to all types of texts as well as to all genres.
Our work contributes a methodology for building topic models for legal documents when the content of cited documents is not available. We propose to automatically build networks from cases of the Canadian court. The collection contains thousands of cases and details two major types of citations: the first refers to prior cases, the second to statute laws. We further propose to use these two types of citations to explore the similarity of cases by constructing homophily relationships between them. Furthermore, we use case homophily to improve topic modelling for legal cases. In particular, we build on the relational topic model (RTM), which uses the links between documents during topic modelling. We analyze a publicly available dataset that is part of the COLIEE challenge. We then construct a homophily network consisting of nodes for legal cases and weighted edges for the references. We compare different strategies for using the edge weights in the homophily network as link information for RTM.
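One plausible way to build such a weighted homophily network is sketched below, under the assumption that an edge weight is the number of prior-case and statute citations two cases share. The paper compares several weighting strategies; this is only one of them, and the data structure is illustrative.

```python
from itertools import combinations

def build_homophily_network(cases):
    """cases: dict mapping case_id -> {"prior": set of cited prior cases,
                                       "statutes": set of cited statute laws}.
    Returns a dict mapping case pairs to an edge weight equal to the
    number of citations (prior cases plus statutes) the two cases share."""
    edges = {}
    for a, b in combinations(cases, 2):
        w = (len(cases[a]["prior"] & cases[b]["prior"])
             + len(cases[a]["statutes"] & cases[b]["statutes"]))
        if w > 0:  # only connect cases with at least one shared citation
            edges[(a, b)] = w
    return edges
```

The resulting weighted edges would then be fed to RTM as link information, with the different strategies mentioned above corresponding to different transformations of these raw weights.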
These constraints imply that the strategies that can be used to introduce a topic in a conversation depend on the relation between the current topic of the dialogue and the new topic to be introduced. The first step in generating transition strategies is thus to define this relation. In the context of project Avatar 1:1 (Section 2), we are looking at strategies that an agent can employ, in interaction with an unacquainted user, to make the transition between two discussion phases about two different artworks. In the current work we focus on what seems the most extreme case, namely the transition between discussion phases of two very different artworks: artworks that have nothing in common except the fact that they are both artworks in the same museum. In this way we test whether the agent’s topic manager can indeed be allowed the flexibility to select any given artwork of the museum as the next topic of the discussion. Such flexibility helps to find (initiate) the topic that engages the user most (Glas et al., 2015).
F1-score, meaning that the generalizability of the models is preserved; ergo, they did not overfit on the training domain. So why is it that adversarial training helps in-domain but does not improve cross-domain performance? At this point, we would like to repeat the aforementioned distinction between robustness and generalizability. For us, robustness is related to the ability to understand language in the sense of linguistic flexibility: being able to understand differently worded phrases about the same thing. Generalizability, on the other hand, is the ability of a model to transfer and apply already learnt patterns to a new domain. In our case, an increase in performance for the models tested on cross topics is related to generalizability. While, depending on the task and application field, generalizability and robustness may strongly overlap, we think one has to carefully distinguish them for argument mining. Usually, cross-domain in AM means that the model should be able to detect arguments for a topic unseen during training. Assuming the new topic is not somehow related to the topics seen during training, the model has to infer everything associated with a given input sentence and decide whether it can be an argument related to the topic or not. The problem is that one can only conditionally infer new arguments from existing arguments in the semantic space. If two arguments are structurally similar to a certain degree (or use similar key components), this is possible. But finding new arguments for an unseen domain is beyond language modelling: it also requires a deep understanding of knowledge and common sense, and especially the latter two cannot be efficiently learnt from word co-occurrences alone [19, 10]. As a result, it is not surprising that augmenting the training data with alternative wordings does not improve generalizability.
After all, the examples added for adversarial training are mostly noise with respect to the new, unseen test domain; noise that does not negatively affect the generalizability of the BERT model.
INSA de Rennes, IRISA, TexMex Team, June 2012
The growth of multimedia document collections has made the development of new data access and data structuring techniques a necessity. The work presented in this report focuses on structuring TV shows, and among the different kinds of structuring we approach topic segmentation. Moreover, we are interested in techniques able to provide hierarchical topic segmentation. The motivation for this research lies in the potential impact of these techniques, since they apply directly to navigation and information retrieval. In order to provide an automatic and generic structuring of TV shows, we use the words pronounced in the shows, made available through automatic textual transcription by an ASR system. The proposed topic segmentation algorithm consists in the recursive application of a modified version of TextTiling. It relies on a technique called vectorization, which was recently introduced for linear segmentation and outperformed the other existing techniques. We decided to study vectorization in more depth, since it is a powerful technique, and we tested it for both linear and hierarchical segmentation. The results obtained show that using vectorization can improve segmentation and justify the interest of further applying this technique.
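The block-comparison core of a TextTiling-style segmenter can be sketched as follows. This is vanilla lexical cohesion on sentence windows, not the vectorization variant used in the report; the window size and function names are illustrative, and the recursion over halves (for the hierarchical case) is only indicated.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def gap_scores(sentences, window=2):
    """TextTiling-style cohesion: similarity between the word counts of
    the windows before and after each candidate boundary."""
    scores = []
    for gap in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[gap - window:gap] for w in s.split())
        right = Counter(w for s in sentences[gap:gap + window] for w in s.split())
        scores.append(cosine(left, right))
    return scores

def deepest_boundary(scores, window=2):
    """Pick the gap with the lowest cohesion; recursing on the two
    resulting halves yields a hierarchical segmentation."""
    return min(range(len(scores)), key=scores.__getitem__) + window
```

Applying `deepest_boundary` recursively to each half, down to some minimum segment length, produces the segment hierarchy described above.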
In this paper, we show the impact of context on polarity detection by conducting experiments on datasets from various domains. We use the TREC Blog (Text Retrieval Conference) 2006 data collection with topics from TREC Blog 2006 and 2007 for experimentation purposes. We use a machine learning system and simple features, such as the number of positive words, the number of negative words, the number of neutral words, and the number of adjectives in a text, for polarity detection. We categorize the topics into six classes (Films, Person, Organization, Event, Product, Issue) and show that this categorization improves opinion detection. The goal is not to use a sophisticated level of linguistic analysis, but to show the impact of the topic domain on polarity detection.
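The feature extraction described above can be sketched in a few lines. The toy lexica below are illustrative assumptions (the actual experiments would rely on proper sentiment resources and a part-of-speech tagger for adjectives), but the feature set mirrors the one listed: positive, negative and neutral word counts, adjective count, and the topic class.

```python
# Illustrative toy lexica, not the resources used in the experiments.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}
ADJECTIVES = POSITIVE | NEGATIVE | {"big", "small", "new"}

def polarity_features(text, topic_class):
    """Feature vector for the polarity classifier: counts of positive,
    negative and neutral words, number of adjectives, and the topic
    class (Films, Person, Organization, Event, Product, Issue)."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    adj = sum(w in ADJECTIVES for w in words)
    neutral = len(words) - pos - neg
    return {"pos": pos, "neg": neg, "neutral": neutral,
            "adjectives": adj, "topic_class": topic_class}
```

Including `topic_class` as a feature is what lets the learner condition polarity decisions on the topic domain, which is the effect the experiments measure.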
An in-depth analysis of the links created by the different methods was performed. For instance, Tab. 2 reports the proportion of targets that differ between two systems. While all systems exhibit comparable MAP, the pairwise comparison shows that a large proportion of the proposed links differ between two systems. This proves that the different strategies proposed here are complementary and suggests that all these techniques can be leveraged to propose a wider variety of links than those offered by direct content comparison. We also studied the distribution of the cosine similarity between an anchor and the relevant targets proposed by the various methods. As the topic structure gets more complex, from independent topics to tree structures, the median cosine similarity between anchor and targets gets lower, particularly on the 2013 data. This fact again highlights the potential of topic-based hyperlinking to provide links between segments that share little vocabulary, and potentially to exhibit serendipity.
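The pairwise comparison reported in Tab. 2 can be sketched as follows; the exact definition used in the table is not reproduced here, so the symmetric-difference-over-union formulation and the data layout are illustrative assumptions.

```python
def differing_target_proportion(links_a, links_b):
    """links_a, links_b: dicts mapping an anchor to the list of targets
    proposed by each system. Returns the proportion of proposed targets
    not shared by the other system (symmetric difference over union)."""
    anchors = set(links_a) | set(links_b)
    diff = total = 0
    for a in anchors:
        ta, tb = set(links_a.get(a, [])), set(links_b.get(a, []))
        diff += len(ta ^ tb)   # targets proposed by only one system
        total += len(ta | tb)  # all targets proposed for this anchor
    return diff / total if total else 0.0
```

A value near 1 for two systems with comparable MAP is what supports the complementarity claim above: the systems are equally good but propose largely distinct links.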