Timespans

Background

Archaeological dataset records often express the dating of artefacts as text rather than as absolute numeric years. These textual values come in a variety of formats and are sometimes expressed in different languages. They may carry prefixes such as 'Circa', 'Early', 'Mid' and 'Late', and suffixes such as 'A.D.', 'B.C.' and 'B.P.', that influence the dates intended. This can present a data integration issue, as illustrated in the table below:

Type	Language	Input	Min	Max
Ordinal named or numbered century	English	Early 2nd Century	101	140
	English	Circa Second Century BC	-200	-101
	Italian	XV secolo d.C.	1401	1500
	Italian	intorno a VI sec. d.C.	501	600
	Welsh	pymthegfed ganrif	1401	1500
	Welsh	Canol y15fed ganrif	1430	1470
Year span	English	1450-1460	1450	1460
	English	1485-86	1485	1486
Single year (with tolerance)	English	C. 1485	1485	1485
	English	1540±9	1531	1549
	English	AD400+	400	400
	English	400 AD	400	400
Decade	English	Circa 1860s	1860	1869
	Italian	intorno al decennio 1910	1910	1919
	Welsh	1930au	1930	1939
Century span	English	5th – 6th century AD	401	600
	Italian	VIII-VII secolo a.C.	-800	-601
	Welsh	5ed 6ed ganrif	401	600
Month and year	English	July 1855	1855	1855
	Italian	Luglio 1855	1855	1855
	Welsh	Gorffennaf 1855	1855	1855
Season and year	English	Summer 1855	1855	1855
	Italian	Estate 1855	1855	1855
	Welsh	Haf 1855	1855	1855
Named periods (from lookup)	English	Georgian	1714	1837
	English	Victorian	1837	1901

Normalising this data can make later search and comparison of the records easier. We can do this by supplementing the original values with additional attributes defining the start and end dates of the timespan. This application attempts to match a set of textual values representing timespans against a number of known patterns, and from there to derive the intended start/end dates of the timespan. In some cases the start/end dates are present and can be extracted directly from the text; in most cases, however, some additional processing is required after the initial pattern match is made. The output allows a fairer comparison of the textual date spans often found in datasets.

Because of the wide variety of possible formats (including punctuation and spurious extra text), the matching patterns cannot cater for every free-text variation. Any records not processed by this initial automated method (those whose start/end years are blank) can be manually reviewed and assigned suitable start/end dates.
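The match-then-derive approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two regular expressions, the `PATTERNS` table and the `match_timespan` function are hypothetical and are not the application's actual patterns or code. An unmatched value yields `None`, which corresponds to the blank start/end years left for manual review.

```python
import re
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (min_year, max_year)

def _year_span(m: "re.Match") -> Span:
    # Both years are present in the text and can be used directly.
    return int(m.group(1)), int(m.group(2))

def _single_year(m: "re.Match") -> Span:
    # A single year gives identical min and max values.
    year = int(m.group(1))
    return year, year

# Hypothetical pattern table: each regex is paired with a handler that turns
# the match into a year span. Matching is case insensitive.
PATTERNS: List[Tuple["re.Pattern", Callable[["re.Match"], Span]]] = [
    (re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})\s*$", re.IGNORECASE), _year_span),
    (re.compile(r"^\s*(?:c(?:irca)?\.?\s*)?(\d{3,4})\s*$", re.IGNORECASE), _single_year),
]

def match_timespan(text: str) -> Optional[Span]:
    """Try each known pattern in turn; return None when nothing matches."""
    for pattern, handler in PATTERNS:
        m = pattern.match(text)
        if m:
            return handler(m)
    return None
```

A real pattern set would be far larger and language-specific; the point here is only the ordering of "match first, then derive the years".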

Issues to note

The output dates produced are relative to Common Era (CE). Centuries are set to start at year 1 and end at year 100. Prefix modifiers for centuries take the following meaning in this application:

Prefix	Start	End
Early	1	40
Mid	30	70
Late	60	100
First Half	1	50
Second Half	51	100
First/1st Quarter	1	25
Second/2nd Quarter	26	50
Third/3rd Quarter	51	75
Fourth/4th Quarter	76	100
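As an illustration of how these prefix offsets combine with a century, the following Python sketch reproduces some of the values from the tables above. The names `PREFIX_OFFSETS`, `century_span` and `century_span_bc` are hypothetical, and how the application applies prefixes to BC centuries is not stated here, so the BC helper handles whole centuries only.

```python
from typing import Tuple

# Offsets within a century for each prefix, taken from the table above.
PREFIX_OFFSETS = {
    "early": (1, 40),
    "mid": (30, 70),
    "late": (60, 100),
    "first half": (1, 50),
    "second half": (51, 100),
    "first quarter": (1, 25),
    "second quarter": (26, 50),
    "third quarter": (51, 75),
    "fourth quarter": (76, 100),
}

def century_span(century: int, prefix: str = "") -> Tuple[int, int]:
    """Return (min_year, max_year) for an AD century, narrowed by an optional prefix."""
    base = (century - 1) * 100  # e.g. the 2nd century has base 100
    # An unknown or empty prefix falls back to the whole century (1 to 100).
    start, end = PREFIX_OFFSETS.get(prefix.strip().lower(), (1, 100))
    return base + start, base + end

def century_span_bc(century: int) -> Tuple[int, int]:
    """Return (min_year, max_year) for a whole BC century as negative years."""
    return -century * 100, -(century - 1) * 100 - 1
```

For example, `century_span(2, "Early")` gives `(101, 140)` and `century_span_bc(2)` gives `(-200, -101)`, matching the "Early 2nd Century" and "Circa Second Century BC" rows in the Background table.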

In the case of decades, centuries or stated tolerances, an offset is added to or subtracted from the initially extracted year in order to interpret the overall extent of the year span being expressed (e.g. "1540±9" = min year 1531, max year 1549).
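The offset arithmetic for tolerances and decades can be sketched as below. The function names and regular expressions are assumptions made for illustration, not the application's own code, and they only cover the simple English forms shown in the Background table.

```python
import re
from typing import Optional, Tuple

def tolerance_span(text: str) -> Optional[Tuple[int, int]]:
    """'1540±9' -> (1531, 1549): subtract and add the stated tolerance."""
    m = re.match(r"^\s*(\d{1,4})\s*±\s*(\d+)\s*$", text)
    if not m:
        return None
    year, tolerance = int(m.group(1)), int(m.group(2))
    return year - tolerance, year + tolerance

def decade_span(text: str) -> Optional[Tuple[int, int]]:
    """'Circa 1860s' -> (1860, 1869): add an offset of 9 to the decade start."""
    m = re.match(r"^\s*(?:circa\s+)?(\d{3}0)s\s*$", text, re.IGNORECASE)
    if not m:
        return None
    start = int(m.group(1))
    return start, start + 9
```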

For matches on known named periods (e.g. Georgian, Victorian etc.) the start/end years are derived from suitable authority list lookups.

Usage

Command: timespans -i:{inputFileName} [-o:{outputFileName}] [-l:{languageCode}]

Input File Name (required)

The name (including path) of a text file containing a list of the timespan expressions to be matched, one per line. The matching patterns used are case insensitive. If no match is found, a result is still returned with blank dates, indicating that there was no appropriate match.

Output File Name (optional)

The name (including path) of a text file to write the output to. If this file does not exist it will be created, otherwise it will be overwritten. If this parameter is omitted, the file name used will be the input file name with ".out.txt" appended. The output data format is tab-delimited UTF-8 text, which can easily be used within spreadsheet applications for further processing as required.

Language Code (optional)

The ISO 639-1:2002 language code corresponding to the language of the input data. This hints to the underlying matching process which matching patterns are most appropriate. The languages with the best support so far are English ('en') and Italian ('it'). Some matching patterns are present for Welsh ('cy'), German ('de'), French ('fr'), Spanish ('es') and Swedish ('sv'), but these are currently comparatively underdeveloped and untested with representative data. If the language parameter is omitted or is not one of the recognised values, it defaults to 'en' (English).
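The defaulting rule for the language code could be expressed as in this small sketch. The name `resolve_language` is hypothetical, and trimming and lower-casing the code before comparison is an assumption rather than documented behaviour.

```python
from typing import Optional

# Language codes listed in this document as having matching patterns.
SUPPORTED_LANGUAGES = {"en", "it", "cy", "de", "fr", "es", "sv"}

def resolve_language(code: Optional[str] = None) -> str:
    """Return the language code to use, falling back to 'en' (English)."""
    if code is None:
        return "en"
    normalised = code.strip().lower()  # assumption: comparison ignores case
    return normalised if normalised in SUPPORTED_LANGUAGES else "en"
```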

Examples

Command: timespans -i:{inputFileNameWithPath} [-o:{outputFileNameWithPath}] [-l:{languageCode}]

English examples

Command: timespans -i:myinput.txt -o:myoutput.txt -l:en

myinput.txt: The input is a text file containing one timespan value per line e.g.

1839-1895
1839-75
c.1521
Early 2nd Century

myoutput.txt: The output is a tab delimited text file with years (in ISO 8601 format) assigned to the timespan values

Text Value	Min year	Max year
1839-1895	+1839	+1895
1839-75	+1839	+1875
c.1521	+1521	+1521
Early 2nd Century	+0101	+0140

Italian examples

Command: timespans -i:myinput.txt -o:myoutput.txt -l:it

myinput.txt:

140-144 d.C.
III e lo II secolo a.C.
intorno a VI sec. d.C.
tra IV e III secolo a.C.
575-400 a.C.

myoutput.txt:

Text Value	Min year	Max year
140-144 d.C.	+0140	+0144
III e lo II secolo a.C.	-0300	-0101
intorno a VI sec. d.C.	+0501	+0600
tra IV e III secolo a.C.	-0400	-0201
575-400 a.C.	-0575	-0400
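For further processing outside a spreadsheet, the tab-delimited output can be read back into numeric years. The sketch below is one possible approach in Python: `read_timespans` is a hypothetical helper, and it assumes that the first row is always the header shown above and that unmatched values have empty year columns.

```python
import csv
import io
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, Optional[int], Optional[int]]

def read_timespans(lines: Iterable[str]) -> List[Row]:
    """Parse tab-delimited output rows into (text, min_year, max_year) tuples."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (assumed to be present)
    rows: List[Row] = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Pad short rows so that missing year columns are treated as blank.
        text, min_text, max_text = (row + ["", ""])[:3]
        # int() accepts signed, zero-padded strings such as "+0101" or "-0300".
        min_year = int(min_text) if min_text else None
        max_year = int(max_text) if max_text else None
        rows.append((text, min_year, max_year))
    return rows

# Usage with in-memory sample data; a real file opened with
# open(path, encoding="utf-8", newline="") can be passed in the same way.
sample = (
    "Text Value\tMin year\tMax year\n"
    "Early 2nd Century\t+0101\t+0140\n"
    "575-400 a.C.\t-0575\t-0400\n"
    "unknown text\t\t\n"
)
parsed = read_timespans(io.StringIO(sample))
```

Rows whose years come back as `None` are the ones that would need manual review.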
