Data Conversion: Validating XML
Only you, after looking at your texts and your encoding practices, can do the intellectual work required to convert the texts to support the necessary Text Class data structure. You should do this with the tools you are most comfortable using, whether they are macros in your favorite editor, Perl scripts if you have strong programming skills, OmniMark, or XSLT if your source files are currently XML or can be converted to XML.
The tool runs on one file at a time and prints to standard out, but it can be invoked through a foreach loop to check many files in one command.
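A minimal POSIX sh sketch of that loop (the csh foreach shown in the comment works the same way). The file names are hypothetical, and `wc -c` merely stands in for the actual checker, since it likewise prints its report, filename included, to standard out:

```shell
# Two sample files standing in for your document set (names hypothetical).
mkdir -p samples
printf 'a'  > samples/one.xml
printf 'bb' > samples/two.xml

# csh:  foreach f (samples/*.xml)
#         checker $f
#       end
# POSIX sh equivalent; `wc -c` stands in for the per-file checker:
for f in samples/*.xml; do
  wc -c "$f"
done
```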
If you have a mixed bag of encodings and entities in your documents, there is a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin-1 and UTF-8 in one document at any point in the transformation.
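A sketch of that ordering, with hypothetical file names and `iconv` standing in for whatever transcoder you use: transcode the raw Latin-1 bytes first, so the file is uniformly UTF-8 before any entity work begins.

```shell
# A Latin-1 source containing a raw e-acute byte (0xE9) plus a numeric entity.
printf 'caf\351 &#x2014; dash\n' > source.txt

# Step 1: transcode the whole file Latin-1 -> UTF-8 in one pass, so raw
# bytes and entities are never mixed across encodings in the same file.
iconv -f ISO-8859-1 -t UTF-8 source.txt > step1.txt

# Step 2 (not shown): only now expand the numeric entities to characters,
# e.g. with ncr2utf, against a file that is already uniformly UTF-8.
```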
Now that we know which items need what character treatments, we'll convert them. So, we'll use ncr2utf to convert the entities into the characters. Note that &#38; is the ampersand (as is &#x26;) -- if you convert these to the bare character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Many of you may be in a position where you'll want to be converting your SGML files to XML.
Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002.
We're not doing a foreach this time, but we don't need to echo the filename either, as it is again part of what the tool reports.
This is the sort of information that is useful when the time comes to map characters and encodings in the xpatu data dictionary. The text is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words (&#xE9; for é, for example).
jhove: The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. We now know that both of the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. You'll note we don't need to echo the filename, as that's part of the jhove report. So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is.

Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all.

Nevertheless, we do provide some documentation on strategies, tools, and methods that we have found helpful for data conversion. Some of this documentation is class-specific, and some deals with more general Unicode and XML issues.
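jhove itself has to be installed and configured, but when all you need is a quick pass/fail on UTF-8 validity, `iconv` can serve as a lightweight stand-in (a sketch, not a jhove invocation, and it reports nothing about Unicode blocks; the file name is hypothetical):

```shell
# A file with a lone 0xE9 byte: valid Latin-1, but not valid UTF-8.
printf 'caf\351\n' > findaid.xml

# iconv exits non-zero at the first malformed UTF-8 sequence it sees.
if iconv -f UTF-8 -t UTF-8 findaid.xml > /dev/null 2>&1; then
  echo "findaid.xml: valid UTF-8"
else
  echo "findaid.xml: not valid UTF-8"
fi
```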
You will find four sample files that we'll examine for character encoding and then convert to UTF-8.