Hello from the Lead Programmer

Hi everyone and thanks for your interest in Scripto! We’re excited to bring you this tool, so until it’s released we’ll be posting periodic updates here on this blog. My updates will be technical in nature, but I think it’s important to keep you, the user, informed about our technical decisions. So excuse my jargon and feel free to ask questions in the comments below.

Early development of the Scripto code is in full swing, and, as expected in such a project, we’re facing some interesting questions and complications. Imagine the uncertainty involved in building a bridge between an anonymous content management system and a wiki with peculiar conventions. Even so, we’re committed to using MediaWiki as Scripto’s database and administration tool. In fact, it was a no-brainer:

  • It is the most popular wiki application and has a sizable and active developer community;
  • Wiki markup is relatively easy to learn and there are useful editors available;
  • It offers helpful features, such as discussion pages and user administration;
  • It comes with a powerful, fully-featured API.

One of our first questions was whether to offer document-based or page-based transcriptions. In the former the user transcribes the entire document on one page, whereas in the latter the user transcribes one page at a time. Thankfully, Scripto’s web designer and usability expert, Ken Albers, convinced me that page-based transcription is much more user-friendly. Just think how unwieldy a one-page transcription of a 400-page document would quickly become! So expect to transcribe only one page at a time.

Another question concerned how we would reconcile document and page naming conventions between an anonymous CMS and MediaWiki. You’ll be happy to know that compatibility with virtually any system is our primary goal, but this means that we can make no assumptions about a “correct” naming convention. Most systems will associate their documents and pages with unique keys, which meshes well with MediaWiki’s own naming scheme. But the edge cases will not have, or will choose not to use, unique keys, and will instead use the document and page titles. Unfortunately, MediaWiki has strict naming requirements for its pages, so we had to devise a way to convert potentially incompatible names into an allowed format. After testing several options, I think we have a workable solution.
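
To make the naming problem concrete, here is a minimal sketch of one way arbitrary document and page identifiers could be turned into a wiki-safe page name. It is an illustration only, not necessarily the solution Scripto uses: the function names, the "Scripto" prefix, and the choice of hex encoding are all assumptions. The idea is that MediaWiki rejects characters such as # < > [ ] | { } in titles and limits titles to 255 bytes, so each identifier is encoded to plain hex digits and joined with a separator that can be split on later.

```python
from typing import Tuple

MAX_TITLE_BYTES = 255  # MediaWiki limits page titles to 255 bytes
PREFIX = "Scripto"     # hypothetical prefix, chosen for this sketch only


def encode_title(document_id: str, page_id: str) -> str:
    """Build a wiki-safe title from two arbitrary identifiers.

    Hex encoding yields only the characters 0-9 and a-f, so the result
    cannot contain characters that MediaWiki forbids in titles.
    """
    parts = [value.encode("utf-8").hex() for value in (document_id, page_id)]
    title = ".".join([PREFIX] + parts)
    if len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        raise ValueError("encoded title exceeds the 255-byte title limit")
    return title


def decode_title(title: str) -> Tuple[str, str]:
    """Recover the original identifiers from an encoded title."""
    prefix, document_hex, page_hex = title.split(".")
    if prefix != PREFIX:
        raise ValueError("not a title produced by encode_title")
    document_id = bytes.fromhex(document_hex).decode("utf-8")
    page_id = bytes.fromhex(page_hex).decode("utf-8")
    return document_id, page_id


if __name__ == "__main__":
    # Identifiers that would not be legal as a MediaWiki title as-is.
    title = encode_title("Letters [1862] #3", "page|2")
    print(title)
    print(decode_title(title))
```

The trade-off in this sketch is readability and length: hex doubles the size of each identifier, so very long titles would hit the 255-byte limit, and the resulting page names are not meaningful to a human browsing the wiki directly.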

Another interesting issue we’re facing is how to anticipate future Scripto development models. At Scripto’s core is a software library that interfaces a CMS and MediaWiki. We can’t assume that developers will want to implement this library as a separate application; some may want to integrate it into their own CMS, utilizing a plugin or module interface familiar to them. Accommodating these two development models will require the Scripto library to work within larger application frameworks, so we must consider things like namespaces and sharing HTTP sessions carefully.
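
As a rough illustration of what supporting both development models might look like, the sketch below defines a small interface that any CMS could implement, whether the library runs as a standalone application or inside a plugin. Everything here is hypothetical: the class names, method names, and the in-memory example are assumptions made for illustration and do not describe Scripto's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CmsAdapter(ABC):
    """Hypothetical contract between the library and a host CMS."""

    @abstractmethod
    def document_exists(self, document_id: str) -> bool:
        """Return True if the CMS knows about this document."""

    @abstractmethod
    def page_ids(self, document_id: str) -> List[str]:
        """Return the document's page identifiers in reading order."""

    @abstractmethod
    def page_image_url(self, document_id: str, page_id: str) -> str:
        """Return the URL of the image for one page."""


class InMemoryCms(CmsAdapter):
    """Toy adapter backed by a dict, standing in for a real CMS."""

    def __init__(self, documents: Dict[str, Dict[str, str]]) -> None:
        # Maps document id -> {page id -> image URL}.
        self._documents = documents

    def document_exists(self, document_id: str) -> bool:
        return document_id in self._documents

    def page_ids(self, document_id: str) -> List[str]:
        return list(self._documents[document_id])

    def page_image_url(self, document_id: str, page_id: str) -> str:
        return self._documents[document_id][page_id]


def first_page_image(adapter: CmsAdapter, document_id: str) -> str:
    """Library-side code that depends only on the adapter interface."""
    if not adapter.document_exists(document_id):
        raise KeyError(document_id)
    first_page = adapter.page_ids(document_id)[0]
    return adapter.page_image_url(document_id, first_page)


if __name__ == "__main__":
    cms = InMemoryCms({"doc1": {"p1": "http://example.org/doc1/p1.jpg"}})
    print(first_page_image(cms, "doc1"))
```

Because the library-side function touches only the interface, the same code could be called from a standalone front end or from within a CMS plugin that supplies its own adapter.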

Well, if I lost you, I hope this still serves as an illustration of the questions we’re asking and complications we’re working through during early development, and proof that we’re trying hard to create a transcription tool that’ll be useful for users and developers alike.

Jim

Posted in Code, News
11 comments on “Hello from the Lead Programmer”
  1. Hi Jim

    You might be aware of University College London’s Transcribe Bentham project which is using MediaWiki to capture transcriptions in a crowdsourced way. It looks like Scripto is going to have a much more robust and reusable code and data model, whereas we have chosen to work entirely within MediaWiki, but we will watch your developments with interest, and I hope we can share some ideas and experiences. Features that we are trying out with Transcribe Bentham that may be of interest include support for TEI tags within the user-submitted content, and an embedded tile-based zoom viewer. Drop me a line if you want to get in touch with the TB Team!

    • Jim Safley says:

      We are indeed aware of Transcribe Bentham and are fans of your project. I feel our two approaches to collaborative transcription are complementary. By working within the software, Transcribe Bentham has utilized some advanced MediaWiki features not easily accessed from without, such as templates, categories, and extensions. Conversely, Scripto’s primary focus is speedy implementation, where no modifications to MediaWiki or the existing CMS are necessary. We’ve been following your progress and are very interested in knowledge and technology sharing. Thank you for starting the conversation!

  2. Rintze Zelle says:

    This tool reminds me of the Distributed Proofreaders project (http://www.pgdp.net/).

  3. Hi Jim,

    Thanks for going public with the product development and the source code repository. I think that a lot of us will be very interested to see what you’re doing and the decisions that you’ve made.

    One of the things I like about your page-centric approach is that it’s far easier to aggregate individual page transcriptions into a larger document than it is to break up a large transcription into pages. The downside is that it becomes very difficult to represent groupings like sections or paragraphs that span more than one page. The ProofreadPage plugin for MediaWiki does an excellent job with this, seamlessly including pages into an overall document description, while still providing page-by-page transcription editing and quality control. They address the overlapping markup problem by using markup for section headings rather than sections, and suppressing wrapping tags when pages are aggregated into a document display.

    My own approach with FromThePage emerges from the one-entry-per-page format of the 20th-century diaries I’m transcribing. That conflates entries with pages, but allows me to use my metadata a bit more effectively, e.g. displaying page titles as entry headers when I aggregate page transcriptions into a document.

    I look forward to seeing what you come up with!

    • Jim Safley says:

      Great hearing from you! We’ve been following FromThePage with interest and are impressed with your work. We currently have no plans to reconcile the page spanning problem, beyond stitching the pages together as-is. At this stage in development, our priority is to facilitate transcription of manuscripts for eventual import into a CMS for full text searching. But as we progress, I’m sure we’ll confront these issues more directly.

  4. wragge says:

    Jim & team,

    Great to see things happening — I’ll be following with interest. I was wondering whether you had a set of document-types that you’re using to guide development. I’m thinking particularly of the continuum from structured (eg forms), through semi-structured (eg memos, reports) to unstructured (eg diaries). Will it be possible to set up Scripto to capture structured information in a reusable way? Could Semantic Mediawiki be used rather than the vanilla version to make it easier to expose and manage structured information?

    I’m working on a project (http://invisibleaustralians.org) to crowdsource the extraction of data from forms like this – http://www.zotero.org/groups/invisible_australians/items/collection/2810029 . So I’m wondering whether Scripto will be able to meet our needs.

    • Tim, have you looked at FamilySearch Indexing? They have an amazing tool for entering form data, with a UI that highlights locations on the scanned image that correspond to the field and record the volunteer is transcribing. Unfortunately, their software seems to be entirely closed, so that it’s hard to even find out who wrote it. Perhaps you can do better than I could when I reviewed it three years ago.

      Supporting both structured and free-form manuscripts is very hard to do with the same software. Maybe Scripto will be more successful at it than others–including myself–have been.

    • Jim Safley says:

      Currently we make no assumptions about what document types our users are transcribing. Scripto is simply a bridge between MediaWiki and an anonymous CMS containing document page images. It exists outside the MediaWiki UI, therefore there are certain features and extensions that Scripto users may have a hard time using. Even so, Scripto can take full advantage of extensions that extend the markup, like Semantic MediaWiki.

  5. Peter Bajcsy says:

    There are many existing solutions to the Scripto project problem. Check out the transcription services provided for the Lincoln Papers project (up to 300,000 pages ~ 34TB of image scans, handwritten documents) at http://isda.ncsa.uiuc.edu/lpapers/index.html.

    The transcription is per page and requires a login – see http://isda.ncsa.uiuc.edu/lpapers/index.php?page_ID=236867&num=01-02-03

    • sleon says:

      Thanks for reminding us about the Lincoln papers work. It’s a wonderful and ambitious project! Do you have plans to generalize and release the code so that others can follow your lead?

  6. Matt Phillips says:

    Hi Jim,

    this is a very interesting project. I particularly like the plugin/module interface idea, which will definitely save time and headaches for admins! I know that Scripto is a work in progress but do you have a target release date in mind?
