Introduction to ABBYY FineReader 11

  
  
  
  

Metadata
Title: Introduction to ABBYY FineReader 11
Publisher: School of Information
Creator: Quinn Stewart


Note: Please click "Run Program" to open ABBYY (we use a leased software license). You will also need to go to Tools > Options, select the Scan/Open tab, then under General select "Do not read and analyze acquired page images automatically", and click "OK".

Introduction to Optical Character Recognition (OCR) with ABBYY FineReader 11.

It seems simple, really: scan an image of a page of text, and let the computer turn the image of each letter into the correct letter. Hence the name: "optical" for the image of the page, and "character recognition" for recognizing the characters. If it were that easy, we wouldn't need this tutorial. "Recognition", to me, implies a certain intelligence, an intelligence that computers don't have. OCM would be a better term: what the computer is doing is character "matching", not character recognition.

The process covered here involves the scanning of archival typewritten documents, created with manual typewriters between 1952 and 1958. First, a set of image files is created at 300 dpi, grayscale, with lossless LZW compression. Archival master TIF (Tagged Image File Format) files are saved as separate images, and the ABBYY project file is saved as well. Once the ABBYY project file has been saved, ABBYY saves it automatically from that point forward.
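
To get a feel for why the scan settings matter for disk space, here is a minimal Python sketch of the arithmetic. It assumes 8 bits per channel and letter-size pages (8.5 x 11 inches, as in the transcript), and it estimates the uncompressed size only; the LZW-compressed TIF will usually be smaller. The function name is made up for this illustration.

```python
def scan_size_bytes(width_in: float, height_in: float, dpi: int, channels: int = 1) -> int:
    """Uncompressed image size in bytes, assuming 8 bits (1 byte) per channel."""
    width_px = round(width_in * dpi)    # pixels across the page
    height_px = round(height_in * dpi)  # pixels down the page
    return width_px * height_px * channels


if __name__ == "__main__":
    gray_300 = scan_size_bytes(8.5, 11, 300)               # typewritten transcripts
    color_600 = scan_size_bytes(8.5, 11, 600, channels=3)  # a full-page color scan
    print(f"300 dpi grayscale: {gray_300 / 1_000_000:.1f} MB before compression")
    print(f"600 dpi color:     {color_600 / 1_000_000:.1f} MB before compression")
```

Under those assumptions, a 300 dpi grayscale page is roughly 8.4 MB before compression, and a full letter-size page at 600 dpi in color is about twelve times that.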

The files are then backed up to a Network Attached Storage (NAS) device, which students access via a symbolic link in their home directories on the iSchool server. The TIF image files can be transferred using the Secure Shell File Transfer client, but the ABBYY project file MUST be compressed as a .zip file before transferring. Saving the files to the NAS allows students to continue working on these files from computers outside the 1.210A Computer Classroom.
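
In the video the project is zipped with right-click > Send to > Compressed (zipped) folder. For anyone who prefers to script that step, the following is a rough Python equivalent using only the standard library. The function name is hypothetical, and the sketch assumes the ABBYY project lives in a single folder (for example one named e_toi_0205a).

```python
import shutil
from pathlib import Path


def zip_project(project_dir: str) -> Path:
    """Create <project_dir>.zip next to the project folder and return its path."""
    project = Path(project_dir).resolve()
    if not project.is_dir():
        raise NotADirectoryError(f"{project} is not a folder")
    archive = shutil.make_archive(
        base_name=str(project),        # written as <folder name>.zip beside the folder
        format="zip",
        root_dir=str(project.parent),  # paths in the archive are relative to this
        base_dir=project.name,         # keep the top-level folder inside the archive
    )
    return Path(archive)
```

Calling `zip_project` on the project folder should leave a .zip file beside it, which is the file to transfer to the NAS; the original folder is not modified.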

ABBYY FineReader 11 is fully capable of scanning, analyzing, and reading a document in a single step, but for this project, we needed to scan and return a large quantity of archival documents, so the setting in ABBYY to scan and read was turned off.

After the project is backed up to the NAS, it is returned to a computer running ABBYY, and the pages are "read". Here is where the fun begins! Computers can be really dumb: they are incredible at quickly matching patterns, but cannot really think for themselves. Here is one example. Manual typewriters are precision mechanical instruments that transmit the mechanical force applied by a human finger through a series of levers and hinges, so that a metal arm with a letter stamped on its end strikes an ink ribbon, then a piece of paper, and then a rubber roller. With prolonged use, the mechanical components wear, changing the spacing between letters, the alignment, etc. In fact, each typewriter has a nearly unique "fingerprint" of sorts: it is often possible to trace typewritten documents back to the typewriter that created them because of these differences.

Here, these differences give ABBYY fits. ABBYY hates the number 4 on these pages, and wear on the typewriters creates several letter combinations that are simply too close together for ABBYY to read correctly. These result in "uncertain characters", where ABBYY is just not sure whether it got things right or not, so you have to decide during the correction process.

Then, there are style conventions used in the transcripts. Two are particularly troubling to ABBYY: the O.- sequence to indicate which person is speaking, and the use of -- to indicate a pause. By altering the settings in ABBYY, you can have it either stop for you to check each uncertain character, or just pass them by. Our goal here is to harvest the PAGE NUMBERS and text from these documents. The PAGE NUMBERS are critical because they tie back to the full index of terms. The two combinations, .- and --, can mostly be ignored.

The OCR process then involves letting ABBYY read the document, adjusting the options in both ABBYY and the spellchecker to correct each page, and then double-checking the page numbers. Once the OCR and spell-check process is complete (and please do correct any mistakes in the transcript), you will save a text-only exact copy of the document, with correct PAGE NUMBERS!!! This text will then be used inside Glifos Social-Media (GSM) to synchronize with the audio file, and it forms the basis of what GSM uses to search the transcripts.
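
Since the page numbers matter so much, a quick scripted check of the exported text can back up the visual check. The Python sketch below is only an illustration: it assumes each page number sits on its own line in the form "Page 4" (the real headings in your export may differ, so adjust the pattern), and that the first page carries no number, as in the transcript shown in the video. The function names are made up.

```python
import re

# Hypothetical heading format: the word "Page" and a number alone on a line.
PAGE_RE = re.compile(r"^[ \t]*page[ \t]+(\d+)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def found_pages(text: str) -> list:
    """Page numbers that were recognized as digits, in document order."""
    return [int(number) for number in PAGE_RE.findall(text)]


def missing_pages(text: str, last_page: int, first_numbered: int = 2) -> list:
    """Page numbers expected in the export but not found as 'Page N' lines."""
    seen = set(found_pages(text))
    return [n for n in range(first_numbered, last_page + 1) if n not in seen]
```

A heading that was misread, such as "Page K" in place of "Page 4", would not match the pattern, so 4 would show up in the list returned by `missing_pages` and can then be corrected by hand in ABBYY.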

Transcript
This tutorial is going to guide you through using Abbyy FineReader 11 to scan and OCR some text documents.
The first thing we're going to do is go to Start > All Programs > Abbyy FineReader 11 and then select Abbyy FineReader 11 here.
There are a number of procedures you can use to scan and OCR using Abbyy. What we're going to do is actually scan all of our pages to image files first, before we do the OCR.
So I'm going to click scan and save image here, and that's going to bring up the Abbyy FineReader dialog box.
We're going to check and make sure Abbyy is set for 300 dpi using grayscale. Our pages are 11 x 8.5 here, and we're going to leave the image preprocessing set at the defaults.
I'm going to select this multipage scanning setting here; this is going to help speed things up tremendously. I'm going to have it pause for 10 seconds after each page.
That's going to allow us to use both hands to feed the scanner, and try to keep up with Abbyy, and get a lot of these pages taken care of.
Now I'm going to insert my first page into the scanner, and I'm going to click Preview here. You'll see Abbyy create a preview scan, and you can see there is a scanner area down at the bottom of the scanner.
We need to make sure this is pulled up to the bottom of the page, but also make sure that we don't cut off this bottom line, and that's one of the reasons that we'll always put our pages at the very top of the scanner as we scan them.
Now that our preview scan is done, and our first page is in all the way at the top, we can go ahead and click Scan here to start the scanning process.
Abbyy will go through and scan the first page, and when it gets to the bottom and returns, you can pull the first page out, put the second page gently on the scanner bed, push it to the top, and Abbyy will begin to scan the second page.
And you can continue this process until you're done with all the pages.
Once you get to the last page, you can wait until Abbyy is done scanning the page, and then you can click Stop to stop the scanning process.
And now it's a good time to take a look and step back through each of the pages, and make sure you scanned all the lines, in all of the text, on each of the pages.
Now that we've finished our scans, we are definitely going to want to save our pages.
I'm going to go here to File, Save FineReader Document, and here I'm going to go create a folder on the desktop, and here's where naming conventions become really important.
I'm going to create a new folder here. Each of these tapes has a 3 digit number associated with it, and we're going to use the naming convention e_toi_ and then a 4 digit number here, usually starting with 0.
So this one is a little bit different, because it's tape 205a, so I'm going to have 0205a here.
Yours probably will not have that a. So I'm going to click okay here, and that's going to make my folder, and then I'm going to open up the folder, and name this document e_toi_0205a.
And I'm going to click save, and this is going to save my Abbyy FineReader 11 project file here.
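
The naming convention just described (e_toi_ followed by the tape number padded to four digits, plus an optional letter such as the "a" in 205a) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the video; the function name is made up, and the exact convention for your tapes is whatever you are given in class.

```python
import re


def project_name(tape_id: str, prefix: str = "e_toi_") -> str:
    """Build a project name such as e_toi_0205a from a tape id such as '205a'."""
    match = re.fullmatch(r"(\d{1,4})([a-z]?)", tape_id.strip().lower())
    if match is None:
        raise ValueError(f"unexpected tape id: {tape_id!r}")
    number, suffix = match.groups()
    # Pad the numeric part to four digits, then re-attach any letter suffix.
    return f"{prefix}{number.zfill(4)}{suffix}"
```

For the tape in the video, `project_name("205a")` gives `e_toi_0205a`, which is used for both the folder and the FineReader document.
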
Now because we're going to need the individual page scans for another project, I'm actually going to go back through here and save another set. This time I'm going to save images.
And I'm also going to save these on the desktop, and I'm also going to create a folder to put them in. And so I'm going to go up here to New Folder, and this one is going to be e_toi_0205a_ and I'm going to call these archival_masters.
These are going to be the archival master images of these page scans.
So I'm going to double-click this, so I'm saving inside this folder, and I'm going to use the same file name e_toi_0205a.
And here we need to select our file type. If you'll remember, we actually scanned these in grayscale, yet Abbyy is trying to save these images as color, and that's going to take up some excess disk space.
If we look here, there's lots and lots and lots of choices, but we're just going to want to move up a few slots here and save to tiff, gray, LZW compression.
And that's going to save our images pretty much exactly the way they came off the scanner.
So now I'm going to click save here, to save these page images.
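
If you want to confirm afterwards that a saved page really is a grayscale, LZW-compressed TIF, the file's own tags can be read with a short Python script. This is a minimal sketch using only the standard library: the tag numbers and values come from the TIFF 6.0 specification, the function names are made up, and it looks only at the first image in the file.

```python
import struct

# Tag numbers and values as defined in the TIFF 6.0 specification.
TAG_COMPRESSION = 259
TAG_PHOTOMETRIC = 262
TAG_SAMPLES_PER_PIXEL = 277
COMPRESSION_LZW = 5
GRAY_PHOTOMETRICS = (0, 1)  # WhiteIsZero, BlackIsZero


def read_short_tags(data: bytes) -> dict:
    """Return {tag: value} for single SHORT entries in the first image directory."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if field_type == 3 and count == 1:  # a single SHORT is stored inline
            (tags[tag],) = struct.unpack(endian + "H", data[start + 8:start + 10])
    return tags


def is_gray_lzw(data: bytes) -> bool:
    """True if the first image is single-channel gray with LZW compression."""
    tags = read_short_tags(data)
    return (
        tags.get(TAG_COMPRESSION) == COMPRESSION_LZW
        and tags.get(TAG_PHOTOMETRIC) in GRAY_PHOTOMETRICS
        and tags.get(TAG_SAMPLES_PER_PIXEL, 1) == 1
    )
```

Pass the function the full contents of one saved page image, read in binary mode; the whole file is needed because the tag directory is often stored after the image data.
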
So now if we look on our desktop, I should have 2 folders here. Here's my project folder, excuse me, this is my archival master folder, that has all the documents in here.
And you'll see, Abbyy has gone through and added a page number to each of these. So this is the tape name, and this is the first page, second page, third page, fourth page, etc.
And now if I look in the other folder here, what I'll see is my Abbyy project folder. This is an ABBYY FineReader icon here, associated with this project folder.
So now I essentially have this project saved in 2 places, and we'll be using one for one purpose, and one for the other.
For the time being, we can leave these 2 folders on the desktop of the computer that you're logged into in the computer classroom, but we'll soon be moving these to a network attached storage space.
And for that, you will need an iSchool username and password.
Now that we've created our archival master scans and our Abbyy project file, let's look at what we need to do to start saving them, forever.
The first thing is, here are our archival masters here, and if you look closely these are TIF files, Tagged Image File Format files.
These files should transfer without a lot of problems.
Now on the other hand, the Abbyy project file itself, this particular ABBYY file icon and the file associated with it, if you will look here, I'm going to right-click on it and look at Properties.
You'll see that this is a pretty big file already, and inside of this is pretty much all of the materials and the scans that we have.
As I've stated before, we are saving this in 2 places for safety.
The trick becomes, this is a proprietary file format, that does not survive file transfer very nicely at all.
In order to transfer this file to our network attached storage device and move it to other computers, we need to do something special with it.
We actually need to compress it. What we're going to do here is right-click, and send to a compressed or zipped folder.
And what that's going to do is go through and actually compress this file and all of the custom pieces and parts of it into a format that we can move around without a lot of problems.
So what you should end up with is another folder right here that looks like this. And if you mouse over it, you should see this is a compressed zip folder.
Notice it's a little bit smaller, sometimes considerably smaller, but in this case, well, oops wrong file here, this is 42 megs, and this compressed it down to 38 megs.
Not all that much, but the most important thing is this makes the Abbyy project folder movable across a UNIX network attached storage system.
And that's a really important thing for us to do here, and it will also allow you to move your files outside of the computer classroom and into the main lab or other places to work on them successfully.
Now we have some files and folders that we need to save from our computer in the computer classroom to the network attached storage space that's attached to our iSchool account.
So, if you've got an iSchool username and password, what we'll do is go to Start, All Programs, SSH Secure Shell, and the Secure Shell File Transfer Client here.
Once the program is open, we'll need to hit Enter or the space bar in order to connect.
Our hostname here is ftp.ischool.utexas.edu, and now we're going to enter our username here. And we are going to click Connect.
And if this is the first time we've connected, we'll select Yes here. And we'll put in our password, the password to your iSchool account, and then we're going to click OK.
Now my account here is a big mess, but what I'm looking for, and what you should see, is this link right here to sod_fall_2012.
I'm going to double-click on this, and inside this folder, what you have just traversed is a symbolic link into 10 GB of network attached storage space to support this class.
Please just use this space for the SOD class, and not for your online movie collection.
The first thing we will need to do is set up a directory here. I'm going to click New Folder, and I'm going to call this folder text, because that's what we're working with here.
Once I create that folder I'm going to double-click on it, open it up, and then I'm going to go over and drag this zipped ABBYY folder over into this space.
And it will be moved up, and then I'm going to grab this archival_masters folder here, and pull it over here. I can also do the same thing right here as well.
And what that's doing is backing up all the files and folders that we created with our scanning to the network attached storage space.
You'll repeat this process throughout the course to back up your files. Just keep in mind that if you pull this zip file down to another computer and work on it,
it's going to be this zip file that you need to keep up with and move back to your computer in the computer classroom to continue working on it.
Because the version in the classroom will be an older version.
The network attached storage space is also where we'll be collecting your files for grading.
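
Because copies of the zip file can end up in several places, it helps to check which one was changed most recently before continuing. A minimal Python sketch of that check follows. The function name is hypothetical, and the result is only a hint: file transfer programs do not always preserve modification times, so a freshly downloaded copy can look newer than it really is.

```python
from pathlib import Path


def newest_copy(*paths: str) -> Path:
    """Return whichever of the given files was modified most recently."""
    existing = [Path(p) for p in paths if Path(p).exists()]
    if not existing:
        raise FileNotFoundError("none of the given copies exist")
    # st_mtime is the last-modified time in seconds since the epoch.
    return max(existing, key=lambda p: p.stat().st_mtime)
```

Give it the path of the copy on the classroom desktop and the path of the copy pulled down from the NAS, and continue working from whichever one it returns.
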
Now let's scan some documents that are going to be a little more difficult to OCR.
These are the documents from the Hoccleve archive, and there are 2 types of these.
We'll go ahead and click Scan here. The first document is going to be the collation tables that were used in the transcription of this piece.
And for these, we're going to bump this up to 600 dpi, because of the smaller text, hopefully to make this a little easier to work with.
And we're going to leave this on grayscale here, because these are black, white, and pencil markings that are gray,
and make sure that we have all of these other elements. We'll let it enhance the images for OCR, and see how that works out as well.
And for these, I would really encourage you to maybe set your time a bit longer than this. I'm only going to do a single document here, and see how it works.
I'm going to click Preview here, to preview it. And as you can see, these documents are edge to edge, and margin to margin.
What we're going to be concerned with here is making sure we get all the information.
We don't know how well Abbyy is going to be able to do with this, but right now we are just scanning for the page images and trying to get a good scan of each of these.
We'll also be paying close attention to this number here, which is a unique identifier. We'll be assigning you a naming convention in class, but for now I'm going to just use this page number when we get to that point.
So once I've got my margins set, we can go and click Scan here.
Once it's done, I'm going to go ahead and close this window.
If you end up with something that looks like this, Abbyy has gone ahead and tried to OCR this document.
That wasn't what I asked Abbyy to do.
I can go up here to Tools > Options, Scan/Open, and I'm going to set it to "Do not read and analyze acquired page images automatically".
And then I'm going to click OK. And I'm going to go back up here and click Scan again.
And I'm actually going to go right click over here, and have it delete that page.
So now it shouldn't try to do that for us, and I'm going to go ahead and scan this again.
You'll notice that scanning at 600 dpi takes a little bit longer.
Once Abbyy is completed right here, we can close this, and this is our resulting scanned image.
I'm going to click down here to actually view this zoomed image, because what I'm going to be concerned about as part of the filename is going to be this page number right here, 3627.
It may not actually be a page number, but it's one of the unique identifiers that will be used in the naming convention you will be given in class.
So the next thing I want to do here is save this. So the first thing I'm going to do is go up here and save this FineReader document. Again, you will be given this naming convention in class, but this document is going to contain all the pages.
So I'm going to name this collation_pages, and then click okay here. And then inside this folder, I'm going to put this 3627, and save that as well.
So that saved my Abbyy FineReader document. The next thing I'm going to do is actually save the page scans.
So I'm going to select Save Images here, and this is going to put it on the desktop, but I would like to have a folder to put it in, so I'm going to create another folder here.
And I'm just going to call this collation_pages_archival_masters.
And I'm going to double-click and open that up, and then put in this 3627 here to indicate this particular page I'm working with. Our compression on this is going to be tiff, gray, LZW compression right here, and I'm going to click Save, to save this page.
There's another type of document we're going to be working with as well. So I'm going to go up here and start a new task here; it's also going to be Scan and Save Image.
That's going to be bringing up my scanning window here again, but this time we're going to be scanning the actual manuscripts themselves, that were in a dot matrix format line by line.
So this, because it's in pretty good shape, we're going to go back down to 300 dpi grayscale on these particular pages.
And here again, you might increase the pause after each page, so you can handle these a little more carefully, but they're actually in pretty good shape.
Then I'm going to click Scan here, and since I'm only scanning this one page as an example, I'm just going to go ahead and close this, and it's going to ask me if I want to save this file.
And this is a little bit different, actually it's not, it's trying to save this image that I just scanned, but I don't want to save it in the same folder because it's a different thing.
So I'm going to go up here, create a new folder called manuscript, going to double-click on it. Again, we'll be giving you the file naming convention for this; I'm going to call this temporarily just GreetX2,
and this we're going to be saving as tif, gray, LZW compression. I'm going to click Save here.
And you can see the page image we're going to be working with here as well. Now in addition, I also need to save my FineReader document.
So I'm going to go File, wait a minute, Cancel, File, Save FineReader Document. I'm going to give this document a name, call it greetx2, and I'm going to click Save, and that's going to save my Abbyy FineReader document as well.
And we'll return to this file when we go to do OCR on these documents.
Now let's go ahead and make our image scans for our last task this semester. I'm going to go ahead and select Task here.
This time we're going to be scanning and saving an image again.
This time we're going to be scanning newspaper clippings from the New York Journal American, and these are really small type, and they have different colors in them.
So we're going to make some changes here. We're going to go with 600 dpi, and we're going to do these in color, and we're going to leave the rest of these the way they are.
And I'm really going to recommend that you scan these very slowly, if not one by one,
You're going to have to be going through the document folders to see what will even fit on the scanner anyway, so it's doubtful we're going to be able to use this feature here.
I'm going to go and click preview, once our preview scan is done, we can go ahead and pull in our margins, and make sure that we get the whole piece here.
As you can tell, this is some old newspaper. Interestingly enough, this is from September 11, 1941.
Once we've got our borders pulled in here, then we can go ahead into our full scan.
And again, this scan will probably take a little longer. Once we're done, we can click Close here.
We'll be supplying you with the naming convention in class, for now, I'm going to go up one level here, and see this is trying to save my image.
What I need to do here is create a new folder, and what I know here is this is the New York Journal American, what I'm going to put here is archival_masters.
I'm going to double-click here, and my actual filename is going to be a little more difficult; for right now I'm just going to call this New York Journal American_9_11_1941.
And this one, because we scanned it in color, we're definitely going to leave as color tif, LZW compression. I'm going to click Save, to save the scanned newspaper clipping.
Once my clipping is saved, I'm going to want to go up and save my FineReader document. Here I'm going to create another folder, and call this the New York Journal American_Abbyy_scans.
That way I'll know what they are. And then I'll go inside that folder, and I'll do my nyja_9_11_1941.
This won't be our final naming convention, but we will give you that in class.
I'm going to click save here, and now I've got my New York Journal American scanned newspaper article that's ready to be OCRed.
Now let's begin the OCR process with Abbyy FineReader. The first thing I want to do is pull down my files off of network attached storage.
So I'm going to go to Secure Shell File Transfer Client.
I'm going to press the space bar or the Enter key.
Enter in ftp.ischool.utexas.edu and my username.
Now my password.
And now I'm going to look for the sod_fall_2012 link; mine's kind of messy, yours should be easier.
So I'm going to go into my text folder here, and I'm going to look for my zip file.
I'm not worried about the archival masters here, just this zip file.
I'm going to right-click here, and go to download, to download this file.
And secure shell should download the file.
So here's my downloaded file, and I can go ahead and close Secure Shell.
And now what I'm going to do is right-click, and decompress this file, by going to Extract All.
As you can see, this is going to put it on the desktop here. So I am going to extract, and here's my Abbyy project file that was extracted from this compressed zip folder on the NAS.
The next thing we want to do now is open up Abbyy FineReader.
I'm going to navigate to the desktop, and open up, not the archival masters, but the folder.
Find this icon right here, click, and open it.
And now we should have our original scanned Abbyy project folder returned to Abbyy FineReader from the NAS.
Now let's begin the OCR process.
I'm going to pull this down a little bit here, so this is the actual file that we scanned earlier.
And now what we're going to do is perform optical character recognition on it.
And in order to do that, what I'm going to do is go over to the Document menu here and select Read.
And Abbyy is going to go through and perform optical character recognition on the 7 scanned pages.
And one of the first things you're going to notice, if you don't pay attention to this little trick, is Abbyy has kind of thrown things where it wanted to.
We need to actually change this to exact copy.
And once I do that, I can go back up and make ABBYY read the document again.
It's going to say I have already done that, click OK to re-recognize the pages.
I'm going to click okay here.
And Abbyy will reread the pages with Exact Copy selected.
And you'll see the document looks much closer now.
You will also see our pages have acquired an icon over here, that means that they have been recognized using an OCR process.
Based on our experience with these transcripts, we're going to make one other important change to this process.
We're going to go up to Tools > Options, and here is Scan/Open; this is where we set it to not read and analyze the pages when we're actually scanning them.
This time what we want to do is go to View here.
And in this text window, you can see that we have it highlighting uncertain characters and non-dictionary words.
You better believe there are a lot of uncertain characters in these stories.
So we're going to deselect this, and click okay.
And once again, we're going to have Abbyy read it again. It's going to complain; we're going to say OK.
We're going to have it go through and reread the pages, this time ignoring uncertain characters.
And you should see a little less blue over here.
We're going to make sure we're on Exact Copy, so now we're ready to start looking at what Abbyy did.
Abbyy uses different regions to do the OCR, so here it's drawn two text regions to do the OCR work on this page.
One of the things we can do is start using Abbyy's interface to help us a little bit.
If I move this up, we'll get a closer view of what Abbyy has been working on right here.
So you can see this region is for Pioneers in Texas Oil, and the lower region is what Abbyy has actually OCRed.
The other thing I'm going to do in the Abbyy interface here is I'm going to select Fit to Width, right here, or actually Best Fit.
Well, best fit, I don't think so.
Let's try fit to width.
And that gives us a little better overview of what Abbyy is doing as we scroll down through here.
The next thing we're going to do is go through the verification process in the OCR.
So I'm going to click Verification here, and Abbyy is going to begin the verification process.
So right here, it's a little uncertain about this C:, but it seems to have gotten it right here, so all we're going to do is click Ignore.
And it's going to go to Fittstown right here. It's going to suggest Fitts town, 2 words, but as we can see it's just one word, and spelled correctly,
so I'm going to do Ignore right here.
Here's our E:, it's a little uncertain about the character but got it right, so we're going to ignore.
It's wanting to know if that's a small a, and it is, so we're going to ignore.
The IT is throwing it here, the letters are coming close together; we're going to ignore that.
The 205– is throwing it; that looks okay, we'll ignore that.
E: is there, ignore; it got the 5 right, so we're going to ignore the :.
I'm not quite sure what it's got at this place here, so we're going to go ahead and ignore this.
And the ()?
The S is okay here, we'll ignore; the I is fine.
Here is a very high period that it's trying to replace with an asterisk.
We're going to go ahead and delete this, and put in the period, and go Confirm.
And here's our ER, it looks fine, and the I there is okay.
We've got a high period here again, so I'm going to backspace, put the period in, and confirm.
1927 is fine, we'll ignore that.
It's definitely a G, we'll ignore that; that's a we, it got that.
Rhodes S. is fine, eat is fine, eat, it's fine; there's no apostrophe T, ignore.
's, ignore; that's an I, we can ignore that; 1937 has another one of those high periods, so we'll go here and delete the period.
Confirm that; the T is fine; 1934 here, we'll just clean that up right quick, and confirm.
The "it" is fine; this is "it–" here; remember, this was typewriter days, so we'll confirm that.
Here's our Fittstown here; we'll ignore that and leave it as one word.
Notice we've changed from unrecognized words to not in dictionary. That's a W, we'll ignore that, that's fine.
That's "it.", we'll ignore; Delaney, we can ignore that, it's fine.
ECO will, it's not recognizing because they're so close together there, but that's fine.
ATI ON, ATI all is fine, there's the I, and that's fine too.
Fittstown again; now that we've seen this a time or 2, we can say Ignore All on Fittstown.
Here we have a compound word that it doesn't understand. Because I know we're going to go into a place where we're not going to need hyphenated compound words, I'm going to go ahead and replace this with one word right there.
Here's our S., we can ignore that,
and we can ignore that.
And there's the I, we can ignore it, and there's an apostrophe, we can ignore that.
The W looks fine, the EE– looks fine, and I really can't tell where this is located, so I'm going to ignore it.
Now you can see we're starting on page 2. This IKE is really a capital I, and we're going to confirm that.
It's a little uncertain about the I, we'll ignore that. DS was recognized correctly, so was the U, so was the T–, so was the S.
The we is correct. If you'll notice here, and we're watching closely, this doesn't say "hove", it says "have", so that's where the OCR just plain failed.
Here we have an E, we're going to confirm; we have an EL, we're going to ignore that; we have a .–, we'll ignore that; .– again, we'll ignore that.
And we'll ignore those 2.
Abbyy seems to be doing a pretty good job with these uncertain characters. Let's take a look at these options right here.
What we're going to do is have Abbyy quit stopping at words with uncertain characters, and see how that works for a little while.
I'm going to click OK here. Now we should see more red than blue, so here's an EE, and H, that looks good.
Here's our capital E again, H, W, Delaney; we know Delaney's okay, we've seen it before, so let's do an Ignore All here.
Lots of man hours to do the work "thought"; I think there's a good chance that it's "that", so we'll replace.
Attention, one word hyphenated, but go ahead and replace that.
Ahh, here's a good one. This happens a lot; this 4 here is tough.
Let's see if we can actually add this 4 here into the custom dictionary.
Okay, men, that got men again, ignore that.
And, let's see here, it's got 2 dashes; I think we're going to ignore that one as well.
Bessemers, Bessemer is correct, we will ignore all those.
We have an EH, we're going to ignore.
Here's the H, the E, H; we're going to start ignoring all on that one, and we're going to start ignoring all on that one.
T-H-A-N, look down here, we're going to replace, and the spell check is complete.
Notice we didn't end in verification but in a spell check; that's just a change between Abbyy 10 and Abbyy 11.
Either way, we've verified and checked spelling on Abbyy's OCR work.
So I'm going to click okay here, and now if you'll notice each of these pages now has an icon with a check on it, that means it's been recognized and verified, or spell checked.
Now let's double-check Abbyy's OCR for something that's really important to these pages, which is the page number.
The page number of these is tied to the cumulative index, so we need to make sure that the page number stays with the text that was recognized.
So I'm going to look back through here, and make sure that each of these pages has the page number.
But this looks a little strange right there. I don't know what Abbyy is recognizing that as, but if we go ahead and increase the size of this right here, so I can actually see what's going on,
I should be able to see what Abbyy put there, and I don't think it's page K.
What I'm going to do now is see if Abbyy will reread it properly. I'm going to have it read that area, and it still got it wrong.
So I'm going to go back up here, see if I can correct this, to page 4.
And then I'm going to step back to the other pages like page 3, and see if I can zoom in a little bit here, move over, and let's just check our page numbers.
That's 3, and let's zoom in here again; that's page 2, it's looking okay. There won't be one on page 1.
Let's check 5, page 5, but again we need to see what the OCR version looks like; it got 5 right.
6, let's zoom in, 6, and then we'll check 7, and it looks like it got page 7 okay.
So again, please pay special attention to the page numbers of these, and make sure Abbyy gets the page numbers correct.
Now that we've checked all of our pages for page numbers, the next thing we need to do is actually harvest this text, so we can match it up with the audio files.
And the way we're going to do that is we're going to select one of these pages over here,
and we're going to go to Edit, Select All, and then we're going to go here to File, and we're going to save this document as a text document.
And what we're going to want to do is make this the simplest document we possibly can.
So I'm going to put this on the desktop, this is what the filename is going to be called, and we're going to create a single file for all of the pages, excuse me.
And then we're going to go under Options right here, and we're going to maintain our line breaks here.
And then once we do that, we're going to click OK. And now we're going to click Save, to save the OCR text as a text document.
You should end up with a document that looks like this.
The main thing we're concerned about here is these headings, Pioneers in Texas Oil, and P2 or page 2.
We need to make sure that we can tell the difference between the pages of this, and that it's in plain text.
And so this document looks pretty good, and we should be able to use it to paste into GLIFOS, for the next part of this project.
The last thing you need to do here, after you close up this document and go ahead and close ABBYY FineReader,
is to make sure that you take this text document that you just created and put it on your server space on the NAS so we can access it outside of the classroom for grading.
Table of contents
Scan and Save Image settings
300 dpi grayscale
multi-page scanning, pause for 10 seconds
Scan images
Preview scan
Place document at top of scanner
Scanning workflow
Review scanned pages
Saving ABBYY project folder
Naming convention- e_toi_0tape_number
Save ABBYY project folder
Save page images
Image folder naming convention- archival_masters
Image file type selection- tiff, gray, LZW compression
Review saved folders and files
Saving project files and folders
Compressing (zipping) ABBYY project file
Right-click, send to compressed or zipped folder
File size comparison
Using SSH Secure Shell to move files to your NAS space.
Open SSH Secure Shell
Hostname is ftp.ischool.utexas.edu
Locating sod_fall_2012 directory
Setup "text" directory
Move project folders
Warning about folder versioning and grading
Scan collation tables
600 dpi, grayscale
Scan manuscript pages
300 dpi, grayscale
Scan NYJA newspaper clippings
600 dpi, color
Set scanner area
Optical Character Recognition (OCR)
Retrieve files from NAS
Download zip file
Decompress, unzip, extract all for zipped file
Open file using ABBYY FineReader 11
Begin OCR
Read pages
Use "Exact Copy" and re-read
Tools>Options>View, de-select "highlight uncertain characters and non-dictionary words", and re-read
Review OCR regions
Adjust ABBYY interface
Fit to width
Verification of OCRed text
Verify page numbers
Export OCRed text as a text file
OCR of Hoccleve Manuscript page
OCR of Hoccleve Collation page
OCR of NYJA Newspaper clipping