Sinclair Heaven and issue hosting and OCRing
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
*Gets over the shock of joining a C64 forum*
Hi folks,
I will be loading the first few issues of C&VG onto the server this afternoon. There is a page viewer which errmm views the pages.
If you become a member, there is also a PDF download which is currently just the images in one file. I have a project running to do proper PDFs of magazines so they are searchable, but this is a huge task, so if anyone wants to help out, you are more than welcome. Just check the WoS forum under Preservation.
If anyone would like other magazines hosted to preserve bandwidth, drop me a mail and I'll see what I can do. My site is for Sinclair stuff, but I see no reason not to put others on there.
Please be gentle with the downloads (i.e. don't do them all at once!). Bandwidth is not an issue, but it slows down other people's downloads.
You can see the site at http://www.sinclair-heaven.net
It's still being developed, so be patient!
Excuse my French but that's some amazing shit! How the hell do you have access to over 25GB of space and no bandwidth issues??!
Regarding PDFs, I assume at the moment they are just a collection of image files?
To make them searchable you'd have to OCR them, which is easier on the older issues before they started putting mad colourful design behind the text.
It would be great to have all the issues OCRed and searchable, but it's a massive job since OCR software hasn't quite got human-like AI at the moment.
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
I have a project running at the moment to OCR all magazines on my site. I'm starting with reviews as it gives people something to aim for. It's pretty boring just OCRing pages, and it takes so long that people are likely to just give up. So at the moment, they are just PDFs of the scans. As the site is around 90% PHP scripted, it takes a while to do the PDF, then move everything into the correct folders so the scripts work as intended. I tend to do around 10 a day.
I feel sorry for Mort as it's even more of a pain to scan them in the first place! At least the OCR software can read his scans.
As for the bandwidth - you might notice my site is a bit slow. I had 53 magazine downloads today, and at about 30MB per magazine, it can slow things down. I am hosting from one of my home PCs, so the only restriction I have is NTL's upload speed. It's a bit crap, but it's free!
I have just installed a 200GB SATA drive, so space is never going to be an issue. This is why I have offered loads of people sections on the site (gambase & Sam Coupe being two of them). It pads the site out & all I need to do is drop the files in.
Like I said - the site is still being developed & some of the scripts I am developing are also becoming commercial, so I need to concentrate on those first. In particular, my stats package & forum. It just takes ages to write them & test them (thank god for WoS users!)
Lastly - you have a strange grasp of French. It looks English to me. Unless this forum has a babel converter. (Now there's an idea!)
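For a feel of what those download numbers mean on a home connection, here is a rough back-of-envelope sketch. The 53 downloads and 30MB per magazine come from the post; the 256 kbit/s uplink is purely an assumed figure for illustration, since the actual NTL upload speed isn't given.

```python
# Back-of-envelope upload estimate for a home-hosted magazine archive.
# Download count and file size are from the post; the uplink speed is an
# ASSUMPTION for illustration only.

def daily_volume_mb(downloads: int, size_mb: float) -> float:
    """Total data served in a day, in megabytes."""
    return downloads * size_mb

def transfer_minutes(size_mb: float, uplink_kbit_s: float) -> float:
    """Minutes to upload one file, taking 1 MB as 8000 kbit."""
    seconds = size_mb * 8000 / uplink_kbit_s
    return seconds / 60

volume = daily_volume_mb(53, 30)       # magazines served today
minutes = transfer_minutes(30, 256)    # 256 kbit/s is assumed, not stated

print(f"Served today: {volume:.0f} MB")
print(f"One magazine at 256 kbit/s: {minutes:.1f} minutes")
```

If several people download at once they share that uplink, which is why everything slows down even though there is no bandwidth cap.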
fogartylee wrote: At least the OCR software can read his scans.
It can?!! The OCR software I am/was using, TextBridge, couldn't make out his scans at 957xwhatever resolution. I found I needed to rescan the pages at around 2000x4000 or something to get a clean run at OCRing them. Very boring job alright! Maybe we could outsource it to India or something? Did they get English copies of Zzap there in the 80s?
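To put those pixel widths into scanner terms, here is a small conversion sketch between pixel width and dpi. It assumes the magazine page is roughly A4 (210 mm wide), which nobody in the thread actually states, so treat the results as ballpark figures only.

```python
# Convert between scan width in pixels and scanner dpi.
# ASSUMPTION: the page is about A4 width (210 mm); real magazine
# trim sizes vary, so these are rough figures.

MM_PER_INCH = 25.4
A4_WIDTH_MM = 210.0  # assumed page width

def effective_dpi(pixels_wide: int, page_width_mm: float) -> float:
    """Dots per inch implied by a scan that is pixels_wide across the page."""
    return pixels_wide / (page_width_mm / MM_PER_INCH)

def pixels_at_dpi(dpi: float, page_width_mm: float) -> int:
    """Scan width in pixels when the page is scanned at the given dpi."""
    return round(dpi * page_width_mm / MM_PER_INCH)

low = effective_dpi(957, A4_WIDTH_MM)    # the reading-quality scans
high = effective_dpi(2000, A4_WIDTH_MM)  # the rescans that OCRed cleanly

print(f"957 px wide is about {low:.0f} dpi")
print(f"2000 px wide is about {high:.0f} dpi")
print(f"300 dpi gives about {pixels_at_dpi(300, A4_WIDTH_MM)} px wide")
```

Under that assumption the 957-pixel scans work out well below the resolution of the rescans, which would fit with the OCR struggling on them.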
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
I'm using OmniPage Pro 14, and it's great. I need to reformat some pages afterwards, but can't complain. I've seen the same results with TextBridge & it's not that different. Have you tried saving the output to a Word document?
I can't speak for your strange magazines, but the Sinclair ones are easy enough.
By the way, I guess there is a conspiracy with Commodore & Google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!
- Lloyd Mangram
- King of Ludlow
- Posts: 1152
- Joined: Thu Jun 19, 2003 10:22 pm
- Location: Ludlow
- Contact:
fogartylee wrote: By the way, I guess there is a conspiracy with commodore & google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!
That's because Speccies suck, of course, and Commodore rules.
Once again I emerge from beneath a massive pile of paper which makes my desk groan to bring you the world’s most amazing posts.
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
fogartylee wrote: I'm using omnipage pro 14, and its great. I need to reformat some pages after, but can't complain. I've seen the same results with textbridge & its not that different. Have you tried saving the output to a word document?
I can't speak for your strange magazines, but the sinclair ones are easy enough.
While Mort's scans are fine for reading with the human eye, they don't have enough detail for OCRing; the output text comes out with a LOAD of errors. But if I rescan the page at a higher resolution and then OCR it, there are far fewer errors. As regards later issues with lots of crappy multicolour backgrounds... it's a dead loss.
Do you photoshop the scans before you OCR them?
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
No. These are 'raw' results from one of Mort's scans. I didn't do these by the way, but I have done the OmniPage one and got the same results:
http://www.sinclair-heaven.net/crash_omnipage.zip
http://www.sinclair-heaven.net/crash_textbridge.zip
These were both cut n paste jobs into a Word document.
- Lloyd Mangram
- King of Ludlow
- Posts: 1152
- Joined: Thu Jun 19, 2003 10:22 pm
- Location: Ludlow
- Contact:
fogartylee wrote: No. These are 'raw' results from one of Morts scans. I didn't do these by the way, but I have done the omnipage one and got the same results:
http://www.sinclair-heaven.net/crash_omnipage.zip
http://www.sinclair-heaven.net/crash_textbridge.zip
These were both cut n paste jobs into a word document.
That looks quite interesting.
Imagine all Crash & Zzap (etc) text in a database, together with a search engine, now that would be ideal!
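As a very rough idea of what such a searchable text store could look like, here is a minimal inverted-index sketch. The page IDs and the sample text are made up for illustration, and a real setup would more likely use a proper database or full-text search engine; this only shows the basic principle of mapping words to the pages they appear on.

```python
# Minimal inverted index over OCRed page text.
# Page IDs and sample text are HYPOTHETICAL placeholders, not real OCR output.
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of page IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

pages = {
    "crash-01-p12": "Review of a platform game with smooth scrolling",
    "zzap-05-p30": "Review of a shoot em up with fast scrolling",
}
index = build_index(pages)
print(sorted(search(index, "scrolling review")))
print(sorted(search(index, "platform")))
```

OCR errors would hurt a scheme like this, since a misread word simply never matches, so the quality of the OCR pass matters more than the search code.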
-
- Techno Teaboy
- Posts: 16
- Joined: Sat Mar 05, 2005 12:40 pm
- Location: Nottingham
- Contact:
I'm planning on doing some scanning and OCRing in the near future of various '80s and '90s magazines I have at hand (Zzap!64, PCG, CD32, The One, Amiga Format etc.) and was after some general advice on scanning/OCRing please.
Space is no issue for me (large HDD). I'd just like to keep the 'best' quality copies I can, as well as have them easily searchable. Any suggestions most welcome re: procedure, settings and software.
Well, for OCRing, I use TextBridge Pro. I scan the pages at 300dpi or so to give a horizontal resolution of over 2000 pixels.
Coloured backgrounds, and especially changing backgrounds, can really screw up the OCRing, so sometimes I have to use Paint Shop Pro to colour replace them with plain white.
Then it's time to actually do the OCRing, which by this stage is fairly painless, although it takes a while to format the output text.
It's a very slow, boring process unfortunately, but it's worth it in the end I guess!
Feel free to OCR any Zzap stuff for this site! Just make sure it hasn't been done already first.
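The colour-replace step described above can be sketched in code as well. This is not what Paint Shop Pro does internally, just a simple illustration of the idea: any pixel close enough to a chosen background colour is turned white, and everything else (hopefully the text) is left alone. The tolerance value and the tiny sample "page" are made up for the example.

```python
# Replace a coloured background with white before OCR.
# A plain-Python illustration; the tolerance and sample pixels are
# HYPOTHETICAL values chosen for the example.

Pixel = tuple[int, int, int]
WHITE: Pixel = (255, 255, 255)

def whiten_background(
    pixels: list[list[Pixel]], background: Pixel, tolerance: int = 40
) -> list[list[Pixel]]:
    """Return a copy where pixels near the background colour become white."""
    def is_background(pixel: Pixel) -> bool:
        # Every RGB channel must be within the tolerance of the background.
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, background))

    return [
        [WHITE if is_background(pixel) else pixel for pixel in row]
        for row in pixels
    ]

# A 2x3 "page": yellowish background with a few dark text pixels.
page = [
    [(250, 220, 60), (245, 215, 70), (10, 10, 10)],
    [(20, 15, 12), (255, 225, 50), (248, 218, 65)],
]
cleaned = whiten_background(page, background=(250, 220, 60))
for row in cleaned:
    print(row)
```

A single background colour only works for flat backgrounds; pages where the background changes across the page would need a different background sample per region, which is where the manual work comes in.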