From: Ruben Safir
Subject: [NYLXS - HANGOUT] OCR Patents
I thought that the OCR software was so patent-encumbered that it was
nearly impossible to release under a free software license.
Technology By Catherine Holahan
Google Seeks Help with Recognition In its quest to create a vast online
library, the search titan has released character-recognition scanning
software to the open-source crowd
Google does not shy away from gargantuan projects. The search giant
is known, after all, for indexing the World Wide Web and mapping the
entire planet in three dimensions. But the company's latest endeavor
may be too big for even the Internet goliath to complete alone.
Google (GOOG) wants to index all the world's printed material for
inclusion in a comprehensive online library. To that end, it launched
Google Books, a service featuring free-to-download classics and excerpts
from copyrighted works, and Google Scholar, a database of academic
and scientific research (see BusinessWeek.com, 8/31/06, "Google Offers
Classics for Free").
On Sept. 6, it added another service to further its goal: Google News
Archive. The new application allows computer users to search back
issues of various publications, such as The Washington Post (WPO), The
New York Times (NYT), and The Wall Street Journal (DJ). Articles from
some journals date as far back as 200 years and, in some cases, must be
purchased from the original publication (see BusinessWeek.com, 9/06/06,
"Google Digs Into the Archives").
Despite its relentless release of virtual library material, however,
Google is asking the greater engineering community for help developing
the technology it needs to index and archive all published works.
COMMUNITY EFFORT. On Aug. 30, Google's "tech lead" Luc Vincent announced
that the company was turning to the tech community for help improving
Optical Character Recognition (OCR) technology, which enables computers
to decipher words in scanned texts. The first step: Google debugged an
old Hewlett-Packard OCR engine, named Tesseract, that HP had released
to university researchers in Nevada. Before that, the application had
sat idle at Hewlett-Packard (HP) since 1995, when the company decided
to leave the OCR business and concentrate on its line of home office
products, computers, printers, and cameras.
Google then released the cleaned version to the open-source
community. Bdale Garbee, chief technologist at Hewlett-Packard, says
the company is pleased others will build off its efforts. "We're happy
to see good code being put to good use," he says, "and we look forward
to seeing where the community takes this technology in the future."
Google is hoping they take the technology far beyond its current
capabilities, says Chris DiBona, Google's open-source program manager. OCR
technology is central to Google's cause because it enables search
engines to "read" documents. Without OCR, the computer sees a scanned
page of print only as an image and cannot find keywords or phrases in
the text. In the search world, OCR means the difference between being
able to find a book only if you know the complete title and being able
to find it if all you know is a few key quotes.
Because it was essentially abandoned, the program's capabilities badly lag
the standards of current commercial OCR engines. Tesseract has trouble
reading gray scale and text with background color, for example. Google,
however, sees promise that the technology community, by tinkering with
formerly proprietary coding within Tesseract, will be able to come up
with some solutions to problems that plague even the paid technology.
LOOKING TO LEAP FORWARD. DiBona says the OCR engines out there
are 99.5% accurate at reading Latin characters, but still have some
trouble with other languages, handwriting, highly stylized fonts, and
unique layouts. In the past, Google has had some problems with blurry or
off-center scans that can sometimes confuse the OCR engines. For example,
a poorly scanned book with blurry characters could prevent the OCR engine
from deciphering the letters and words in a document. Thus, that page
would not be properly indexed by searches (see BusinessWeek.com, 12/22/05,
"Google's Great Works in Progress").
"If you look at OCR over the past 10 years, not much has happened. There
are some programs out there that are pretty good, but we wanted to see
if by putting OCR out there we could improve it," DiBona says, adding
it would be "really good if OCR gets better for everybody."
As more offices began moving from paper to digital, they needed OCR
technology to help computers recognize the text in their scanned documents
and allow them to edit the new digital versions. Over the past three
years, search engines and other online companies expanded the use of
OCR by applying the technology to search, says Robert Weideman, senior
marketing vice-president for Nuance Communications (NUAN), maker of one
of the market-leading OCR engines, OmniPage. Both Google and Amazon
(AMZN), for example, use OCR technology to match search phrases with
specific passages in books.
OCR PROLIFERATION. But who uses OCR outside of online search and
commerce? Well, increasingly, everybody who has ever scanned a document
or read a scanned document. "When you think about who touches on our OCR
technology, it is literally millions of people worldwide -- any industry that
deals with paper uses OCR," Weideman says. Nuance experienced 8% growth
and reaped more than $70 million in revenues from the OCR digital imaging
business last year, Weideman says, which includes PDF conversion software.
Those profits may seem surprising for a technology that, at first, didn't
seem to have many practical applications. When inventor Raymond Kurzweil
created the first OCR system in 1974, he struggled to find a use for it
(see BusinessWeek.com, 5/02/01, "How Ray Kurzweil Keeps Changing the
World"). The mass-market answer eventually came in the form of a scanner.
Nuance has provided OCR technology to Google, though a confidentiality
agreement keeps it from saying whether its OCR systems power Google's
current book search. The company has also supplied Microsoft (MSFT) with
OCR technology for its upcoming XPS program -- a PDF competitor to Adobe's
(ADBE) Acrobat -- which will be included in the new Vista operating system.
FOLLOW THE LEADER. Now OCR is built into many scanners and comes
standard on some computers. It even has become incorporated into cell
phones via software that allows people to take pictures of text, such
as business cards, for example, and then index the pertinent words in
their address books.
Whether Google's open-source release will lead to better OCR technology
and more future users is uncertain. Analysts say Nuance has built such
a big lead in the industry that competitors remain also-rans, unlikely
to contribute major advances any time soon. "For Nuance, they are leaps
and bounds ahead of any of their competitors in the space," says Daniel
Ives, an equity analyst at Friedman, Billings, Ramsey & Co., a Web-based
investment bank based in Arlington, Va. "Nuance dominates the industry,"
adds Jeff Van Rhee, an equity research analyst at Craig-Hallum Capital
in Minneapolis. "I don't see [Tesseract] being a big impact."
Over time, however, Ives sees the demand for OCR and related technology
increasing as more people seek to switch effortlessly between the paper
and digital worlds. "There's no doubt that we believe it is going to be
a very fertile market area," he says.
So fertile, in fact, Google is turning to OCR for help moving the world's
print into a digital archive of everything.

--
__________________________
http://www.mrbrklyn.com http://www.nylxs.com - Leadership Development
in Free Software
So many immigrant groups have swept through our town that Brooklyn, like
Atlantis, reaches mythological proportions in the mind of the world -
RI Safir 1998
DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
"Yeah - I write Free Software...so SUE ME"
"The tremendous problem we face is that we are becoming sharecroppers
to our own cultural heritage -- we need the ability to participate in
our own society."
"> I'm an engineer. I choose the best tool for the job, politics be
You must be a stupid engineer then, because politics and technology have
been attached at the hip since the 1st dynasty in Ancient Egypt.
I guess you missed that one."