I mean, is there a good enough automatic algorithm that can try to guess whether part of the memory contains executable code or if it's just data?
I think you just need to make an educated guess. Many times it's not too hard to see that memory is data.
Well, normally, when I look at the disassembled code, it's also easy:
T26DAh 1A...... .... LD A,(DE)
T26DBh 74...... t... LD (HL),H
T26DCh 34...... 4... INC (HL) ;
T26DDh 73...... s... LD (HL),E
T26DEh 00...... .... NOP
T26DFh 84...... .... ADD A,H
T26E0h 04...... .... INC B ;
T26E1h 83...... .... ADD A,E
But the problem is, how could I try to guess automatically without using something too exotic, like neural networks?
It looks more like data. Storing H at address (HL), then incrementing that value and then overwriting it with E doesn't look like something useful (unless it's something very tricky). Also, that NOP seems quite unnecessary.
Have you tried to run the code in an emulator with a debugger to see if the memory area changes?
Also, you have a decent neural network between your ears. You can use that one
If the hex strings make no sense as code, it is data.
For example, when you read code you will see things like (all in hex):
3A yy yy 32 yy yy
21 xx xx 11 xx xx CD yy yy 2A yy yy
CD yy yy C3 yy yy
D3 I/O DB I/O
ED B0
DD 21 xx xx
FD 21 xx xx
20 rr
18 rr
Things like that. That is to say, you can easily recognize real code by looking for the very common opcodes that are 3 or more bytes long!
Of course, yy yy must be a reasonable 16-bit memory address (in the range of the memory page used for code),
rr is in most cases a small number in two's complement,
and I/O is a port number of a standard MSX device like 98, 99, A0, A1, etc.
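The patterns above could be counted by a rough scorer along these lines. This is only a minimal sketch in Python: the function name, the weights, the "small displacement" limit of 32 and the default page 1 range (4000h to 7FFFh) are all assumptions made for illustration, not something taken from an existing tool.

```python
# Hypothetical scorer for the byte patterns listed above; names and weights are assumptions.
ABS_ADDR_OPCODES = {0x3A, 0x32, 0x2A, 0xCD, 0xC3}  # the "yy yy" opcodes: operand is an address
REL_JUMP_OPCODES = {0x20, 0x18}                    # JR NZ,rr / JR rr
IO_OPCODES = {0xD3, 0xDB}                          # OUT (n),A / IN A,(n)
KNOWN_PORTS = {0x98, 0x99, 0xA0, 0xA1}             # ports named in the post above


def score_code_patterns(data, lo=0x4000, hi=0x7FFF):
    """Count code-looking patterns in data; a higher score means more code-like."""
    score = 0
    i = 0
    while i < len(data):
        op = data[i]
        if op in ABS_ADDR_OPCODES and i + 2 < len(data):
            addr = data[i + 1] | (data[i + 2] << 8)  # little-endian operand
            if lo <= addr <= hi:
                score += 3
                i += 3
                continue
        elif op in REL_JUMP_OPCODES and i + 1 < len(data):
            rr = data[i + 1]
            disp = rr - 256 if rr >= 128 else rr     # two's complement displacement
            if -32 <= disp <= 32:
                score += 1
                i += 2
                continue
        elif op in IO_OPCODES and i + 1 < len(data):
            if data[i + 1] in KNOWN_PORTS:
                score += 2
                i += 2
                continue
        elif op == 0xED and i + 1 < len(data) and data[i + 1] == 0xB0:  # LDIR
            score += 2
            i += 2
            continue
        i += 1                                       # nothing matched: move on one byte
    return score
```

A block that scores near zero is not proven to be data, it just shows none of these particular patterns.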
On the other hand, if you see things like 8x 8x 8x 8x 8x 9x 9x 9x 7x 7x 7x, it is DATA. Normally on the Z80 the graphic and sound data is uncompressed, so just by looking at the hex you can visualize what it is (pattern tiles, sprites, colors, a sound wave, or some format like PT3, MIDI, etc.). To tell them apart, you need to think FIRST about what software you are looking at and what you can expect to find in the box.
If you see hex values that repeat periodically, or consecutive numbers in some range, like
xx 44 xx 43 xx 42 xx 48 xx .......
20 28 21 22 20 1f 1f 18 18 ....
xx xx 03 xx xx 03 xx xx 04 .....
or any pattern like that, they are just tables with records 1, 2 or 3 bytes long, with one or two fields.
if you see
6x 6x 6x 6x 6x 7x 7x 7x 6x 6x 7x 20 ........ it is just TEXT.
and so on.......
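Two of the data checks above (text, and tables with short records) can be sketched like this. It is only an illustration in Python: both function names are made up, and measuring "how little one column changes" is just one assumed way to guess the record length.

```python
def text_ratio(data):
    """Fraction of bytes that are printable ASCII (20h to 7Eh)."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 0x20 <= b <= 0x7E)
    return printable / len(data)


def record_length_guess(data, max_len=3):
    """Guess a table record length: the stride at which some column changes least.

    Returns (record_length, average_step), or (None, None) if data is too short.
    """
    best_len, best_avg = None, None
    for rec_len in range(1, max_len + 1):
        for col in range(rec_len):
            column = data[col::rec_len]          # every rec_len-th byte, starting at col
            if len(column) < 3:
                continue
            steps = [abs(a - b) for a, b in zip(column, column[1:])]
            avg = sum(steps) / len(steps)
            if best_avg is None or avg < best_avg:
                best_len, best_avg = rec_len, avg
    return best_len, best_avg
```

For a dump shaped like "xx 44 xx 43 xx 42 xx 48", the second column moves in small steps, so a record length of 2 comes out on top.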
Pattern tiles, for example, you can recognize because when an alphabet is drawn, there will be a "00" every 8 bytes!
And when it is game graphics, you can see the shape of what it is, because you will have sequences like
03 07 0F 1F 3E 7C .....
or lots of 00 00 00 , or FF FF FF
That is to say, each byte has something to match with the previous byte so as to draw some figure (otherwise it will be just visual noise on the screen, like when you dump random data to VRAM).
Indeed, FF is the hex value most used in data and very seldom in code, because as an opcode FF only means RST 38h, which in games will appear just once.
Otherwise it shows up when a variable's address or a routine's address happens to be xxFFh, which is a 1 in 256 chance for each object (variable or routine) linked. Very uncommon.
In code, "00" is not that uncommon, but you will see it more like:
3E 00
21 00 00
11 00 00
01 00 00
DD 21 00 00
FD 21 00 00
or, after a DD or FD prefix (indexed loads with a displacement of 0):
46 00
56 00
66 00
4E 00
5E 00
6E 00
7E 00
or CD 00 41
CD 00 C0
or any address that programmers frequently use as the start address of some block of code.
Harder to spot is relocatable code; it is very uncommon, really uncommon.
Relocatable code can be identified because things like this appear:
CD xx 0x
C3 xx 0x
Why? Because most software is meant to run in PAGE 1 (addresses 4000h to 7FFFh), and CALL 0xxx or JP 0xxx has no business there. But you can tell that it is code and not something else because you see logical sequences like
21 xx xx CD xx 0x
CD xx 0x C9
CD xx 0x C3 xx 0x
which means that it must be code for sure!
So, near the code segment that you think may be relocatable code, you will find a nice table like
xx 00 xx 00 xx 00 xx 00 xx 00 xx 00 xx 00 xx 0
(cont)
xx 01 xx 01 xx 01 xx 01 xx 02 xx 02 xx 02
That is a table of 16-bit records that the relocatable code needs. Basically each record is an offset pointing to the location of the second byte of a CALL or JP opcode (these are 3 bytes long, and the 2nd and 3rd bytes are the address to jump to).
Before the relocatable code is run, it is copied into RAM, and that table is used to recalculate the final address of each CALL and JP.
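What such a loader does with the table can be sketched as follows. This is a minimal Python illustration under stated assumptions: the code is taken to be assembled for origin 0, the table entries are taken to be little-endian 16-bit offsets from the start of the code, and the name `relocate` is made up.

```python
def relocate(code, reloc_table, base):
    """Return a copy of code with base added to every 16-bit operand listed in the table.

    Assumptions: code was assembled for address 0, and each table entry is a
    little-endian 16-bit offset of the low byte of a CALL/JP address operand.
    """
    out = bytearray(code)
    for n in range(0, len(reloc_table) - 1, 2):
        offset = reloc_table[n] | (reloc_table[n + 1] << 8)  # where the operand sits
        addr = out[offset] | (out[offset + 1] << 8)          # address as assembled
        addr = (addr + base) & 0xFFFF                        # final address in RAM
        out[offset] = addr & 0xFF
        out[offset + 1] = addr >> 8
    return bytes(out)
```

For example, `CD 06 00 C3 00 00 C9` with the table `01 00 04 00` and a base of C000h becomes `CD 06 C0 C3 00 C0 C9`.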
Also, you have a decent neural network between your ears. You can use that one
@dvik, well, actually I just copy/pasted that code to show that when you "see" the code, it's easier to see if it's data or code.
My neural network between my ears is old, but still working!
@flyguille: Thanks, I'll take a look at all your comments and try to come up with something, like analyzing the last 12 bytes and scoring them. Above a defined score = code, lower = data.
Of course it'll always be a guess!
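The "score the last 12 bytes" idea could look roughly like this. It is a sketch in Python only: `simple_score` is a toy scorer made up for the example (a real one would use the patterns discussed earlier in the thread), and the threshold of 4 is an arbitrary assumption.

```python
def simple_score(window):
    """Toy scorer (an assumption): reward a few common opcodes, punish FF bytes."""
    common = {0x21, 0x11, 0x3A, 0x32, 0xCD, 0xC3, 0xC9, 0x3E}
    score = 0
    for b in window:
        if b in common:
            score += 2
        elif b == 0xFF:
            score -= 2
    return score


def classify(data, window=12, threshold=4, score=simple_score):
    """For each byte, score the last `window` bytes ending there and label the position."""
    labels = []
    for i in range(len(data)):
        last = data[max(0, i - window + 1): i + 1]   # shorter near the start of data
        labels.append("code" if score(last) > threshold else "data")
    return labels
```

The first few positions have a short window, so they will tend to come out as "data" until enough bytes have been seen.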
I know an old program named DISZILOG.COM. When you start it with a file you see SUPER CPM DIS-ASSEMBLER VER 4.00.
After you enter the parameters it makes files with a listing, source and cross-reference table.
In the listing you will find >> NO EXECUTION PATH TO HERE << where the code is not program but data or something else.
Find it here: http://www.msxpro.com/aplicativos.html
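I don't know how DISZILOG works internally, but a message like "no execution path to here" suggests following jumps and calls from an entry point and marking whatever is never reached. A rough Python sketch of that idea, with loud assumptions: the names are made up, and DD/FD/ED prefixed instructions are NOT handled (they are treated as 1 byte), so this only illustrates the principle.

```python
JUMPS_ABS = {0xC3} | {0xC2 + 8 * n for n in range(8)}   # JP nn, JP cc,nn
CALLS_ABS = {0xCD} | {0xC4 + 8 * n for n in range(8)}   # CALL nn, CALL cc,nn
JUMPS_REL = {0x10, 0x18, 0x20, 0x28, 0x30, 0x38}        # DJNZ, JR, JR cc
STOPS = {0xC3, 0x18, 0xC9, 0xE9}                        # JP, JR, RET, JP (HL): no fall-through
THREE = {0x01, 0x11, 0x21, 0x31, 0x22, 0x2A, 0x32, 0x3A} | JUMPS_ABS | CALLS_ABS
TWO = ({0x06 + 8 * n for n in range(8)} | {0xC6 + 8 * n for n in range(8)}
       | JUMPS_REL | {0xD3, 0xDB, 0xCB})


def trace(mem, org, entry):
    """Return the set of addresses reached as code when starting at `entry`."""
    reached = set()
    todo = [entry]
    end = org + len(mem)
    while todo:
        pc = todo.pop()
        while org <= pc < end and pc not in reached:
            op = mem[pc - org]
            size = 3 if op in THREE else 2 if op in TWO else 1
            if pc + size > end:                  # instruction runs off the end of the dump
                break
            reached.update(range(pc, pc + size))
            if op in JUMPS_ABS or op in CALLS_ABS:
                todo.append(mem[pc - org + 1] | (mem[pc - org + 2] << 8))
            elif op in JUMPS_REL:
                d = mem[pc - org + 1]
                todo.append(pc + 2 + (d - 256 if d >= 128 else d))
            if op in STOPS:                      # execution cannot continue past this one
                break
            pc += size
    return reached
```

Everything in the dump that is not in the returned set has no execution path from that entry point, which makes it a candidate for data (or for code reached only through jump tables, which this sketch cannot see).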
If we have some address X and we want to know whether it belongs to code or data, the first thing is to find out whether it is an ML command or a parameter to an ML command. This can be done by making a table with the length of every command.
Then you need something like:
; INPUT  HL = BYTE THAT WE ARE INTERESTED IN
; OUTPUT Z-FLAG = CODE BYTE, NZ-FLAG = DATA BYTE
CHECK:
        LD (TEMP),HL
        LD DE,-30               ; START 30 BYTES BEFORE THE TARGET
        ADD HL,DE
.LOOP:
        CALL NEXTCMD
        LD DE,(TEMP)
        RST #20                 ; BIOS DCOMPR: Z IF HL=DE, CARRY IF HL<DE
        RET Z                   ; LANDED EXACTLY ON THE TARGET
        RET NC                  ; WENT PAST THE TARGET (NZ)
        JR .LOOP                ; STILL BEFORE THE TARGET
TEMP:   DW 0

; MOVE HL POINTER TO NEXT COMMAND
NEXTCMD:
        LD B,0
        CALL LENGTH
        ADD HL,BC
        RET

; LENGTH TABLE IN #C000-#C2FF
; INPUT  HL = ADDRESS
; OUTPUT C = LENGTH
LENGTH:
        LD D,#C0
        LD E,(HL)
        LD A,(DE)
        LD C,A
        LD A,E
        CP #FD
        JR Z,.INDEX
        CP #DD
        JR Z,.INDEX
        CP #ED
        RET NZ
        INC D                   ; ED TABLE AT #C200
.INDEX:
        INC D                   ; DD/FD TABLE AT #C100
        INC HL
        LD E,(HL)
        DEC HL
        LD A,(DE)
        LD C,A
        RET
The idea here is that because ML commands have different lengths, you will most likely end up on an opcode byte if you go some bytes back and just add the command lengths together.
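For experimenting on a PC, the same "go back and add up the lengths" idea can be sketched in Python. Instead of a table in memory, the lengths are computed from opcode groups here; this is my own compact approximation of the documented Z80 instruction lengths, and the function names are made up. Odd prefix combinations (for example DD followed by another prefix) are not treated specially.

```python
THREE = ({0x01, 0x11, 0x21, 0x31, 0x22, 0x2A, 0x32, 0x3A, 0xC3, 0xCD}
         | {0xC2 + 8 * n for n in range(8)} | {0xC4 + 8 * n for n in range(8)})
TWO = ({0x06 + 8 * n for n in range(8)} | {0xC6 + 8 * n for n in range(8)}
       | {0x10, 0x18, 0x20, 0x28, 0x30, 0x38, 0xD3, 0xDB})
ED_FOUR = {0x43 + 8 * n for n in range(8)}       # ED xx: LD (nn),rr / LD rr,(nn)
# Opcodes that use (HL); after DD/FD they become (IX+d)/(IY+d) and gain a byte.
USES_HL_MEM = ({0x34, 0x35, 0x36}
               | {0x46 + 8 * n for n in range(8) if n != 6}
               | (set(range(0x70, 0x78)) - {0x76})
               | {0x86 + 8 * n for n in range(8)})


def insn_length(mem, i):
    """Length in bytes of the instruction that would start at mem[i]."""
    op = mem[i]
    nxt = mem[i + 1] if i + 1 < len(mem) else 0
    if op in (0xDD, 0xFD):
        if nxt == 0xCB:
            return 4                              # DD CB d op
        base = 3 if nxt in THREE else 2 if nxt in TWO else 1
        return 1 + base + (1 if nxt in USES_HL_MEM else 0)
    if op == 0xED:
        return 4 if nxt in ED_FOUR else 2
    if op == 0xCB:
        return 2
    return 3 if op in THREE else 2 if op in TWO else 1


def is_instruction_start(mem, target, back=30):
    """Step forward from `back` bytes before target; True if we land exactly on it."""
    i = max(0, target - back)
    while i < target:
        i += insn_length(mem, i)
    return i == target
```

As with the assembly routine, this is a "most likely" answer: if the starting point falls inside data, the stepping can stay out of sync all the way to the target.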
Now that you have a routine to find code bytes, check the code bytes near that address (use NEXTCMD) and try to locate unlikely opcodes like:
LD B,B
LD D,D
RST #38
etc.
If you find them you have likely found data.
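Counting such opcodes is simple once the instruction starts are known. A small Python sketch, assuming you already have the list of opcode bytes (first byte of each instruction); the other "load a register into itself" opcodes are my own guess at what the "etc." covers, and the function name is made up.

```python
# Opcodes that real programs rarely contain. Only LD B,B, LD D,D and RST #38 come
# from the post above; the other LD r,r entries are an assumption.
UNLIKELY = {
    0x40: "LD B,B", 0x49: "LD C,C", 0x52: "LD D,D", 0x5B: "LD E,E",
    0x64: "LD H,H", 0x6D: "LD L,L", 0x7F: "LD A,A", 0xFF: "RST #38",
}


def unlikely_ratio(opcodes):
    """Fraction of the given opcode bytes that are in the UNLIKELY table."""
    if not opcodes:
        return 0.0
    hits = sum(1 for op in opcodes if op in UNLIKELY)
    return hits / len(opcodes)
```

A high ratio near the address you are checking points towards data; where to put the cut-off is something to tune by trying it on known code and known data.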
Now you can improve this routine... Make another table that tells you whether the opcode overwrites some register. Then try to find unlikely patterns like:
LD A,(#7364)
LD A,E (same register overwritten twice in a row)
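The register table idea might look like this in Python. It is a sketch only: the two tables cover just a handful of opcodes picked for the example (a real table would need every opcode), the input is assumed to be the opcode bytes of consecutive instructions, and the name `dead_writes` is made up.

```python
# Tiny illustrative tables: which register an opcode overwrites, and which it reads.
WRITES = {
    0x3A: "A",  # LD A,(nn)
    0x3E: "A",  # LD A,n
    0x7B: "A",  # LD A,E
    0x78: "A",  # LD A,B
    0x87: "A",  # ADD A,A
    0x06: "B",  # LD B,n
    0x47: "B",  # LD B,A
}
READS = {0x7B: {"E"}, 0x78: {"B"}, 0x87: {"A"}, 0x47: {"A"}}


def dead_writes(opcodes):
    """Indexes n where instruction n writes a register that n+1 overwrites unread."""
    hits = []
    for n in range(len(opcodes) - 1):
        first, second = opcodes[n], opcodes[n + 1]
        reg = WRITES.get(first)
        if reg and WRITES.get(second) == reg and reg not in READS.get(second, set()):
            hits.append(n)
    return hits
```

With the example from the post, `dead_writes([0x3A, 0x7B])` flags index 0, while LD A,n followed by ADD A,A is not flagged because the second instruction reads A before overwriting it.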
I hope this helps...
Also, you have a decent neural network between your ears. You can use that one
eh,eh....